3 Commits 848932cadf ... debcacff90

Author SHA1 Message Date
  Miron B. Kursa debcacff90 Merge branch 'vig' into devel 2 years ago
  Miron Bartosz Kursa 84d6fd4eed Vignette updates 2 years ago
  Miron Bartosz Kursa dfa32ca4b2 Merge branch 'devel' into vig 2 years ago
2 changed files with 18 additions and 13 deletions
  1. vignettes/inahurry.Rnw (+8 -9)
  2. vignettes/vig.bib (+10 -4)

+ 8 - 9
vignettes/inahurry.Rnw

@@ -12,7 +12,7 @@
 \setkeys{Gin}{width=\textwidth}
 
 \section{Overview}
-Boruta is a feature selection method; that is, it expects a standard information system you'd fed to a classifier, and judges which of the features are important and which are not.
+Boruta \cite{Kursa2010} is a feature selection method; that is, it expects a standard information system you'd feed to a classifier, and judges which of the features are important and which are not.
 Let's try it with a sample dataset, say \texttt{iris}.
 To make things interesting, we will add some nonsense features to see if they get filtered out; to this end, we randomly mix the order of elements in each of the original features, wiping out their interaction with the decision, \texttt{iris\$Species}.
 
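Though the construction of that extended set is not part of this diff, a rough sketch of how it might be put together is shown below; the object and chunk names here are made up for illustration only and do not reproduce the vignette's elided code.

<<nonsenseIrisSketch,eval=FALSE>>=
set.seed(17)
# Permuted copies of the original predictors: same marginal values,
# but any link to iris$Species is destroyed
nonsense<-data.frame(lapply(iris[,-5],sample))
names(nonsense)<-paste0("Nonsense",names(nonsense))
irisWithNonsense<-cbind(iris,nonsense)

library(Boruta)
Boruta(Species~.,data=irisWithNonsense)
@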
@@ -51,14 +51,14 @@ irisR<-cbind(
 Boruta(Species~.,data=irisR)
 @
 
-We see that \texttt{SpoilerFeature} has not removed any of the original features, despite it made them fully redundant.
+We see that \texttt{SpoilerFeature} has not supplanted any of the original features, despite making them fully redundant.
 One may wonder, however, why anyone would need something which is clearly redundant.
 There are basically three reasons behind this:
 \begin{itemize}
 \item One may perform feature selection to gain insight into which aspects of the phenomenon in question are important and which are not.
 In such a case, subtle effects possess substantial explanatory value, even if they are masked by stronger interactions.
 \item In some sets, especially those of the $p\gg n$ class, nonsense features may have spurious correlations with the decision, arising purely by chance.
- Such interactions may rival or even be stronger than the actual mechanisms of the underlying phenomenon, making them apparently redundant.
+ Such interactions may rival or even outweigh the actual mechanisms of the underlying phenomenon, making the truly relevant features appear redundant.
 An all-relevant approach won't magically distinguish between the two, but it will better preserve true patterns.
 \item Minimal-optimal methods generally cherry-pick features usable for classification, regardless of whether this usability is significant, which is an easy way to overfit.
 Boruta is much more robust in this regard.
@@ -68,7 +68,7 @@ There are basically three reasons behind this:
 
 Under the hood, Boruta uses feature importance scores provided by certain machine learning methods; in particular Random Forest \cite{Breiman2001}, which is used by default (via the \texttt{ranger} package implementation \cite{Wright2015}).
 Such scores only establish a ranking of features, though --- to separate the relevant ones, we need some reference for the distribution of importance of an irrelevant feature.
-To this end, Boruta uses \textit{shadow features} or \textit{shadows}, which are copies of original features but with randomly mixed values, so that their distribution remains the same yet importance is wiped out.
+To this end, Boruta uses \textit{shadow features} or \textit{shadows}, which are copies of original features but with randomly mixed values, so that their distribution remains the same yet their importance is wiped out.
 
 \begin{figure}
 <<BorutaPlots,fig=TRUE,echo=FALSE,results=hide,width=10,height=5>>=
@@ -79,10 +79,10 @@ plotImpHistory(BorutaOnIrisE)
 \caption{\label{fig:plots} The result of calling \texttt{plot} (left) and \texttt{plotImpHistory} (right) on the \texttt{BorutaOnIrisE} object.}
 \end{figure}
 
-Because the importance scoring is often stochastic and can be degraded due to a presence of shadows, the Boruta selection is a process.
+As importance scoring is often stochastic and can be degraded by the presence of shadows, Boruta selection is an iterative process.
 In each iteration, shadows are first generated, and the extended dataset is fed to the importance provider.
 The importance of each original feature is then compared with the highest importance among the shadows; those which score higher are given a \textit{hit}.
-Accumulated hit counts are finally assessed; features which significantly outperform best shadow are claimed confirmed, which these which significantly underperform best shadow are claimed rejected and removed from the set for subsequent iterations.
+Accumulated hit counts are finally assessed; features which significantly outperform the best shadow are deemed confirmed, while those which significantly underperform it are deemed rejected and removed from the set for all subsequent iterations.
 
 The algorithm stops when all features have an established decision, or when a pre-set maximum number of iterations (100 by default) is exhausted.
 In the latter case, the remaining features are deemed \textit{tentative}.
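A conceptual sketch of one such iteration is given below; it only illustrates the mechanism described above, and is not Boruta's actual implementation. The chunk name, the helper function and the use of plain \texttt{ranger} permutation importance are assumptions made here; Boruta's default importance source differs in its details.

<<iterationSketch,eval=FALSE>>=
library(ranger)
oneIteration<-function(x,y){
 # Shadows: permuted copies of every feature; same marginal
 # distribution, but no relation to the decision y
 shadows<-data.frame(lapply(x,sample))
 names(shadows)<-paste0("shadow_",names(x))
 extended<-cbind(x,shadows,decision=y)
 # Importance provider; here, ranger permutation importance
 model<-ranger(dependent.variable.name="decision",data=extended,
               importance="permutation")
 imp<-model$variable.importance
 bestShadow<-max(imp[grepl("^shadow_",names(imp))])
 # A hit for every original feature scoring above the best shadow
 imp[names(x)]>bestShadow
}
# Hits would be accumulated over many iterations and tested for being
# significantly more (or less) frequent than expected by chance
hits<-oneIteration(iris[,-5],iris$Species)
@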
@@ -91,7 +91,6 @@ The process can be observed live with \texttt{doTrace} argument set to 1 (report
 The graphical summary of a run can be obtained using \texttt{plot} and \texttt{plotImpHistory} on the Boruta result object, as shown in Figure~\ref{fig:plots} for the extended iris example.
 The first function uses boxplots to show the distribution of each feature's importance over the Boruta run, using colours to mark the final decision; it also draws boxplots (in blue) for the importance of the worst, average and best shadow in each iteration.
 The second function visualises the same data, but as a function of the iteration number.
-
 The summary of feature importance and hit counts can be extracted using the \texttt{attStats} convenience function.
 
 <<attStats>>=
@@ -121,10 +120,10 @@ Few things worth noting before using Boruta in production:
 \begin{itemize}
 \item Boruta is a heuristic; there are no strict guarantees about its output.
 Whenever possible, try to assess its results, especially in terms of selection stability, as classification accuracy may be deceiving \cite{Kursa2014}.
-\item For datasets with lots of features, the default configuration of the importance source is likely insufficient; in the particular case of Random Forest often the number of trees is not enough to allow the importance scores to stabilise, which in turn often leads to false negatives and unstable results.
+\item For datasets with lots of features, the default configuration of the importance source is likely insufficient; in the particular case of Random Forest the number of trees is often not large enough to allow the importance scores to stabilise, which in turn often leads to false negatives and unstable results.
 \item Boruta is a strictly serial algorithm, and spends most of its time waiting for the importance provider --- hence, tweaking this element offers the best chance of speeding up the selection.
 If speed is a concern, one should also avoid the formula interface and directly pass the predictor and decision parts of the information system, as in the sketch after this list.
-\item Elimination of tentative features becomes practically impossible if the turn out to have very similar distribution as the best shadow, and the presence of such does not make the overall Boruta result useless.
+\item Elimination of tentative features becomes practically impossible if they turn out to have an importance distribution very similar to that of the best shadow; the presence of such features does not make the overall Boruta result useless.
 \item Importance history for bigger problems may take an impractically huge amount of memory; hence, its collection can be turned off with the \texttt{holdHistory} argument of the \texttt{Boruta} function.
 This will disable some functionality, though, most notably plotting.
 \item Treatment of missing values and non-standard decision forms (like survival problems) depends on the capabilities of the importance source.
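Several of the tips above can be combined in a single call; the sketch below is only indicative. Here \texttt{predictors} and \texttt{decision} are placeholder objects, and passing \texttt{num.trees} through to \texttt{ranger} via the default importance source is an assumption of this sketch.

<<productionSketch,eval=FALSE>>=
library(Boruta)
res<-Boruta(
 x=predictors,       # pass the parts directly, skipping the formula interface
 y=decision,
 doTrace=1,          # report each decision as it is made
 holdHistory=FALSE,  # skip importance history (saves memory, disables plots)
 maxRuns=300,        # more iterations, giving tentative features a chance to resolve
 num.trees=1000      # assumed to reach ranger, stabilising importance scores
)
@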

+ 10 - 4
vignettes/vig.bib

@@ -10,7 +10,6 @@ year = {2007}
 author = {Wright, Marvin N. and Ziegler, Andreas},
 journal = {Journal of Statistical Software},
 number = {1},
-pages = {1--17},
 title = {{ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R}},
 volume = {77},
 year = {2015}
@@ -27,17 +26,24 @@ year = {2001}
 @article{Kursa2014a,
 author = {Kursa, Miron B.},
 journal = {Journal of Statistical Software},
-keywords = {Classification,Machine learning,R,Random ferns},
 number = {10},
-title = {{rFerns : An Implementation of the Random Ferns Method for General-Purpose Machine Learning}},
+title = {{rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning}},
 volume = {61},
 year = {2014}
 }
 @article{Kursa2014,
-author = {Kursa, Miron Bartosz},
+author = {Kursa, Miron B.},
 journal = {BMC Bioinformatics},
 number = {1},
 title = {{Robustness of Random Forest-based gene selection methods}},
 volume = {15},
 year = {2014}
 }
+@article{Kursa2010,
+author = {Kursa, Miron B. and Rudnicki, Witold R.},
+journal = {Journal of Statistical Software},
+title = {{Feature Selection with the Boruta Package}},
+volume = {36},
+number = {11},
+year = {2010}
+}