123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135 |
- \documentclass{article}
- %\VignetteIndexEntry{Boruta for those in a hurry}
- \usepackage[utf8]{inputenc}
- \title{Boruta for those in a hurry}
- \author{Miron B. Kursa}
- \begin{document}
- \bibliographystyle{plain}
- \maketitle
- \setkeys{Gin}{width=\textwidth}
- \section{Overview}
- Boruta \cite{Kursa2010} is a feature selection method; that is, it expects a standard information system you'd fed to a classifier, and judges which of the features are important and which are not.
- Let's try it with a sample dataset, say \texttt{iris}.
- To make things interesting, we will add some nonsense features to see if they get filtered out; to this end, we randomly mix the order of elements in each of the original features, wiping out its interaction with the decision, \texttt{iris\$Species}.
- <<setGeneration>>=
- set.seed(17)
- data(iris)
- irisE<-cbind(
- setNames(
- data.frame(apply(iris[,-5],2,sample)),
- sprintf("Nonsense%d",1:4)
- ),
- iris
- )
- @
- \noindent Now, time for Boruta:
- <<Boruta>>=
- library(Boruta)
- Boruta(Species~.,data=irisE)->BorutaOnIrisE
- BorutaOnIrisE
- @
- As one can see, the method had \textit{rejected} nonsense features and \textit{confirmed} (retained) the original ones, as it was to be expected.
- What is important is that Boruta does a sharp classification of features rather than ordering, which is in contrast to many other feature selection methods.
- The other substantial difference is that Boruta is an \textit{all relevant} method, hence aims to find all features connected with the decision --- most other methods are of a \textit{minimal optimal} class, consequently aims to provide a possibly compact set of features which carry enough information for a possibly optimal classification on the reduced set \cite{Nilsson2007}.
- What does it mean in practice is that Boruta will include redundant features, that is ones which carry information already contained in other features.
- As an example, let's add a feature which contains all the information in the decision in a most accessible form --- namely, a copy of the decision, and push it into Boruta.
- <<BorutaReduendancy>>=
- irisR<-cbind(
- irisE,
- SpoilerFeature=iris$Species
- )
- Boruta(Species~.,data=irisR)
- @
- We see that \texttt{SpoilerFeature} has not supplanted any of the original features, despite making them fully redundant.
- One may wonder, however, how came anyone would need something which is clearly redundant?
- There are basically three reasons behind this:
- \begin{itemize}
- \item One may perform feature selection for an insight in which aspects of the phenomenon in question are important are which are not.
- In such case subtle effects possess substantial explanatory value, even if they are masked by stronger interactions.
- \item In some sets, especially of a $p\gg n$ class, nonsense features may have spurious correlations with the decision, arisen purely by chance.
- Such interactions may rival or even be stronger than actual mechanisms of the underlying phenomenon, making them apparently redundant.
- All relevant approach won't magically help distinguish both, but will better preserve true patterns.
- \item Minimal optimal methods are generally cherry-picking features usable for classification, regardless if this usability is significant or not, which is an easy way to overfitting.
- Boruta is much more robust in this manner.
- \end{itemize}
- \section{Mechanism}
- Under the hood, Boruta uses feature importance scores which are provided by certain machine learning methods; in particular Random Forest \cite{Breiman2001}, which happens to be used by default (using the \texttt{ranger} package \cite{Wright2015} implementation).
- Such scores only contribute to the ranking of features, though --- to separate relevant features, we need some reference of what is a distribution of importance of an irrelevant feature.
- To this end, Boruta uses \textit{shadow features} or \textit{shadows}, which are copies of original features but with randomly mixed values, so that their distribution remains the same yet their importance is wiped out.
- \begin{figure}
- <<BorutaPlots,fig=TRUE,echo=FALSE,results=hide,width=10,height=5>>=
- par(mfrow=c(1,2))
- plot(BorutaOnIrisE)
- plotImpHistory(BorutaOnIrisE)
- @
- \caption{\label{fig:plots} The result of calling \texttt{plot} (left) and \texttt{plotImpHistory} (right) on the \texttt{BorutaOnIrisE} object.}
- \end{figure}
- As importance scoring is often stochastic and can be degraded due to a presence of shadows, the Boruta selection is a process.
- In each iteration, first shadows are generated, and such extended dataset is fed to an importance provider.
- Original features' importance is then compared with the highest importance of a shadow; and these which score higher are given a \textit{hit}.
- Accumulated hit counts are finally assessed; features which significantly outperform best shadow are claimed confirmed, which these which significantly underperform best shadow are claimed rejected and removed from the set for all subsequent iterations.
- The algorithm stops when all features have an established decision, or when a pre-set maximal number of iterations (100 by default) is exhausted.
- In the latter case, the remaining features are claimed \textit{tentative}.
- The process can be observed live with \texttt{doTrace} argument set to 1 (report after each decision), 2 (report after each iteration) or 3 (also report hits); importances in each iteration are also stored in the \texttt{ImpHistory} element of the Boruta object.
- The graphical summary of a run can be obtained using \texttt{plot} and \texttt{plotImpHistory} on the Boruta result object, as shown on Figure~\ref{fig:plots} for the extended iris example.
- First function uses boxplots to show the distribution of features' importance over Boruta run, using colours to mark final decision; it also draws boxplots for the importance of worst, average and best shadow in each iterations (in blue).
- Second function visualises the same data, but as a function of the iteration number.
- The summary of feature importance and hit counts can be extracted using the \texttt{attStats} convenience function.
- <<attStats>>=
- attStats(BorutaOnIrisE)
- @
- \section{Importance sources}
- Building Random Forest multiple times on substantially enlarged dataset may easily become very time consuming, especially for larger sets for which using Boruta makes most sense.
- It is also possible that RF importance is not best suited to catch the information content of features because of the dataset specifics.
- Anyhow, Boruta allows you to switch importance source to an arbitrary function which gets an information system and returns a vector of numeric importance scores of all features.
- The package already includes adapters for several importance scorers and their various configurations; all start with a \texttt{getImp} prefix.
- In particular, there is one for rFerns \cite{Kursa2014a}, an implementation of random ferns, a purely stochastic ensemble classifier which can usually provide similar importance scores as Random Forest, but in substantially shorter time:
- <<BorutaFe>>=
- library(rFerns)
- Boruta(Species~.,data=irisE,getImp=getImpFerns)
- @
- You can pass arguments to the importance provider by providing it to the Boruta call; for instance, ranger, the default importance provider, makes use of all available CPU threads, won't always be the optimal choice.
- Setting \texttt{num.threads} in the Boruta call will cause it to relay this argument to the \texttt{ranger} function, and hence limit the training process parallelism.
- \section{Caveats}
- Few things worth noting before using Boruta in production:
- \begin{itemize}
- \item Boruta is a heuristic; there are no strict guarantees about its output.
- Whenever possible, try to assess its results, especially in terms of selection stability as classification accuracy may be deceiving \cite{Kursa2014}.
- \item For datasets with lots of features, the default configuration of the importance source is likely insufficient; in the particular case of Random Forest the number of trees is often not large enough to allow the importance scores to stabilise, which in turn often leads to false negatives and unstable results.
- \item Boruta is a strictly serial algorithm, and spends most time waiting for the importance provider --- hence, tweaking this element brings best chance to speed up the selection.
- If speed is a concern, one should also avoid the formula interface and directly pass predictor and decision parts of the information system.
- \item Elimination of tentative features becomes practically impossible if they turn out to have very similar importance distribution to the best shadow, and the presence of such does not make the overall Boruta result useless.
- \item Importance history for bigger problems may take impractically huge amount of memory; hence its collection can be turned off with \texttt{holdHistory} argument of the \texttt{Boruta} function.
- This will disable some functionality, though, most notably plotting.
- \item Treatment of missing values and non-standard decision forms (like survival problems) depends on the capacity of the information source.
- \item The original Boruta paper describes the 1.0 version, and algorithm has undergone substantial changes since then, namely the initial, warm-up rounds were removed, the multiple testing correction was introduced, finally the nomenclature has been clarified.
- \end{itemize}
- \bibliography{vig}
- \end{document}
|