Boruta is an all relevant feature selection method, while most others are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some classifier has a minimal error. Here is a paper with the details.
For a start, when you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (yes, the minimal optimal set of features by definition depends on your classifier choice).
So you also care about having a robust model; in p≫n problems, one can usually cherry-pick a nonsense subset of features which yields good or even perfect classification – minimal optimal methods can easily get deceived by that, leaving you with an overfitted model and no sign that something is wrong. See this or that for an example.
Boruta does not work this way; it is a feature selection method, not a feature ranking method. This is actually better, because selection is what you always need at the end of the day, and Boruta solves the problem of choosing a reasonable N in your top-N for you.
Sure; Boruta probably has a higher chance of taking multi-feature associations into account, without the computational cost and the loss of statistical significance that come with an exhaustive search for them. But it is not guaranteed to.
No, obviously; the all relevant problem requires a practically infinite number of objects / samples to be solved exactly, and Boruta is only a humble, heuristic approximation. With a limited sample, a truly irrelevant feature may always become indistinguishable from a truly relevant one by pure chance. Remember about proper testing, and that nothing is more effective than getting more, better quality data.
Those for which Boruta could not decide whether they are relevant or not. You can treat them as confirmed or rejected depending on the use case; increasing the number of iterations also helps to reduce their number. In case you desperately want to get rid of them, the R version has the Boruta::TentativeRoughFix function, which applies a pretty naive heuristic to re-classify them into confirmed and rejected based on how they scored during the Boruta run.
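To give a feeling of how naive such a fix can be, here is a small Python sketch. It assumes the heuristic is a comparison of each tentative feature's median importance with the median importance of the best shadow, which is how I read the R documentation; treat that as an assumption rather than the exact implementation. The function name rough_fix and the toy numbers are made up for this example.

```python
import statistics


def rough_fix(history, shadow_max, decisions):
    """Re-label Tentative features by comparing median importances.

    history: feature name -> importances recorded over the run
    shadow_max: importance of the best shadow in each iteration
    decisions: feature name -> "Confirmed", "Rejected" or "Tentative"
    """
    # Assumed rule: a feature must beat the median of the best shadow.
    threshold = statistics.median(shadow_max)
    fixed = dict(decisions)
    for name, status in decisions.items():
        if status == "Tentative":
            if statistics.median(history[name]) > threshold:
                fixed[name] = "Confirmed"
            else:
                fixed[name] = "Rejected"
    return fixed


# Toy importance history; the values are invented for illustration.
history = {"a": [5.1, 4.8, 5.5], "b": [1.0, 1.4, 0.9], "c": [2.2, 2.6, 2.4]}
shadow_max = [2.0, 2.3, 1.9]
decisions = {"a": "Confirmed", "b": "Tentative", "c": "Tentative"}
fixed = rough_fix(history, shadow_max, decisions)
print(fixed)
```

Features that were already decided are left untouched; only the tentative ones are pushed to one side or the other.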
Great question! I like to use the self-consistency of the selection, i.e. an analysis of how often certain attributes are selected over many applications of a tested FS method on somehow disturbed versions of the training data, for instance bootstrapped ones (see this for an example).
Passing this test won't guarantee that the method is good, but failing it certainly shows that it is bad. The best option is obviously to compare with some gold side knowledge, but it may be hard or impossible to acquire or tricky to assess. If you really need to use classification accuracy, do a nested cross-validation.
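The self-consistency check can be sketched in plain Python as follows. This is only an illustration: selection_frequency and toy_select are hypothetical names, and toy_select is a deliberately simple stand-in (a difference of class means) for whatever FS method you actually want to test.

```python
import random


def selection_frequency(rows, select, rounds=100, seed=0):
    """Fraction of bootstrap resamples in which each feature index is selected."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    counts = [0] * n_features
    for _ in range(rounds):
        # Bootstrap: draw as many objects as in the original set, with replacement.
        sample = [rng.choice(rows) for _ in rows]
        for j in select(sample):
            counts[j] += 1
    return [c / rounds for c in counts]


def toy_select(sample, threshold=0.5):
    """Hypothetical selector: keep features whose class means differ enough."""
    selected = set()
    for j in range(len(sample[0][0])):
        pos = [x[j] for x, label in sample if label == 1]
        neg = [x[j] for x, label in sample if label == 0]
        if pos and neg and abs(sum(pos) / len(pos) - sum(neg) / len(neg)) > threshold:
            selected.add(j)
    return selected


# Toy data: each row is (feature tuple, class label).
rng = random.Random(42)
rows = []
for i in range(100):
    label = i % 2
    # Feature 0 is shifted by the class label; feature 1 is pure noise.
    rows.append(((rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)), label))

freq = selection_frequency(rows, toy_select)
print(freq)
```

A stable method should select the same features in nearly every resample; features that flicker in and out are a warning sign.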
It is actually pretty trivial, no magic at all. The core idea is that a feature that is not relevant is not more useful for classification than its version with a permuted order of values. To this end, Boruta extends the given dataset with such permuted copies of all features (which we call shadows), applies some feature importance measure (aka VIM; Boruta can use any, the default is Random Forest's MDA) and checks which features are better and which are worse than the most important shadow.
The next important idea is based on the observation that VIMs get more accurate with fewer irrelevant features present; thus, Boruta applies the above test iteratively, constantly removing features which it strongly believes are irrelevant. The details are here.
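The shadow test described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the absolute Pearson correlation stands in for a real VIM such as Random Forest MDA, the helper names (pearson_abs, shadow_hits, count_hits) are made up for this example, and both the statistical decision on the hit counts and the iterative removal of rejected features are left out.

```python
import random
import statistics


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; a toy stand-in for a real VIM."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_hits(features, y, rng):
    """One round: names of features scoring above the best shadow."""
    # Shadows are copies of the features with values shuffled across objects.
    shadows = {}
    for name, column in features.items():
        shuffled = list(column)
        rng.shuffle(shuffled)
        shadows[name] = shuffled
    best_shadow = max(pearson_abs(col, y) for col in shadows.values())
    return {name for name, col in features.items() if pearson_abs(col, y) > best_shadow}


def count_hits(features, y, rounds=50, seed=1):
    """Repeat the shadow test and count how often each feature wins."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        for name in shadow_hits(features, y, rng):
            hits[name] += 1
    return hits


# Toy data: one feature drives the response, two are pure noise.
rng = random.Random(0)
n = 200
signal = [rng.gauss(0, 1) for _ in range(n)]
features = {
    "relevant": signal,
    "noise_a": [rng.gauss(0, 1) for _ in range(n)],
    "noise_b": [rng.gauss(0, 1) for _ in range(n)],
}
y = [s + rng.gauss(0, 0.5) for s in signal]
hits = count_hits(features, y)
print(hits)
```

The relevant feature should beat the best shadow in every round, while the noise features do so only occasionally, by chance.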
Sad to hear that; but before moving to a different method, you may want to check whether the VIM you are using is stable, as using an unstable one is a common error. For instance, in the default state, R Boruta uses RF MDA importance with ranger's default number of trees, equal to 500; this is a very small number, suitable only for datasets with a few features.
In case you are doing a benchmark by adding lots of noisy features to some dataset or just happen to have a p≫n problem, note that false positives are expected and inevitable due to how likely it is to generate random associations in such a set-up.
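If you want to see how easily random associations appear, the following plain-Python simulation may help. It is a sketch with made-up names (abs_corr, best_noise_score), and it uses the absolute correlation as a crude proxy for importance: it generates a random response and pure-noise features, and reports the strongest association found.

```python
import random
import statistics


def abs_corr(xs, ys):
    """Absolute Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def best_noise_score(n_objects, n_features, seed=0):
    """Largest |correlation| between a random response and pure-noise features."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_objects)]
    return max(
        abs_corr([rng.gauss(0, 1) for _ in range(n_objects)], y)
        for _ in range(n_features)
    )


# Same 20 objects and same seed; only the number of noise features changes.
few = best_noise_score(n_objects=20, n_features=5)
many = best_noise_score(n_objects=20, n_features=2000)
print(few, many)
```

With thousands of noise features and only 20 objects, the best noise feature looks convincingly associated with the response, even though nothing here is relevant by construction.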
You may also want to check if you are describing your problem with features that have a potential to be relevant and are adjusted to the characteristics of your VIM source. Be sure that you don't have spoiler features which may uniquely index your objects, like timestamps or IDs.
Well, it is certainly not as fast as a simple correlation or a single VIM run, but remember that it is trying to solve a tough problem. However, ensure that you are not using the formula interface and that you have enough RAM. You may also wish to turn off model saving in your VIM (which is already done in Boruta's default VIM provider wrappers) and Boruta VIM history (holdHistory=FALSE). Finally, you can try to use a faster VIM source, like for instance rFerns (also this), and/or a VIM that allows parallel computation (both R Boruta, since version 5.0, and Python Boruta by default use Random Forest MDA VIM provided by an RF implementation which can easily utilise multiple local cores; respectively ranger and scikit-learn). Note that it makes no sense to make a parallel version of Boruta on its own; it is strictly single-threaded and it won't change.
You may want to track Boruta progress live; in case of the R version, you can achieve this by setting the doTrace argument to 1 or 2 (on Windows, you also have to turn off output buffering).
For the sake of this page as well as Boruta strings and docs, none.
Boruta is a Slavic spirit of the forest, and the first version of Boruta was a wrapper over the Random Forest method.