Boruta FAQ

Boruta is an all-relevant feature selection method, invented by Witold R. Rudnicki and developed by Miron B. Kursa at the ICM UW.

This repository contains the code of a reference implementation, which is an R package and lives on CRAN. There is also a Python implementation by Daniel Homola.

The method was introduced in this JSS paper, and is even used by other people.

So, what's so special about Boruta?

It is an all-relevant feature selection method, while most others are minimal-optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some classifier has a minimal error. Here is a paper with the details.

Why should I care?

For a start, when you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (yes, a minimal-optimal set of features by definition depends on your choice of classifier).

But I only care about good classification accuracy!

So you also care about having a robust model; in p≫n problems, one can usually cherry-pick a nonsense subset of features which yields good or even perfect classification – minimal optimal methods can easily get deceived by that, leaving you with an overfitted model and no sign that something is wrong. See this or that for an example.
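To see how easily chance associations appear when p≫n, here is a minimal sketch using only the Python standard library. It generates pure-noise features for a handful of samples with fixed, balanced labels, and reports the strongest correlation any noise feature has with those labels. The function names, sample sizes and seed are assumptions made for this illustration; nothing here is part of Boruta.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def max_spurious_correlation(n_samples=20, n_features=2000, seed=42):
    """Largest |correlation| between any pure-noise feature and balanced labels."""
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [0] * half + [1] * (n_samples - half)
    best = 0.0
    for _ in range(n_features):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(noise, labels)))
    return best

# Every feature is noise, yet the best one looks clearly "predictive".
print(max_spurious_correlation())
```

A method that simply keeps the best-looking features would happily pick such a column, and accuracy measured on the same data would not reveal the problem.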

I want top-N best features, but this junk is only giving me M confirmed and K tentative, whatever that means?!

It does not work this way; Boruta is a feature selection method, not a feature ranking method. This is actually better, because selection is what you always need at the end of the day, and Boruta solves the problem of choosing a reasonable N in your top-N for you.

How is Boruta better than my favourite mutual-information / causal supercorrelation-based / dependency testing method? It is all relevant as well!

Sure; Boruta probably has a higher chance of taking multi-feature associations into account, without the computational cost and the loss of statistical significance that come with searching for them exhaustively. But it is not guaranteed to.

Is Boruta a silver bullet for all my feature selection needs?

No, obviously; the all-relevant problem requires a practically infinite number of objects / samples to be solved exactly, and Boruta is only a humble, heuristic approximation. With a limited sample, some truly irrelevant features may always become indistinguishable from truly relevant ones by pure chance. Remember about proper testing, and that nothing is more effective than getting more, better quality data.

What are tentative features?

Those for which Boruta could not decide whether they are relevant or not. You can treat them as confirmed or rejected depending on the use case; increasing the number of iterations also helps to reduce their number. In case you desperately want to get rid of them, the R version has the Boruta::TentativeRoughFix function, which applies a pretty naive heuristic to re-classify them into confirmed and rejected based on how they scored during the Boruta run.

You say that accuracy of a model made on a selection of features is not a good indicator of its quality… How shall I assess feature selection then?

Great question! I like to use the self-consistency of the selection, i.e. an analysis of how often certain attributes are selected over many applications of the tested FS method to somehow disturbed versions of the training data, for instance bootstrapped ones (see this for an example).

Passing this test won't guarantee that the method is good, but failing it certainly shows that it is bad. The best option is obviously to compare with some gold side knowledge, but it may be hard or impossible to acquire or tricky to assess. If you really need to use classification accuracy, do a nested cross-validation.
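The self-consistency check described above can be sketched roughly as follows, again with the standard library only. The `toy_select` function is a hypothetical stand-in for whatever FS method is being tested (here a naive correlation threshold, not Boruta), and `selection_frequency`, the number of resamples and the seeds are likewise assumptions of this example.

```python
import random
import statistics
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def toy_select(rows, target, threshold=0.3):
    """Hypothetical FS method: keep columns whose |correlation| exceeds threshold."""
    return [j for j in range(len(rows[0]))
            if abs(pearson([r[j] for r in rows], target)) > threshold]

def selection_frequency(rows, target, select, n_resamples=50, seed=0):
    """Fraction of bootstrap resamples in which each feature index gets selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = Counter()
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        counts.update(select([rows[i] for i in idx], [target[i] for i in idx]))
    return [counts[j] / n_resamples for j in range(len(rows[0]))]

# Toy data: column 0 follows the target closely, columns 1-4 are pure noise.
rng = random.Random(1)
target = [rng.gauss(0, 1) for _ in range(100)]
rows = [[t + 0.3 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)]
        for t in target]
freq = selection_frequency(rows, target, toy_select)
print(freq)
```

Features selected in nearly every resample are stable choices; a method whose selections jump around between resamples fails the test.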

So, how does it work? Is magic involved?

It is actually pretty trivial, no magic at all. The core idea is that a feature that is not relevant is no more useful for classification than a version of it with a permuted order of values. To this end, Boruta extends the given dataset with such permuted copies of all features (which we call shadows), applies some feature importance measure (aka VIM; Boruta can use any, the default is Random Forest's MDA) and checks which features are better and which are worse than the most important shadow.
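A single round of the shadow comparison might look like the sketch below. To stay dependency-free it uses absolute Pearson correlation as a stand-in importance measure, which scores each column on its own; a real VIM such as Random Forest importance is computed on the extended dataset as a whole. The names `abs_correlation` and `shadow_round` are made up for this example and are not part of any Boruta implementation.

```python
import random
import statistics

def abs_correlation(xs, ys):
    """Stand-in importance measure: absolute Pearson correlation with the target."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / (sxx * syy) ** 0.5

def shadow_round(features, target, rng):
    """One round: returns (names of features beating the best shadow, best shadow score)."""
    shadows = {}
    for name, values in features.items():
        permuted = list(values)
        rng.shuffle(permuted)  # permuting breaks any link between column and target
        shadows[name] = permuted
    importance = {n: abs_correlation(v, target) for n, v in features.items()}
    max_shadow = max(abs_correlation(v, target) for v in shadows.values())
    hits = {n for n, imp in importance.items() if imp > max_shadow}
    return hits, max_shadow

# Toy data: one informative column plus five noise columns.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
target = [s + 0.1 * rng.gauss(0, 1) for s in signal]
features = {"signal": signal}
for i in range(5):
    features[f"noise{i}"] = [rng.gauss(0, 1) for _ in range(200)]

hits, max_shadow = shadow_round(features, target, rng)
print(sorted(hits), round(max_shadow, 3))
```

A noise column can still beat the best shadow in any single round by luck, which is why one round is not enough to decide.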

The next important idea is based on the observation that VIMs get more accurate with fewer irrelevant features present; thus, Boruta applies the above test iteratively, constantly removing features which it strongly believes are irrelevant. The details are here.
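One way to make "strongly believes" concrete is to count, over many rounds, how often a feature beats the best shadow, and to compare that count against a coin flip. The sketch below does this with a binomial test and a Bonferroni correction; the null probability of 0.5, the `alpha` level, and the names `decide` and `n_tests` are assumptions of this illustration, and the reference implementation may differ in its details.

```python
from math import comb

def binom_tail_ge(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs

def binom_tail_le(hits, runs):
    """P(X <= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(0, hits + 1)) / 2 ** runs

def decide(hits, runs, n_tests=1, alpha=0.01):
    """Classify a feature from its hit count; n_tests applies a Bonferroni correction."""
    threshold = alpha / n_tests
    if binom_tail_ge(hits, runs) < threshold:
        return "confirmed"   # beats the best shadow far more often than chance
    if binom_tail_le(hits, runs) < threshold:
        return "rejected"    # loses to the best shadow far more often than chance
    return "tentative"       # not enough evidence either way yet

print(decide(20, 20), decide(0, 20), decide(10, 20), decide(5, 5))
```

Note that after only five rounds even a feature that won every time stays tentative under these settings, which is one reason more iterations reduce the number of tentative features.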

Boruta is wrong!

Sad to hear that; but before moving to a different method, you may want to check whether the VIM you are using is stable, as an unstable VIM is a common mistake. For instance, in the default state, R Boruta uses RF MDA importance with ranger's default number of trees, which is 500; this is a very small number, suitable only for datasets with a few features.

In case you are doing a benchmark by adding lots of noisy features to some dataset or just happen to have a p≫n problem, note that false positives are expected and inevitable due to how likely it is to generate random associations in such a set-up.

You may also want to check whether you are describing your problem with features that have the potential to be relevant and are adjusted to the characteristics of your VIM source. Be sure that you don't have spoiler features which may uniquely index your objects, like timestamps or IDs.

Boruta is slow!

Well, it is certainly not as fast as a simple correlation or a single VIM run, but remember that it is trying to solve a tough problem. Still, make sure that you are not using the formula interface and that you have enough RAM. You may also wish to turn off model saving in your VIM (which is already done in Boruta's default VIM provider wrappers) and Boruta's VIM history (holdHistory=FALSE). Finally, you can try a faster VIM source, for instance rFerns (also this), and/or a VIM that allows parallel computation (both R Boruta, since version 5.0, and Python Boruta by default use the Random Forest MDA VIM provided by an RF implementation which can easily utilise multiple local cores; respectively ranger and scikit-learn). Note that it makes no sense to make a parallel version of Boruta on its own; it is strictly single-threaded and that won't change.

You may want to track Boruta's progress live; in the R version you can achieve this by setting the doTrace argument to 1 or 2 (on Windows, you also have to turn off output buffering).

What is the difference between attribute, feature and variable?

For the sake of this page as well as Boruta strings and docs, none.

Why such a strange name?

Boruta is a Slavic spirit of the forest, and the first version of Boruta was a wrapper over the Random Forest method.