% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Boruta.R
\name{Boruta}
\alias{Boruta}
\alias{Boruta.default}
\alias{Boruta.formula}
\title{Feature selection with the Boruta algorithm}
\usage{
Boruta(x, ...)

\method{Boruta}{default}(x, y, pValue = 0.01, mcAdj = TRUE,
  maxRuns = 100, doTrace = 0, holdHistory = TRUE,
  getImp = getImpRfZ, ...)

\method{Boruta}{formula}(formula, data = .GlobalEnv, ...)
}
\arguments{
\item{x}{data frame of predictors.}
\item{...}{additional parameters passed to \code{getImp}.}
\item{y}{response vector; a factor for classification, a numeric vector for regression, or a \code{Surv} object for survival analysis (support depends on the capabilities of the importance adapter).}
\item{pValue}{confidence level, given as the p-value threshold of the importance test. The default value should be used.}
\item{mcAdj}{if set to \code{TRUE}, a multiple comparisons adjustment using the Bonferroni method will be applied. The default value should be used; older (1.x and 2.x) versions of Boruta were effectively using \code{FALSE}.}
\item{maxRuns}{maximal number of importance source runs.
You may increase it to resolve attributes left Tentative.}
\item{doTrace}{verbosity level. 0 means no tracing; 1 means reporting the decision about each attribute as soon as it is justified; 2 means the same as 1, plus reporting each importance source run; 3 means the same as 2, plus reporting the hits assigned to yet undecided attributes.}
\item{holdHistory}{if set to \code{TRUE}, the full history of importance is stored and returned as the \code{ImpHistory} element of the result.
Setting it to \code{FALSE} reduces the memory footprint of Boruta when this side data is not needed, especially when the number of attributes is huge; however, it also disables plotting of the resulting \code{Boruta} objects and the use of the \code{\link{TentativeRoughFix}} function.}
\item{getImp}{function used to obtain attribute importance.
The default is \code{getImpRfZ}, which runs a random forest from the \code{ranger} package and gathers Z-scores of the mean decrease accuracy measure.
It should return a numeric vector of a length identical to the number of columns of its first argument, containing the importance measure of the respective attributes.
Any order-preserving transformation of this measure will yield the same result.
It is assumed that more important attributes get higher importance. Infinite values (\code{Inf} and \code{-Inf}) are accepted; \code{NaN}s and \code{NA}s are treated as zeros, with a warning.}
\item{formula}{alternatively, a formula describing the model to be analysed.}
\item{data}{data frame or environment in which to interpret the formula.}
}
\value{
An object of class \code{Boruta}, which is a list with the following components:
\item{finalDecision}{a factor with three levels: \code{Confirmed}, \code{Rejected} or \code{Tentative}, containing the final result of feature selection.}
\item{ImpHistory}{a data frame of importances of attributes gathered in each importance source run.
Besides the predictors' importances, it contains the maximal, mean and minimal importance of shadow attributes in each run.
Rejected attributes get \code{-Inf} importance.
Set to \code{NULL} if \code{holdHistory} was set to \code{FALSE}.}
\item{timeTaken}{time taken by the computation.}
\item{impSource}{string describing the source of importance, equal to the comment attribute of the \code{getImp} argument.}
\item{call}{the original call of the \code{Boruta} function.}
}
\description{
Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest.
The method performs a top-down search for relevant features by comparing the importance of original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilise that test.
}
\details{
Boruta iteratively compares the importances of attributes with the importances of shadow attributes, created by shuffling the original ones.
Attributes that have significantly worse importance than the shadow ones are successively dropped.
On the other hand, attributes that are significantly better than the shadows are deemed Confirmed.
Shadows are re-created in each iteration.
The algorithm stops when only Confirmed attributes are left, or when it reaches \code{maxRuns} importance source runs.
If the second scenario occurs, some attributes may be left without a decision.
They are marked Tentative.
You may try to extend \code{maxRuns} or lower \code{pValue} to clarify them, but in some cases their importances fluctuate too much for Boruta to converge.
Alternatively, you can use the \code{\link{TentativeRoughFix}} function, which performs a different, weaker test to make the final decision, or simply treat them as undecided in further analysis.
}
\examples{
set.seed(777)
#Boruta on the "small redundant XOR" problem; read ?srx for details
data(srx)
Boruta(Y~.,data=srx)->Boruta.srx
#Results summary
print(Boruta.srx)
#Result plot
plot(Boruta.srx)
#Attribute statistics
attStats(Boruta.srx)
#Using alternative importance source, rFerns
Boruta(Y~.,data=srx,getImp=getImpFerns)->Boruta.srx.ferns
print(Boruta.srx.ferns)
#Verbose
Boruta(Y~.,data=srx,doTrace=2)->Boruta.srx
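#Inspecting the components of the result; a minimal sketch that relies only
#on the elements listed in the Value section
#Number of Confirmed, Rejected and Tentative attributes
table(Boruta.srx$finalDecision)
#Names of the confirmed attributes
getSelectedAttributes(Boruta.srx)
#Importance source and computation time
Boruta.srx$impSource
Boruta.srx$timeTaken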
\dontrun{
#Boruta on the iris problem extended with artificial irrelevant features
#Generate said features
iris.extended<-data.frame(iris,apply(iris[,-5],2,sample))
names(iris.extended)[6:9]<-paste("Nonsense",1:4,sep="")
#Run Boruta on this data
Boruta(Species~.,data=iris.extended,doTrace=2)->Boruta.iris.extended
#Nonsense attributes should be rejected
print(Boruta.iris.extended)
}
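\dontrun{
#Dealing with attributes left Tentative; a sketch of the options named in
#the Details section, reusing iris.extended from the example above
#The maxRuns values below are arbitrary illustrative choices
#A short run, which is more likely to leave some attributes undecided
Boruta(Species~.,data=iris.extended,maxRuns=20)->Bor.short
print(Bor.short)
#Option 1: allow more importance source runs
Boruta(Species~.,data=iris.extended,maxRuns=500)->Bor.long
print(Bor.long)
#Option 2: settle the remaining attributes with the weaker test
TentativeRoughFix(Bor.short)->Bor.fixed
print(Bor.fixed)
}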
\dontrun{
#Boruta on the HouseVotes84 data from mlbench
library(mlbench); data(HouseVotes84)
na.omit(HouseVotes84)->hvo
#Takes some time, so be patient
Boruta(Class~.,data=hvo,doTrace=2)->Bor.hvo
print(Bor.hvo)
plot(Bor.hvo)
plotImpHistory(Bor.hvo)
}
\dontrun{
#Boruta on the Ozone data from mlbench
library(mlbench); data(Ozone)
library(randomForest)
na.omit(Ozone)->ozo
Boruta(V4~.,data=ozo,doTrace=2)->Bor.ozo
cat('Random forest run on all attributes:\\n')
print(randomForest(V4~.,data=ozo))
cat('Random forest run only on confirmed attributes:\\n')
print(randomForest(ozo[,getSelectedAttributes(Bor.ozo)],ozo$V4))
}
\dontrun{
#Boruta on the Sonar data from mlbench
library(mlbench); data(Sonar)
#Takes some time, so be patient
Boruta(Class~.,data=Sonar,doTrace=2)->Bor.son
print(Bor.son)
#Shows important bands
plot(Bor.son,sort=FALSE)
}
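\dontrun{
#Using a custom importance source; a minimal sketch, not part of the package
#getImpAbsCor is a hypothetical adapter; it assumes that getImp is called
#with the predictors as its first argument and the response as its second
getImpAbsCor<-function(x,y,...){
 #Absolute Spearman correlation of each attribute with a numeric response;
 #one value per column of x, higher meaning more important
 sapply(x,function(column) abs(cor(as.numeric(column),y,method="spearman")))
}
#The comment attribute is what the result reports as impSource
comment(getImpAbsCor)<-"Absolute Spearman correlation"
#Regression on the mtcars data, with mpg as the response
Boruta(mtcars[,-1],mtcars$mpg,getImp=getImpAbsCor)->Bor.cor
print(Bor.cor)
}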
}
\references{
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package.
\emph{Journal of Statistical Software, 36(11)}, p. 1-13.
URL: \url{http://www.jstatsoft.org/v36/i11/}
}