Boruta.R 15 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330
  1. # Core of Boruta
  2. #' @export
  3. #' @rdname Boruta
  4. Boruta<-function(x,...)
  5. UseMethod("Boruta")
  6. #' Feature selection with the Boruta algorithm
  7. #'
  8. #' Boruta is an all relevant feature selection wrapper algorithm, capable of working with any classification method that output variable importance measure (VIM); by default, Boruta uses Random Forest.
  9. #' The method performs a top-down search for relevant features by comparing original attributes' importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilise that test.
  10. #' @rdname Boruta
  11. #' @method Boruta default
  12. #' @param x data frame of predictors.
  13. #' @param y response vector; factor for classification, numeric vector for regression, \code{Surv} object for survival (supports depends on importance adapter capabilities).
  14. #' @param getImp function used to obtain attribute importance.
  15. #' The default is getImpRfZ, which runs random forest from the \code{ranger} package and gathers Z-scores of mean decrease accuracy measure.
  16. #' It should return a numeric vector of a size identical to the number of columns of its first argument, containing importance measure of respective attributes.
  17. #' Any order-preserving transformation of this measure will yield the same result.
  18. #' It is assumed that more important attributes get higher importance. +-Inf are accepted, NaNs and NAs are treated as 0s, with a warning.
  19. #' @param pValue confidence level. Default value should be used.
  20. #' @param mcAdj if set to \code{TRUE}, a multiple comparisons adjustment using the Bonferroni method will be applied. Default value should be used; older (1.x and 2.x) versions of Boruta were effectively using \code{FALSE}.
  21. #' @param maxRuns maximal number of importance source runs.
  22. #' You may increase it to resolve attributes left Tentative.
  23. #' @param holdHistory if set to \code{TRUE}, the full history of importance is stored and returned as the \code{ImpHistory} element of the result.
  24. #' Can be used to decrease a memory footprint of Boruta in case this side data is not used, especially when the number of attributes is huge; yet it disables plotting of such made \code{Boruta} objects and the use of the \code{\link{TentativeRoughFix}} function.
  25. #' @param doTrace verbosity level. 0 means no tracing, 1 means reporting decision about each attribute as soon as it is justified, 2 means the same as 1, plus reporting each importance source run, 3 means the same as 2, plus reporting of hits assigned to yet undecided attributes.
  26. #' @param ... additional parameters passed to \code{getImp}.
  27. #' @return An object of class \code{Boruta}, which is a list with the following components:
  28. #' \item{finalDecision}{a factor of three value: \code{Confirmed}, \code{Rejected} or \code{Tentative}, containing final result of feature selection.}
  29. #' \item{ImpHistory}{a data frame of importances of attributes gathered in each importance source run.
  30. #' Beside predictors' importances, it contains maximal, mean and minimal importance of shadow attributes in each run.
  31. #' Rejected attributes get \code{-Inf} importance.
  32. #' Set to \code{NULL} if \code{holdHistory} was given \code{FALSE}.}
  33. #' \item{timeTaken}{time taken by the computation.}
  34. #' \item{impSource}{string describing the source of importance, equal to a comment attribute of the \code{getImp} argument.}
  35. #' \item{call}{the original call of the \code{Boruta} function.}
  36. #' @details Boruta iteratively compares importances of attributes with importances of shadow attributes, created by shuffling original ones.
  37. #' Attributes that have significantly worst importance than shadow ones are being consecutively dropped.
  38. #' On the other hand, attributes that are significantly better than shadows are admitted to be Confirmed.
  39. #' Shadows are re-created in each iteration.
  40. #' Algorithm stops when only Confirmed attributes are left, or when it reaches \code{maxRuns} importance source runs.
  41. #' If the second scenario occurs, some attributes may be left without a decision.
  42. #' They are claimed Tentative.
  43. #' You may try to extend \code{maxRuns} or lower \code{pValue} to clarify them, but in some cases their importances do fluctuate too much for Boruta to converge.
  44. #' Instead, you can use \code{\link{TentativeRoughFix}} function, which will perform other, weaker test to make a final decision, or simply treat them as undecided in further analysis.
  45. #' @references Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package.
  46. #' \emph{Journal of Statistical Software, 36(11)}, p. 1-13.
  47. #' URL: \url{http://www.jstatsoft.org/v36/i11/}
  48. #' @export
  49. #' @examples
  50. #' set.seed(777)
  51. #'
  52. #' #Boruta on the "small redundant XOR" problem; read ?srx for details
  53. #' data(srx)
  54. #' Boruta(Y~.,data=srx)->Boruta.srx
  55. #'
  56. #' #Results summary
  57. #' print(Boruta.srx)
  58. #'
  59. #' #Result plot
  60. #' plot(Boruta.srx)
  61. #'
  62. #' #Attribute statistics
  63. #' attStats(Boruta.srx)
  64. #'
  65. #' #Using alternative importance source, rFerns
  66. #' Boruta(Y~.,data=srx,getImp=getImpFerns)->Boruta.srx.ferns
  67. #' print(Boruta.srx.ferns)
  68. #'
  69. #' #Verbose
  70. #' Boruta(Y~.,data=srx,doTrace=2)->Boruta.srx
  71. #'
  72. #' \dontrun{
  73. #' #Boruta on the iris problem extended with artificial irrelevant features
  74. #' #Generate said features
  75. #' iris.extended<-data.frame(iris,apply(iris[,-5],2,sample))
  76. #' names(iris.extended)[6:9]<-paste("Nonsense",1:4,sep="")
  77. #' #Run Boruta on this data
  78. #' Boruta(Species~.,data=iris.extended,doTrace=2)->Boruta.iris.extended
  79. #' #Nonsense attributes should be rejected
  80. #' print(Boruta.iris.extended)
  81. #' }
  82. #'
  83. #' \dontrun{
  84. #' #Boruta on the HouseVotes84 data from mlbench
  85. #' library(mlbench); data(HouseVotes84)
  86. #' na.omit(HouseVotes84)->hvo
  87. #' #Takes some time, so be patient
  88. #' Boruta(Class~.,data=hvo,doTrace=2)->Bor.hvo
  89. #' print(Bor.hvo)
  90. #' plot(Bor.hvo)
  91. #' plotImpHistory(Bor.hvo)
  92. #' }
  93. #' \dontrun{
  94. #' #Boruta on the Ozone data from mlbench
  95. #' library(mlbench); data(Ozone)
  96. #' library(randomForest)
  97. #' na.omit(Ozone)->ozo
  98. #' Boruta(V4~.,data=ozo,doTrace=2)->Bor.ozo
  99. #' cat('Random forest run on all attributes:\n')
  100. #' print(randomForest(V4~.,data=ozo))
  101. #' cat('Random forest run only on confirmed attributes:\n')
  102. #' print(randomForest(ozo[,getSelectedAttributes(Bor.ozo)],ozo$V4))
  103. #' }
  104. #' \dontrun{
  105. #' #Boruta on the Sonar data from mlbench
  106. #' library(mlbench); data(Sonar)
  107. #' #Takes some time, so be patient
  108. #' Boruta(Class~.,data=Sonar,doTrace=2)->Bor.son
  109. #' print(Bor.son)
  110. #' #Shows important bands
  111. #' plot(Bor.son,sort=FALSE)
  112. #' }
  113. Boruta.default<-function(x,y,pValue=0.01,mcAdj=TRUE,maxRuns=100,doTrace=0,holdHistory=TRUE,getImp=getImpRfZ,...){
  114. #Timer starts... now!
  115. timeStart<-Sys.time()
  116. #Extract the call to store in output
  117. cl<-match.call()
  118. cl[[1]]<-as.name('Boruta')
  119. #Convert x into a data.frame
  120. if(!is.data.frame(x))
  121. x<-data.frame(x)
  122. ##Some checks on x & y
  123. if(length(grep('^shadow',names(x)))>0)
  124. stop('Attributes with names starting from "shadow" are reserved for internal use. Please rename them.')
  125. if(any(c(is.na(x),is.na(y))))
  126. stop('Cannot process NAs in input. Please remove them.')
  127. if(maxRuns<11)
  128. stop('maxRuns must be greater than 10.')
  129. ##Expands the information system with newly built random attributes and calculates importance
  130. addShadowsAndGetImp<-function(decReg,runs){
  131. #xSha is going to be a data frame with shadow attributes; time to init it.
  132. xSha<-x[,decReg!="Rejected",drop=F]
  133. while(dim(xSha)[2]<5) xSha<-cbind(xSha,xSha); #There must be at least 5 random attributes.
  134. #Now, we permute values in each attribute
  135. nSha<-ncol(xSha)
  136. data.frame(lapply(xSha,sample))->xSha
  137. names(xSha)<-paste('shadow',1:nSha,sep="")
  138. #Notifying user of our progress
  139. if(doTrace>1)
  140. message(sprintf(' %s. run of importance source...',runs))
  141. #Calling importance source; "..." can be used by the user to pass rf attributes (for instance ntree)
  142. impRaw<-getImp(cbind(x[,decReg!="Rejected"],xSha),y,...)
  143. if(!is.numeric(impRaw))
  144. stop("getImp result is not a numeric vector. Please check the given getImp function.")
  145. if(length(impRaw)!=sum(decReg!="Rejected")+ncol(xSha))
  146. stop("getImp result has a wrong length. Please check the given getImp function.")
  147. if(any(is.na(impRaw)|is.nan(impRaw))){
  148. impRaw[is.na(impRaw)|is.nan(impRaw)]<-0
  149. warning("getImp result contains NA(s) or NaN(s); replacing with 0(s), yet this is suspicious.")
  150. }
  151. #Importance must have Rejected attributes put on place and filled with -Infs
  152. imp<-rep(-Inf,nAtt+nSha);names(imp)<-c(attNames,names(xSha))
  153. impRaw->imp[c(decReg!="Rejected",rep(TRUE,nSha))]
  154. shaImp<-imp[(nAtt+1):length(imp)];imp[1:nAtt]->imp
  155. return(list(imp=imp,shaImp=shaImp))
  156. }
  157. ##Assigns hits
  158. assignHits<-function(hitReg,curImp){
  159. curImp$imp>max(curImp$shaImp)->hits
  160. if(doTrace>2){
  161. uncMask<-decReg=="Tentative"
  162. intHits<-sum(hits[uncMask])
  163. if(intHits>0)
  164. message(sprintf("Assigned hit to %s attribute%s out of %s undecided.",sum(hits[uncMask]),if(intHits==1) "" else "s",sum(uncMask)))
  165. else
  166. message("None of undecided attributes scored a hit.")
  167. }
  168. hitReg[hits]<-hitReg[hits]+1
  169. return(hitReg)
  170. }
  171. ##Checks whether number of hits is significant
  172. doTests<-function(decReg,hitReg,runs){
  173. pAdjMethod<-ifelse(mcAdj[1],'bonferroni','none')
  174. #If attribute is significantly more frequent better than shadowMax, its claimed Confirmed
  175. toAccept<-stats::p.adjust(stats::pbinom(hitReg-1,runs,0.5,lower.tail=FALSE),method=pAdjMethod)<pValue
  176. (decReg=="Tentative" & toAccept)->toAccept
  177. #If attribute is significantly more frequent worse than shadowMax, its claimed Rejected (=irrelevant)
  178. toReject<-stats::p.adjust(stats::pbinom(hitReg,runs,0.5,lower.tail=TRUE),method=pAdjMethod)<pValue
  179. (decReg=="Tentative" & toReject)->toReject
  180. #Update decReg
  181. decReg[toAccept]<-"Confirmed";"Rejected"->decReg[toReject]
  182. #Report progress
  183. if(doTrace>0){
  184. nAcc<-sum(toAccept)
  185. nRej<-sum(toReject)
  186. nLeft<-sum(decReg=="Tentative")
  187. if(nAcc+nRej>0)
  188. message(sprintf("After %s iterations, +%s: ",runs,format(difftime(Sys.time(),timeStart),digits=2)))
  189. if(nAcc>0)
  190. message(sprintf(" confirmed %s attribute%s: %s",
  191. nAcc,ifelse(nAcc==1,'','s'),.attListPrettyPrint(attNames[toAccept])))
  192. if(nRej>0)
  193. message(sprintf(" rejected %s attribute%s: %s",
  194. nRej,ifelse(nRej==1,'','s'),.attListPrettyPrint(attNames[toReject])))
  195. if(nAcc+nRej>0)
  196. if(nLeft>0){
  197. message(sprintf(" still have %s attribute%s left.\n",
  198. nLeft,ifelse(nLeft==1,'','s')))
  199. }else{
  200. if(nAcc+nRej>0) message(" no more attributes left.\n")
  201. }
  202. }
  203. return(decReg)
  204. }
  205. ##Creating some useful constants
  206. nAtt<-ncol(x); nrow(x)->nObjects
  207. attNames<-names(x); c("Tentative","Confirmed","Rejected")->confLevels
  208. ##Initiate state
  209. decReg<-factor(rep("Tentative",nAtt),levels=confLevels)
  210. hitReg<-rep(0,nAtt);names(hitReg)<-attNames
  211. impHistory<-list()
  212. runs<-0
  213. ##Main loop
  214. while(any(decReg=="Tentative") && (runs+1->runs)<maxRuns){
  215. curImp<-addShadowsAndGetImp(decReg,runs)
  216. hitReg<-assignHits(hitReg,curImp)
  217. decReg<-doTests(decReg,hitReg,runs)
  218. #If needed, update impHistory with scores obtained in this iteration
  219. if(holdHistory){
  220. imp<-c(curImp$imp,
  221. shadowMax=max(curImp$shaImp),
  222. shadowMean=mean(curImp$shaImp),
  223. shadowMin=min(curImp$shaImp))
  224. impHistory<-c(impHistory,list(imp))
  225. }
  226. }
  227. ##Building result
  228. impHistory<-do.call(rbind,impHistory)
  229. names(decReg)<-attNames
  230. ans<-list(finalDecision=decReg,ImpHistory=impHistory,
  231. pValue=pValue,maxRuns=maxRuns,light=TRUE,mcAdj=mcAdj,
  232. timeTaken=Sys.time()-timeStart,roughfixed=FALSE,call=cl,
  233. impSource=comment(getImp))
  234. "Boruta"->class(ans)
  235. return(ans)
  236. }
  237. .attListPrettyPrint<-function(x,limit=5){
  238. x<-sort(x)
  239. if(length(x)<limit+1)
  240. return(sprintf("%s;",paste(x,collapse=", ")))
  241. sprintf("%s and %s more;",paste(utils::head(x,limit),collapse=", "),length(x)-limit)
  242. }
  243. #' @rdname Boruta
  244. #' @method Boruta formula
  245. #' @param formula alternatively, formula describing model to be analysed.
  246. #' @param data in which to interpret formula.
  247. #' @export
  248. Boruta.formula<-function(formula,data=.GlobalEnv,...){
  249. ##Grab and interpret the formula
  250. stats::terms.formula(formula,data=data)->t
  251. x<-eval(attr(t,"variables"),data)
  252. apply(attr(t,"factors"),1,sum)>0->sel
  253. nam<-rownames(attr(t,"factors"))[sel]
  254. data.frame(x[sel])->df;names(df)<-nam
  255. x[[attr(t,"response")]]->dec
  256. ##Run Boruta
  257. ans<-Boruta.default(df,dec,...)
  258. ans$call<-match.call()
  259. ans$call[[1]]<-as.name('Boruta')
  260. formula->ans$call[["formula"]]
  261. return(ans)
  262. }
  263. #' Print Boruta object
  264. #'
  265. #' Print method for the Boruta objects.
  266. #' @method print Boruta
  267. #' @param x an object of a class Boruta.
  268. #' @param ... additional arguments passed to \code{\link{print}}.
  269. #' @return Invisible copy of \code{x}.
  270. #' @export
  271. print.Boruta<-function(x,...){
  272. if(class(x)!='Boruta') stop("This is NOT a Boruta object!")
  273. cat(paste('Boruta performed ',dim(x$ImpHistory)[1],' iterations in ',format(x$timeTaken),'.\n',sep=''))
  274. if(x$roughfixed) cat(paste('Tentatives roughfixed over the last ',x$averageOver,' iterations.\n',sep=''))
  275. if(sum(x$finalDecision=='Confirmed')==0){
  276. cat(' No attributes deemed important.\n')} else {
  277. writeLines(strwrap(paste(sum(x$finalDecision=='Confirmed'),' attributes confirmed important: ',
  278. .attListPrettyPrint(names(x$finalDecision[x$finalDecision=='Confirmed']))),indent=1))
  279. }
  280. if(sum(x$finalDecision=='Rejected')==0){
  281. cat(' No attributes deemed unimportant.\n')} else {
  282. writeLines(strwrap(paste(sum(x$finalDecision=='Rejected'),' attributes confirmed unimportant: ',
  283. .attListPrettyPrint(names(x$finalDecision[x$finalDecision=='Rejected']))),indent=1))
  284. }
  285. if(sum(x$finalDecision=='Tentative')!=0){
  286. writeLines(strwrap(paste(sum(x$finalDecision=='Tentative'),' tentative attributes left: ',
  287. .attListPrettyPrint(names(x$finalDecision[x$finalDecision=='Tentative']))),indent=1))
  288. }
  289. invisible(x)
  290. }
  291. #' Small redundant XOR data
  292. #'
  293. #' A synthetic data set with 32 rows corresponding to all combinations of values of five logical features, A, B, N1, N2 and N3.
  294. #' The decision Y is equal to A xor B, hence N1--N3 are irrelevant attributes.
  295. #' The set also contains 3 additional features, A or B (AoB), A and B (AnB) and not A (nA), which provide a redundant, but still relevant way to reconstruct Y.
  296. #'
  297. #' This is set is an easy way to demonstrate the difference between all relevant feature selection methods, which should select all features except N1--N3, and minimal optimal ones, which will probably ignore most of them.
  298. #' @format A data frame with 8 predictors, 4 relevant: A, B, AoB, AnB and nA, as well as 3 irrelevant N1, N2 and N3, and decision attribute Y.
  299. #' @source \url{https://mbq.me/blog/relevance-and-redundancy}
  300. "srx"