How to use random forests in R with missing values?

  library(randomForest)
  rf.model <- randomForest(WIN ~ ., data = learn)

I would like to fit a random forest model, but I get this error:

 Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, : missing values in object 

I have a data frame with 16 numeric attributes and WIN is a factor with levels 0 1.

+68
r machine-learning random-forest missing-data na
Dec 03 '11 at 19:44
3 answers

My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But after checking ?randomForest, I have to admit that it could be much more explicit about this.

(Although Breiman's PDF, linked in the documentation, does explicitly say that missing values are simply not handled at all.)

The only obvious clue in the official documentation that I could see was that the default value for the na.action parameter is na.fail, which might be too cryptic for new users.

In any case, if your predictors have missing values, you have (basically) two choices:

  • Use a different tool (rpart handles missing values well.)
  • Impute the missing values

Not surprisingly, the randomForest package has a function for doing just this, rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
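As a rough sketch of the imputation route, the snippet below builds a small synthetic data frame standing in for the asker's `learn` (the column names x1, x2, x3 and the data itself are assumptions made up for illustration), fills the predictors with rfImpute, and then fits the forest on the completed data. It follows the pattern of the example in ?rfImpute rather than anything specific to the asker's data.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for `learn`: a 0/1 factor response plus numeric predictors
n <- 200
learn <- data.frame(
  WIN = factor(sample(0:1, n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n),
  x3  = rnorm(n)
)

# Punch some holes in the predictors only; rfImpute needs a complete response
learn$x1[sample(n, 15)] <- NA
learn$x2[sample(n, 10)] <- NA

# Impute the predictors using random forest proximities
learn.imputed <- rfImpute(WIN ~ ., data = learn, iter = 5, ntree = 300)

# Fit on the completed data; na.fail no longer triggers
rf.model <- randomForest(WIN ~ ., data = learn.imputed)
```

The returned data frame has the response as its first column, followed by the filled-in predictors, so the original formula can be reused unchanged.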

If only a small number of cases have missing values, you can also try setting na.action = na.omit to simply remove these cases.
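A minimal sketch of the na.omit route, again on made-up data standing in for `learn` (the predictor names are hypothetical). It first counts how many rows would be dropped, which is worth checking before deciding that omission is acceptable.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for `learn`
n <- 200
learn <- data.frame(
  WIN = factor(sample(0:1, n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
learn$x1[sample(n, 8)] <- NA

# How many rows contain at least one NA and would be removed?
n.dropped <- sum(!complete.cases(learn))

# Rows with any NA are removed before the forest is grown
rf.model <- randomForest(WIN ~ ., data = learn, na.action = na.omit)
```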

And of course, this answer is a bit of a guess that your problem really is simply having missing values.

+110
Dec 04 '11 at 2:10

If the missing values might be informative, you can impute them and also add binary indicator variables (with new.vars <- is.na(your_dataset)), then check whether this lowers the error. If new.vars is too large a set to add to your_dataset in full, you can fit a model on it alone, pick the significant variables with varImpPlot, and add only those to your_dataset. You could also try adding a single variable to your_dataset that counts the number of NAs per row: new.var <- rowSums(new.vars).
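One way this could look in practice is sketched below. The data are synthetic, the column names (y, x1, x2, x3) and the "_missing" / n_missing naming are assumptions for illustration, and na.roughfix is used as the imputation step only because it is the simplest option; rfImpute or another method could be substituted.

```r
library(randomForest)

set.seed(1)

# Hypothetical data: factor response y, numeric predictors with some NAs
n <- 300
your_dataset <- data.frame(
  y  = factor(sample(c("a", "b"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
your_dataset$x1[sample(n, 40)] <- NA
your_dataset$x2[sample(n, 25)] <- NA

predictors <- setdiff(names(your_dataset), "y")

# 0/1 missingness indicators, kept only for columns that actually contain NAs
new.vars <- 1L * is.na(your_dataset[predictors])
new.vars <- new.vars[, colSums(new.vars) > 0, drop = FALSE]
colnames(new.vars) <- paste0(colnames(new.vars), "_missing")

# Single variable counting the missing values in each row
new.var <- rowSums(new.vars)

# Simple median/mode fill as the imputation step
filled <- na.roughfix(your_dataset)

augmented <- cbind(filled, new.vars, n_missing = new.var)

rf.plain     <- randomForest(y ~ ., data = filled)
rf.augmented <- randomForest(y ~ ., data = augmented)

# Compare the final out-of-bag error rates of the two models
rf.plain$err.rate[rf.plain$ntree, "OOB"]
rf.augmented$err.rate[rf.augmented$ntree, "OOB"]

# Inspect whether any of the missingness variables carry importance
varImpPlot(rf.augmented)
```

Because the response here is pure noise, no improvement should be expected from this toy data; the point is only the mechanics of building and comparing the two models.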

This is not off-topic: if the missingness is informative, accounting for it can correct for the increase in model error that an imperfect imputation procedure alone would cause.

Missing values are informative when they arise from non-random causes, which is especially common in social-experiment settings.

+3
Feb 19 '17 at 13:01

Breiman's random forest, on which the randomForest package is based, actually does handle missing values in predictors. In the randomForest package you can set

  na.action = na.roughfix 

It starts by filling the missing values with the median / mode. In the randomForest package, the step that then grows a forest, computes proximities, and iterates with the newly filled values is carried out by rfImpute, which uses the na.roughfix result as its starting point. This is not well explained in the randomForest documentation. It only says

.... NAs are replaced with column medians .... This is used as a starting point for imputing missing values by random forest
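A short sketch of how this option is passed, using made-up data in place of the asker's `learn` (the predictor names are hypothetical). It also calls na.roughfix directly, which makes it easy to inspect what the filled-in values look like before any forest is grown.

```r
library(randomForest)

set.seed(7)

# Hypothetical stand-in for `learn`
n <- 200
learn <- data.frame(
  WIN = factor(sample(0:1, n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
na.rows <- sample(n, 20)
learn$x1[na.rows] <- NA

# Pass the fill-in function as the na.action
rf.model <- randomForest(WIN ~ ., data = learn, na.action = na.roughfix)

# Calling it directly shows the fill: NAs in a numeric column become its median
learn.filled <- na.roughfix(learn)
x1.median <- median(learn$x1, na.rm = TRUE)
all(learn.filled$x1[na.rows] == x1.median)
```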

You can find a bit more information on Breiman's homepage.

missfill = 1,2 does a fast replacement of the missing values for the training set (if equal to 1) and a more careful replacement (if equal to 2).

mfixrep = k with missfill = 2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox > 0).

0
Jul 08 '19 at 14:21