Getting predictions after rfImpute

I am doing some modeling using the randomForest package. The rfImpute function is very good for handling missing values when fitting the model. However, is there a way to get predictions for new cases that also have missing values?

The following is adapted from the example in ?rfImpute.

 library(randomForest)

 iris.na <- iris
 set.seed(111)
 ## artificially drop some data values.
 for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA
 ## impute the dropped values
 set.seed(222)
 iris.imputed <- rfImpute(Species ~ ., iris.na)
 ## fit the model
 set.seed(333)
 iris.rf <- randomForest(Species ~ ., iris.imputed)

 # now try to predict for a case where a variable is missing
 > predict(iris.rf, iris.na[148, , drop=FALSE])
 [1] <NA>
 Levels: setosa versicolor virginica
2 answers

Four years and one company later ....

The rxDForest function that ships with Microsoft R Server / Client can produce predictions for cases with missing values. This is because rxDForest uses the same underlying code as rxDTree to fit the individual decision trees, and therefore benefits from the latter's ability to create surrogate variables.

 iris.na <- iris
 set.seed(111)
 ## artificially drop some data values.
 for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA

 library(RevoScaleR)
 # rxDForest doesn't support dot-notation for formulas
 iris.rxf <- rxDForest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
     data=iris.na, nTree=100)
 pred <- rxPredict(iris.rxf, iris.na)  # not predict()

 table(pred)
 # setosa versicolor virginica
 #     50         48        52
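The surrogate idea can be tried without RevoScaleR by fitting a single tree with rpart, which ships with R. This is only an illustrative sketch using a different package, not the rxDForest code itself; the count of 15 dropped values per column is an arbitrary choice for the example.

```r
library(rpart)

iris.na <- iris
set.seed(111)
# blank out 15 values in each predictor column (arbitrary count for illustration)
for (i in 1:4) iris.na[sample(150, 15), i] <- NA

# rpart keeps rows with partially missing predictors and stores surrogate splits
fit <- rpart(Species ~ ., data = iris.na, method = "class")

# when a row's primary split variable is NA, predict() falls back on the surrogates
pred <- predict(fit, iris.na, type = "class")
sum(is.na(pred))   # number of cases left without a prediction
table(pred)
```

A single tree is much weaker than a forest, so this shows the mechanism rather than a replacement for the model above.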

(The answer from @alex keil, while inventive, is not very practical in a production setting, because it requires refitting the model for each prediction call. With a decent-sized dataset, that can take minutes or hours.)


This is probably not the clean solution you are looking for, but here is one way forward. The problem is twofold:

1) the NA values of the new case must be imputed using the same imputation procedure that was applied to the original data.

2) the outcome should be predicted from those imputed values, but using the original random forest, fitted without the new data.

1

Append the new observation, with its missing values, to the imputed (rather than the original) data set, i.e. reuse the imputed data you already have, and run rfImpute again. The newly imputed value does not match the value imputed for the original observation (nor should it).

 iris.na2 = rbind(iris.imputed, iris.na[148, , drop = FALSE])
 iris.imputed2 = rfImpute(Species ~ ., iris.na2)
 > tail(iris.imputed, 3)
       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
 148 virginica          6.5    3.019279          5.2         2.0
 149 virginica          6.2    3.400000          5.4         2.3
 150 virginica          5.9    3.000000          5.1         1.8
 > tail(iris.imputed2, 4)
        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
 148  virginica          6.5    3.019279          5.2         2.0
 149  virginica          6.2    3.400000          5.4         2.3
 150  virginica          5.9    3.000000          5.1         1.8
 1481 virginica          6.5    3.023392          5.2         2.0

2

Predict the newly imputed observation using the original random forest.

 > predict(iris.rf, iris.imputed2[151, ])
      1481 
 virginica 
 Levels: setosa versicolor virginica
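The two steps can be wrapped in a small function. The following is a minimal sketch: `predict_after_impute` and `new_case` are hypothetical names introduced here, not part of randomForest, and the sketch assumes the new rows carry a known response (as row 148 does above), because rfImpute needs a complete response column.

```r
library(randomForest)

# Hypothetical helper (not part of randomForest): re-impute `new_rows` against
# the already-imputed training data, then predict with the ORIGINAL forest.
predict_after_impute <- function(rf, imputed_train, new_rows, fml) {
  combined <- rbind(imputed_train, new_rows)       # data frames are matched by column name
  reimputed <- rfImpute(fml, combined)             # new_rows must contain at least one NA
  new_idx <- seq(nrow(imputed_train) + 1, nrow(combined))
  predict(rf, reimputed[new_idx, , drop = FALSE])  # rf itself is never refitted
}

# Usage on data set up along the lines of the question's example
set.seed(111)
iris.na <- iris
for (i in 1:4) iris.na[sample(150, 15), i] <- NA   # 15 per column is an arbitrary choice
new_case <- iris[148, ]
new_case$Sepal.Width <- NA                         # the value we pretend not to know

set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

pred <- predict_after_impute(iris.rf, iris.imputed, new_case, Species ~ .)
pred
```

Each call reruns rfImpute on the whole data set, so the cost grows with the size of the training data.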

There will be issues with the variance, because this does not account for the uncertainty inherent in using imputed data to impute another data point. One way around this is to bootstrap.

This also works when the dependent variable is not available (predict does not use the dependent variable, so you can supply just the independent variables):
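One possible way to expose that extra uncertainty is to repeat the whole impute-and-fit cycle on bootstrap resamples of the training data and see how stable the prediction for the new case is. The sketch below is an assumption about how this could be done, not a procedure from the randomForest documentation; `B`, `train`, `new_case` and the reduced `iter`/`ntree` settings are illustrative choices.

```r
library(randomForest)

set.seed(111)
train <- iris
for (i in 1:4) train[sample(150, 15), i] <- NA   # training data with missing values
new_case <- iris[148, ]
new_case$Sepal.Width <- NA                       # new case with a missing predictor
train <- train[-148, ]                           # keep the new case out of the training set

B <- 20                                          # number of bootstrap replicates (illustrative)
votes <- character(B)
for (b in seq_len(B)) {
  # resample the training rows with replacement
  boot <- train[sample(nrow(train), replace = TRUE), ]
  # impute the resample and the new case together
  imp <- rfImpute(Species ~ ., rbind(boot, new_case), iter = 3, ntree = 100)
  n <- nrow(imp)
  # fit on the imputed resample only, then predict the imputed new case
  rf_b <- randomForest(Species ~ ., imp[-n, ], ntree = 100)
  votes[b] <- as.character(predict(rf_b, imp[n, , drop = FALSE]))
}

# share of replicates voting for each class
table(votes) / B
```

The spread of the votes across replicates gives a rough picture of how much the prediction depends on the imputation.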

 > missY = cbind(NA, iris.imputed2[151, 2:5])
 > missY
      NA Sepal.Length Sepal.Width Petal.Length Petal.Width
 1481 NA          6.5    3.023392          5.2           2

 > predict(iris.rf, missY)
      1481 
 virginica 
 Levels: setosa versicolor virginica