randomForest does not work when the training set has more factor levels than the test set

Question


When I try to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:

Type of predictors in new data do not match that of the training data.

My training data has a factor variable with 7 levels, and my test data has the same variable with 6 levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th level to the test data, the model runs, so I'm not sure why this is happening or what the logic behind it is.

I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" levels?


r random-forest

bmcarterr Jul 21 '14 at 18:49


1 answer

Mrflick · Accepted Answer · 2014-07-21T19:57:47+0000

R expects the training and test data to have exactly the same factor levels (even if one of the sets has no observations for a given level or levels). In your case, since every level in the test data set also exists in the training set, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure the test factor has the same levels as the training factor and that they are encoded the same way.
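The effect of that one-liner can be checked with a small base-R sketch. The data frames `train` and `test`, the column `val`, and the level names "a" to "g" below are made up for illustration and are not taken from the question. Factors are stored as integer codes that index into the level set, which is why two factors that print the same values can still differ when their level sets differ. Note that re-leveling converts any test value that is not among the training levels to NA, so the sketch checks for that first.

```r
# Hypothetical toy data: 7 levels in train, only 6 of them present in test.
train <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f", "g")))
test  <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f")))

nlevels(train$val)  # 7
nlevels(test$val)   # 6

# Guard: every test level must already exist in train,
# otherwise the re-leveling below would silently produce NA values.
stopifnot(all(levels(test$val) %in% levels(train$val)))

# Re-declare the test factor using the training levels.
test$val <- factor(test$val, levels = levels(train$val))

identical(levels(test$val), levels(train$val))  # TRUE
table(test$val)  # level "g" is now listed, with a count of 0
```

With several factor columns, the same assignment could be applied to each factor column in turn before calling predict().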

(moved here from a comment to close out the question)
