The randomForest package in R throws an error in predict() if the test data contain new factor levels. Is there any way to avoid this error?

Question

My training data has a factor predictor with 30 levels. The same predictor in my test data also has 30 levels, but some of them are different. randomForest will not predict when the levels do not match; it stops with this error: Error in predict.randomForest(model, test) : New factor levels not present in the training data

+4

r random-forest

Ayush Raj Singh Jun 12 '13 at 6:58

4 answers

Use this to give the test column the same levels as the training column (here test and train refer to the corresponding columns in the test and training datasets):

 test<-factor(test, levels=levels(train))
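
To see what this re-encoding does, here is a minimal sketch on made-up vectors (the toy values and the name `test_aligned` are assumptions for illustration, not from the question). Note that a level seen only in the test column is not added; those values become NA, so the affected rows still need separate handling.

```r
# Toy columns (hypothetical data, for illustration only)
train <- factor(c("a", "b", "c", "a"))
test  <- factor(c("a", "c", "d"))   # "d" never appears in train

# Re-encode the test column using the training levels
test_aligned <- factor(test, levels = levels(train))

levels(test_aligned)   # same level set as train
is.na(test_aligned)    # TRUE only where the value was the unseen level "d"
```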

+2

Mr Curious Sep 17 '15 at 6:27

A simple solution would be to rbind your test data with your training data, run the prediction, and then subset the test rows out of the predictions on the combined data. I have tested this method.

+1

Mahi Nov 04 '15 at 13:17

This problem occurs when the factor levels in your test data do not match the levels in the training data.

A simple fix is to:

1. Load the test data with character columns as factors.
2. rbind() the test data with the training data.
3. Extract the test rows from the result of step 2 and run the prediction on them.
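
The steps above can be sketched roughly as follows. The data frames, the `colour` and `y` columns, and the `_new` names are hypothetical; the sketch relies on rbind() for data frames expanding factor levels to cover both inputs, and it assumes the model is fitted on the training rows after they have been split back out.

```r
# Hypothetical toy data frames; column names are made up for illustration
train_data <- data.frame(colour = factor(c("red", "green", "red")),
                         y = c(1.0, 2.0, 1.5))
test_data  <- data.frame(colour = factor(c("green", "blue")),
                         y = NA_real_)

# Step 2: stack the rows; rbind() on data frames expands the factor levels
combined <- rbind(train_data, test_data)

# Step 3: pull the training and test rows back out by position
n_train   <- nrow(train_data)
train_new <- combined[seq_len(n_train), ]
test_new  <- combined[(n_train + 1):nrow(combined), ]

levels(train_new$colour)   # now also contains "blue"
```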

You can also try

test_data <- factor(test_data, levels = levels(train_data))

0

Rijin Feb 19 '19 at 10:41

Tommy levi · Accepted Answer · 2013-06-12T16:58:06+0000

The workaround I found is to first convert the factor variables in your training and test sets to character

 test$factor <- as.character(test$factor)
 train$factor <- as.character(train$factor)

Then add a column for each with a flag for test / train, i.e.

 test$isTest <- rep(1,nrow(test))
 train$isTest <- rep(0,nrow(train))

Then rbind them

 fullSet <- rbind(test,train)

Then convert back to a factor

 fullSet$factor <- as.factor(fullSet$factor)

This ensures that both the test and training sets have the same levels. Then you can split them back apart:

 test.new <- fullSet[fullSet$isTest==1,]
 train.new <- fullSet[fullSet$isTest==0,]

and you can drop the isTest column from each (set it to NULL). Then you will have sets with the same levels that you can train and test on. There may be a more elegant solution, but this has worked for me in the past, and you can wrap it in a small function if you need to repeat it often.
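
One possible way to wrap these steps in a small function is sketched below. The function name `align_levels`, its arguments, and the toy data are assumptions made for illustration; it handles a single column named by `col` and follows the same character, flag, rbind, factor, split sequence.

```r
# Hypothetical helper: give one column the same factor levels in both sets.
# The function name and its arguments are illustrative, not from any package.
align_levels <- function(train, test, col) {
  # Convert to character so the factor can be rebuilt from scratch
  train[[col]] <- as.character(train[[col]])
  test[[col]]  <- as.character(test[[col]])

  # Flag the origin of each row
  train$isTest <- 0
  test$isTest  <- 1

  # Stack the rows, then rebuild the factor over all observed values
  fullSet <- rbind(test, train)
  fullSet[[col]] <- as.factor(fullSet[[col]])

  # Split back apart and drop the helper column
  test.new  <- fullSet[fullSet$isTest == 1, ]
  train.new <- fullSet[fullSet$isTest == 0, ]
  test.new$isTest  <- NULL
  train.new$isTest <- NULL

  list(train = train.new, test = test.new)
}

# Usage with toy data
train <- data.frame(f = c("a", "b", "a"), x = 1:3)
test  <- data.frame(f = c("b", "z"), x = 4:5)
sets  <- align_levels(train, test, "f")
identical(levels(sets$train$f), levels(sets$test$f))
```

The model would then be fitted on `sets$train` and predictions made on `sets$test`, since both now share one level set.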
