How to handle errors in the prediction function in R?

I have a dataframe df and I am building a machine learning model (a C5.0 decision tree) to predict the class column (loan_approved):

Structure (not real data):

 id occupation income  loan_approved
 1  business   4214214 yes
 2  business   32134   yes
 3  business   43255   no
 4  sailor     5642    yes
 5  teacher    53335   no
 6  teacher    6342    no
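For reference, a minimal sketch that reconstructs this example data frame (values taken from the table above; not the real data):

    df <- data.frame(
      id            = 1:6,
      occupation    = factor(c("business", "business", "business",
                               "sailor", "teacher", "teacher")),
      income        = c(4214214, 32134, 43255, 5642, 53335, 6342),
      loan_approved = factor(c("yes", "yes", "no", "yes", "no", "no"))
    )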

Process:

  • I randomly split the data frame into train and test sets and trained the model on the train set (rows 1, 2, 3, 5, 6 as train and row 4 as test).
  • To accommodate new categorical levels in one or more columns, I used the tryCatch function.

Function:

    # returns the model's prediction for rows x of test,
    # or "no" if predict() throws an error
    error_free_predict <- function(x) {
      output <- tryCatch({
        predict(C50_model, newdata = test[x, ], type = "class")
      }, error = function(e) {
        "no"
      })
      return(output)
    }

Prediction function applied:

    library(dplyr)
    test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))

Problem:

 id occupation income  loan_approved predicted_class
 1  business   4214214 yes           no
 2  business   32134   yes           no
 3  business   43255   no            no
 4  sailor     5642    yes           no
 5  teacher    53335   no            no
 6  teacher    6342    no            no

Question:

I know this is because the test data frame had a level that is missing from the train data, but should my function not have worked for every row except that one?

PS: I did not use sapply because it was too slow.

2 answers

There are two parts to this problem.

  • The first part of the problem arises during model training: if you split randomly, the levels of a categorical variable are not guaranteed to be spread across both train and test. In your case, say you have only one record with the occupation "sailor"; it may well end up in the test set after the split. A model built on the train set would then never have seen the "sailor" level, and predicting on it raises an error. More generally, after a random split any level of a categorical variable may land entirely in the test set.

Therefore, instead of splitting the data randomly between train and test, you can use stratified sampling. Code using data.table for a 70:30 split:

    library(data.table)
    # sample 30% of the row indices within each occupation level; these become the test set
    ind   <- total_data[, sample(.I, round(0.3 * .N), FALSE), by = "occupation"]$V1
    train <- total_data[-ind, ]
    test  <- total_data[ind, ]

This ensures that every level is split proportionally between the train and test sets, so you will not get a "new" categorical level in the test set, which can happen with a purely random split.
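A quick sanity check on the split (assuming the occupation column from the question) is to compare per-level counts on both sides:

    # each level should split roughly 70:30 between train and test
    table(train$occupation)
    table(test$occupation)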

  • The second part of the problem arises when the model is in production and encounters a completely new level that appeared in neither the training nor the test set. To handle this, you can save a list of all levels of every categorical variable, e.g. lvl_cat_var1 <- unique(cat_var1), lvl_cat_var2 <- unique(cat_var2), and so on. Then, before predicting, you can check for new levels and filter:

     # rows with at least one unseen level get the default label; the rest go to the model
     new_lvl_data <- total_data[!(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
     pred_data    <- total_data[ (var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]

then assign the default prediction to those rows:

 new_lvl_data$predicted_class <- "no" 

and run the full-blown prediction on pred_data.
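A sketch of that last step, reusing the hypothetical names above (C50_model, pred_data, new_lvl_data):

    # predict only on rows whose categorical levels were all seen during training
    pred_data$predicted_class <- as.character(
      predict(C50_model, newdata = pred_data, type = "class")
    )
    # stack back together with the default-labelled rows
    scored_data <- rbind(pred_data, new_lvl_data)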


I usually handle this with a loop in which any level not present in the train data is recoded as NA. Here, train is the data you used to train the model and test is the data that will be used to predict.

    # give every factor column in test exactly the levels seen in train;
    # values with unseen levels become NA
    for (i in 1:ncol(train)) {
      if (is.factor(train[, i])) {
        test[, i] <- factor(test[, i], levels = levels(train[, i]))
      }
    }

tryCatch is an error-handling mechanism; that is, it only kicks in after an error has occurred. It is not the right tool unless you want to do something else once the error happens. If you still want the model to run, this loop will take care of the new levels.
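For example, after running the loop, a single predict() call over the whole test set should no longer error out (a sketch assuming the C50_model and test objects from the question; rows that contained an unseen level now hold NA in that column):

    test$predicted_class <- predict(C50_model, newdata = test, type = "class")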

