I have a data frame df and I am building a machine learning model (a C5.0 decision tree) to predict the class column (loan_approved):
Structure (not real data):
id  occupation  income   loan_approved
1   business    4214214  yes
2   business    32134    yes
3   business    43255    no
4   sailor      5642     yes
5   teacher     53335    no
6   teacher     6342     no
Process:
- I randomly split the data frame into train and test sets and trained the model on the train set (rows 1, 2, 3, 5 and 6 as train, row 4 as test).
- To cope with new categorical levels in one or more columns, I wrapped the prediction in tryCatch.
Function:
error_free_predict = function(x){
  output = tryCatch({
    predict(C50_model, newdata = test[x,], type = "class")
  }, error = function(e) {
    "no"
  })
  return(output)
}
Prediction function applied:
test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))
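For illustration, here is a minimal base-R sketch of what a call shaped like this does. It assumes that predict() fails as a whole when any row contains an unseen level; toy_predict, known and whole_vector are hypothetical stand-ins, not part of C50 or dplyr. Because the whole index vector is passed in, the wrapped prediction runs only once, so a single error replaces every result with one fallback value.

```r
# Hypothetical stand-in for predict(): errors if ANY row has an unseen level.
toy_predict <- function(occupation, known_levels) {
  if (!all(occupation %in% known_levels)) stop("unseen factor level")
  rep("yes", length(occupation))
}

test <- data.frame(occupation = c("business", "sailor", "teacher"),
                   stringsAsFactors = FALSE)
known <- c("business", "teacher")  # levels assumed to be seen in training

# Same shape as error_free_predict: x is the full index vector,
# so tryCatch wraps one call covering all rows.
whole_vector <- function(x) {
  tryCatch(toy_predict(test$occupation[x], known),
           error = function(e) "no")
}

result <- whole_vector(seq_len(nrow(test)))
length(result)  # 1: a single fallback value, not one value per row
```

A length-one result like this is then recycled across all rows when it is assigned as a new column, which would match the output shown under "Problem".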
Problem:
id  occupation  income   loan_approved  predicted_class
1   business    4214214  yes            no
2   business    32134    yes            no
3   business    43255    no             no
4   sailor      5642     yes            no
5   teacher     53335    no             no
6   teacher     6342     no             no
Question:
I know this happens because the test data frame contains a level that is missing from the train data, but shouldn't my function still work for every row except that one?
PS: I did not use sapply because it was too slow.
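One possible way to avoid a row-by-row loop, sketched under assumptions: predict only the rows whose levels were seen in training, and assign the fallback to the rest. toy_predict below is a hypothetical stand-in for predict(C50_model, newdata = ..., type = "class"), and the small train and test frames are made up in the spirit of the structure above. With a real model the result is a factor (so as.character() may be needed), and unused factor levels may have to be dropped first; neither is verified here.

```r
# Made-up data in the shape of the question's data frame.
train <- data.frame(occupation = c("business", "business", "teacher"),
                    income = c(4214214, 43255, 53335),
                    stringsAsFactors = FALSE)
test <- data.frame(occupation = c("business", "sailor", "teacher"),
                   income = c(32134, 5642, 6342),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for the real model's predict() call.
toy_predict <- function(newdata) {
  ifelse(newdata$income > 10000, "yes", "no")
}

# TRUE for rows whose occupation also occurs in the training data.
seen <- test$occupation %in% unique(train$occupation)

# Start with the fallback, then overwrite the rows that can be predicted.
test$predicted_class <- "no"
test$predicted_class[seen] <- toy_predict(test[seen, , drop = FALSE])

print(test)
```

The prediction still runs as one vectorised call, just on a subset, so it should not have the per-row overhead of sapply.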