R randomForest for classification

I am trying to do classification with randomForest, but I repeatedly get an error message for which I can find no solution (randomForest worked fine for me when doing regression in the past). I pasted my code below. "success" is a factor; all independent variables are numeric. Any suggestions on how to properly perform this classification?

> rf_model <- randomForest(success ~ ., data = data.train, xtest = data.test[, 2:9], ytest = data.test[, 1], importance = TRUE, proximity = TRUE)
Error in randomForest.default(m, y, ...) :
  NA/NaN/Inf in foreign function call (arg 1)

Also, here is a sample of the data set:

> head(data)
 success duration  goal reward_count updates_count comments_count backers_count min_reward_level max_reward_level
    True 20.00000  1500           10            14              2            68                1             1000
    True 30.00000  3000           10             4              3            48                5             1000
    True 24.40323 14000           23             6             10           540                5             1250
    True 31.95833 30000            9            17              7           173                1            10000
    True 28.13211  4000           10            23             97          2936               10              550
    True 30.00000  6000           16            16            130          2043               25              500
5 answers

Did you try regression with the same data? If not, check for Inf values in your data and remove them, if any, after removing NA and NaN. The question below has useful information on finding Inf values.

R: is there a way to find Inf / -Inf values?
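One way to run this check is to count the non-finite entries in each numeric column. The following is a minimal base R sketch; the helper name count_non_finite and the toy data frame are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: count NA/NaN/Inf/-Inf entries per numeric column.
count_non_finite <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[numeric_cols], function(col) sum(!is.finite(col)), integer(1))
}

# Toy data (an assumption, not the asker's data set).
toy <- data.frame(
  V1 = c(11, 11, 13),
  V2 = c(Inf, 123, NA),
  V3 = c(4, 4, -Inf)
)

bad_counts <- count_non_finite(toy)
print(bad_counts)
```

Any column with a count above zero is a candidate cause of the error.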

Example

Class V1     V2 V3  V4 V5 V6 V7 V8           V9
    1 11    Inf  4 232 23  2  2 34  0.205567767
    1 11    123  4 232 23  1  2 34  0.162357601
    1 13    123  4 232 23  1  2 34 -0.002739357
    1 13    123  4 232 23  1  2 34  0.186989878
    2 67     14  4 232 67  1  2 34  0.109398677
    2 67     14  4 232 67  2  2 34  0.18491187
    2 67     14  4 232 34  2  2 34  0.098728256
    2 44 769.03  4  21 34  2  2 34  0.204405869
    2 44     34  4  11 34  1  2 34  0.218426408

# When classification was performed, the following error popped up.
rf_model <- randomForest(as.factor(Class) ~ ., data = data, importance = TRUE, proximity = TRUE)
Error in randomForest.default(m, y, ...) :
  NA/NaN/Inf in foreign function call (arg 1)

# When regression was performed, the following error popped up.
rf_model <- randomForest(Class ~ ., data = data, importance = TRUE, proximity = TRUE)
Error in randomForest.default(m, y, ...) :
  NA/NaN/Inf in foreign function call (arg 1)

So check your data very carefully. The regression call additionally produces this warning:

In addition: Warning message:
In randomForest.default(m, y, ...) :
  The response has five or fewer unique values.  Are you sure you want to do regression?

+2

Besides the obvious issue of NA values and the like, this error is almost always caused by character columns in the data set. A way to understand this is to think about what a random forest does. It partitions the data set feature by feature. So if one of the features is a character vector, how would you partition the data set? You need categories to split the data on: how many are "Male", how many are "Female", and so on.

For numeric features such as Age or Price, you can create categories by bucketing: older than a certain age, cheaper than a certain price, etc. You cannot do this with pure character features. Therefore, you need them as factors in your data set.
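The conversion described above might look like the following in base R. This is only a sketch: the people data frame and its column names are invented for illustration.

```r
# Toy data (an assumption): one character column and one numeric column.
people <- data.frame(
  Sex = c("Male", "Female", "Female"),
  Age = c(34, 27, 51),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; leave other columns untouched.
is_char <- vapply(people, is.character, logical(1))
people[is_char] <- lapply(people[is_char], factor)

str(people)
```

After this step, Sex is a factor with levels that a tree can split on, while Age stays numeric.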

+8

In general, there are two main reasons why you get this error message:

1. The data frame contains character vector columns instead of factors. Just convert the character columns to factors.

2. The data contains bad values (NA, NaN or Inf). Applying a random forest to such data also causes this error. Note that head() does not show the bad values in the example below:

x = rep(x = sample(c(0, 1)), times = 21)  # 42 values, matching the length of y
y = c(sample.int(n = 50, size = 40), Inf, Inf)
df = data.frame(col1 = x, col2 = y)
head(df)
  col1 col2
1    1   26
2    0   33
3    1   23
4    0   21
5    1   45
6    0   27

Now applying randomForest to df will result in the same error:

model = randomForest(data = df, col2 ~ col1, ntree = 10)

Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 2)

Solution. Let's identify the bad values in df. The is.finite() function checks, element by element, whether the input vector contains proper finite values. For example:

is.finite(c(5, 6, 1000000, NaN, Inf))
[1] TRUE TRUE TRUE FALSE FALSE

Now let's identify the columns containing the bad values in our data frame and count them.

sum(!is.finite(as.vector(df[, names(df) %in% c("col2")])))
[1] 2
sum(!is.finite(as.vector(df[, names(df) %in% c("col1")])))
[1] 0

Let's drop these records and keep only the good ones:

df1 = df[is.finite(as.vector(df[, names(df) %in% c("col2")])) &
         is.finite(as.vector(df[, names(df) %in% c("col1")])), ]

And run randomForest again:

model1 = randomForest(data = df1, col2 ~ col1, ntree = 10)
Call:
 randomForest(formula = col2 ~ col1, data = df1, ntree = 10)
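The same filtering can be generalized so that no column has to be named by hand. The sketch below uses only base R; keep_finite_rows and the toy data frame are hypothetical names introduced here, not part of the answer.

```r
# Hypothetical helper: keep only rows whose numeric columns are all finite.
keep_finite_rows <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  if (!any(numeric_cols)) {
    return(df)
  }
  # Combine the per-column finiteness checks with a row-wise logical AND.
  ok <- Reduce(`&`, lapply(df[numeric_cols], is.finite))
  df[ok, , drop = FALSE]
}

# Toy data (an assumption): rows 2 and 3 contain bad values.
toy <- data.frame(
  col1 = c(1, 0, 1, 0),
  col2 = c(26, Inf, NaN, 21)
)

clean <- keep_finite_rows(toy)
print(clean)
```

The cleaned data frame can then be passed to randomForest in place of the original one.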

+5

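This is because one of your variables has more than 32 levels (32 was the limit for categorical predictors in older versions of randomForest; newer versions allow up to 53). Levels are the distinct values of a factor variable. Remove this variable and try again.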
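To see which variables would be affected, one can count the levels of every factor column. This is a rough base R sketch: the toy data frame is invented, and the threshold of 32 is simply taken from the answer above, so adjust it if your package version uses a different limit.

```r
# Toy data (an assumption): "id" has 40 levels, "category" has 2.
toy <- data.frame(
  id = factor(paste0("project_", 1:40)),
  category = factor(rep(c("art", "music"), times = 20)),
  goal = seq(1000, 40000, by = 1000)
)

# Threshold taken from the answer above; treat it as an assumption.
max_levels <- 32

# Count levels per factor column and flag the ones above the threshold.
is_fac <- vapply(toy, is.factor, logical(1))
level_counts <- vapply(toy[is_fac], nlevels, integer(1))
too_many <- names(level_counts)[level_counts > max_levels]

print(level_counts)
print(too_many)
```

Columns listed in too_many are usually identifiers or free text, which rarely help a classifier anyway.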

0

You can avoid this error simply by converting all the columns to factors. I ran into this error too: one column in particular had not been converted to a factor. I converted that column explicitly, and my code finally worked.

0