I am trying to train several random forests (for regression) and have them compete, to see which choice of features and which parameters give the best model.
However, the training runs take an insane amount of time, and I wonder whether I'm doing something wrong.
The dataset I use for training (called train below) has 217k rows and 58 columns, of which only 21 serve as predictors in the random forest. The predictors are all numeric or integer, except for one boolean that is stored as class character. The output y is numeric.
I ran the following code four times, with nb_trees set to 4, 100, 500, and 2000:
    library("randomForest")
    nb_trees <-
Here is how long each run took to train:
nb_trees | time
4 | 4 min
100 | 1 h 41 min
500 | 8 h 40 min
2000 | 34 h 26 min
As my company server has 12 cores and 125 GB of RAM, I figured I could parallelize the training by following this answer. (I used doParallel instead, though, because the doSNOW version seemed to run forever and I don't know why. I can't find where I read that doParallel would work too, sorry.)
    library("randomForest")
    library("foreach")
    library("doParallel")

    nb_trees <- # this changes with each test, see table below
    nb_cores <- # this changes with each test, see table below

    cl <- makeCluster(nb_cores)
    registerDoParallel(cl)

    ptm <- proc.time()
    fit <- foreach(ntree = rep(nb_trees, nb_cores), .combine = combine,
                   .packages = "randomForest") %dopar% {
      randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
                     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21,
                   data = train, ntree = ntree, do.trace = TRUE)
    }
    proc.time() - ptm

    stopCluster(cl)
When I run it, it takes less time than the non-parallel code:
nb_trees | nb_cores | total number of trees | time
1 | 4 | 4 | 2 min 13 s
10 | 10 | 100 | 52 min
9 | 12 | 108 (closest to 100 with 12 cores) | 59 min
42 | 12 | 504 (closest to 500 with 12 cores) | I won't be running this one
167 | 12 | 2004 (closest to 2000 with 12 cores) | I'll run it next weekend
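The totals in this table are only "closest to" the targets because every worker receives the same ntree. A minimal sketch, assuming uneven per-worker counts are acceptable, of splitting an exact total across workers. The helper name split_trees is hypothetical and not part of randomForest or foreach; it uses base R only.

```r
# Hypothetical helper: split `total` trees across `workers` as evenly as
# possible, so that the per-worker counts sum exactly to `total`.
split_trees <- function(total, workers) {
  base  <- total %/% workers                       # trees every worker gets
  extra <- total %% workers                        # leftover trees to hand out
  counts <- base + as.integer(seq_len(workers) <= extra)
  counts[counts > 0]                               # drop workers with no trees
}

ntree_per_core <- split_trees(100, 12)
print(ntree_per_core)        # four workers get 9 trees, eight workers get 8
print(sum(ntree_per_core))   # 100
```

Under that assumption, the vector could be passed as `ntree = split_trees(100, 12)` in the foreach call in place of `rep(nb_trees, nb_cores)`.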
It still seems like a long time, though, doesn't it? I know that combining the trees into a final forest takes time, so I did not expect a 12x speedup with 12 cores, but it is only about 2 times faster.
- Is this normal?
- If it isn't, is there anything I can do with my data and/or my code to drastically reduce the running time?
- If not, should I tell the guy in charge of the server that it should be much faster?
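For reference, the "about 2 times faster" figure comes from comparing the rows of the two timing tables that have the same total number of trees. A small sketch of that arithmetic, using only the timings reported in this post; the variable names are illustrative.

```r
# Timings copied from the two tables in this post, converted to minutes.
serial_min   <- c(trees_4 = 4,           trees_100 = 1 * 60 + 41)
parallel_min <- c(trees_4 = 2 + 13 / 60, trees_100 = 52)
cores        <- c(trees_4 = 4,           trees_100 = 10)

speedup    <- serial_min / parallel_min   # how many times faster
efficiency <- speedup / cores             # fraction of ideal linear scaling

print(round(data.frame(cores, speedup, efficiency), 2))
```

This gives a speedup of roughly 1.8 on 4 cores and 1.9 on 10 cores, so the efficiency drops from about 0.45 to about 0.19 as cores are added.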
Thank you for your responses.
Notes:
- I'm the only one using this server
- For my next tests, I will drop the columns that are not used in the random forest.
- I realized quite late that I could improve the running time by calling randomForest(predictors, decision) instead of randomForest(decision ~ ., data = input). I will do that from now on, but I think my questions above still stand.
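A minimal sketch of the last two notes, run on synthetic data because the real train is not shown here. The column names, the predictor_cols vector, and the small ntree are assumptions made for illustration. It keeps only the predictor columns and passes them through the x/y interface of randomForest() rather than a formula.

```r
library(randomForest)

# Synthetic stand-in for `train` (assumption: not the real data).
set.seed(1)
n <- 500
train <- data.frame(
  id   = seq_len(n),                                     # not a predictor
  x1   = rnorm(n),
  x2   = runif(n),
  flag = sample(c("TRUE", "FALSE"), n, replace = TRUE),  # boolean stored as character
  stringsAsFactors = FALSE
)
train$y <- 2 * train$x1 - train$x2 + rnorm(n, sd = 0.1)

# Keep only the columns used as predictors (hypothetical names).
predictor_cols <- c("x1", "x2", "flag")
x <- train[, predictor_cols]

# Assumption about the desired handling: store the boolean as a factor
# so that it is treated as a categorical predictor.
x$flag <- as.factor(x$flag)
y <- train$y

# x/y interface instead of randomForest(y ~ ., data = train)
fit <- randomForest(x = x, y = y, ntree = 50)
print(fit)
```

Whether this saves a meaningful amount of time on 217k rows would need to be measured, for example with proc.time() as in the code above.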