How can I speed up the preparation of my random forest?

I am trying to train several random forests (for regression) so that they compete, to see which choice of features and which parameters give the best model.

However, the training seems to take an insanely long time, and I wonder if I'm doing something wrong.

The dataset I use for training (called train below) has 217k rows and 58 columns, of which only 21 are used as predictors in the random forest. They are all numeric or integer, except for one boolean that is stored as character. The output y is numeric.

I executed the following code four times, setting nb_trees to 4, 100, 500, and 2000:

 library("randomForest") nb_trees <- #this changes with each test, see above ptm <- proc.time() fit <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21, data = train, ntree = nb_trees, do.trace=TRUE) proc.time() - ptm 

Here is how long each of them trained:

    nb_trees | time
    ---------+------------
           4 | 4 min
         100 | 1 h 41 min
         500 | 8 h 40 min
        2000 | 34 h 26 min

As my company's server has 12 cores and 125 GB of RAM, I decided to try parallelizing the training by following this answer (however, I used doParallel because it seemed to run forever with doSNOW, I don't know why. And I can't find where I read that doParallel would work too, sorry).

 library("randomForest") library("foreach") library("doParallel") nb_trees <- #this changes with each test, see table below nb_cores <- #this changes with each test, see table below cl <- makeCluster(nb_cores) registerDoParallel(cl) ptm <- proc.time() fit <- foreach(ntree = rep(nb_trees, nb_cores), .combine = combine, .packages = "randomForest") %dopar% { randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21, data = train, ntree = ntree, do.trace=TRUE)} proc.time() - ptm stopCluster(cl) 

When I run it, it takes less time than the non-parallelized code:

    nb_trees | nb_cores | total number of trees                | time
    ---------+----------+--------------------------------------+-----------------------------
           1 |        4 |    4                                 | 2 min 13 s
          10 |       10 |  100                                 | 52 min
           9 |       12 |  108 (closest to 100 with 12 cores)  | 59 min
          42 |       12 |  504 (closest to 500 with 12 cores)  | I won't be running this one
         167 |       12 | 2004 (closest to 2000 with 12 cores) | I'll run it next weekend

However, I think it still takes a long time, doesn't it? I know it takes time to combine the trees into the final forest, so I did not expect it to be 12 times faster with 12 cores, but it is only ~2 times faster...

  • Is this normal?
  • If it isn't, is there anything I can do with my data and/or my code to drastically reduce the running time?
  • If not, should I tell the person in charge of the server that the training should be much faster?

Thank you for your responses.

Notes:

  • I'm the only one using this server
  • for my next tests, I will get rid of the columns that are not used in the random forest.
  • I realized quite late that I could improve the execution time by calling randomForest(predictors, decision) instead of randomForest(decision ~ ., data = input), and I will do that from now on, but I think my questions above still stand.
2 answers

Although I am a fan of brute-force methods such as parallelization or running code for a very long time, I am an even bigger fan of improving an algorithm to avoid having to use brute force at all.

While training your random forest with 2000 trees was getting prohibitively expensive, training with a smaller number of trees took a more reasonable time. For starters, you can train with, say, 4, 8, 16, 32, ..., 256, 512 trees and carefully observe the metrics that tell you how robust the model is. These metrics include things like the fit against the best constant model (how well your forest performs on the dataset versus a model that predicts the median for all inputs), as well as the out-of-bag (OOB) error. In addition, you can observe the top predictors and their importance, and whether you start to see convergence there as you add more trees.
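As an illustration, here is a minimal sketch of that idea (not the asker's code; it assumes the same train data frame with response y and predictors x1 to x21). A single larger run already records the OOB mean squared error after each tree, so you can see where the error curve flattens:

    library("randomForest")
    # Assumed predictor columns; the character-typed boolean would need
    # converting with as.factor() first.
    preds <- train[, paste0("x", 1:21)]
    fit <- randomForest(x = preds, y = train$y, ntree = 512)
    # For regression, fit$mse holds the out-of-bag MSE after each successive
    # tree, so one 512-tree run shows the whole 4-to-512 convergence curve.
    plot(fit)          # OOB MSE as a function of the number of trees
    tail(fit$mse, 1)   # OOB MSE of the full 512-tree forest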

Ideally, you should not need to use thousands of trees to build the model. Once your model starts to converge, adding more trees won't necessarily worsen the model, but it won't add new information either. By avoiding too many trees, you may be able to cut a calculation that would have taken a week down to less than a day. If, on top of this, you use a dozen CPU cores, you might be looking at something on the order of hours.

To view the variable importance after each random forest run, you can try something along the following lines:

    fit <- randomForest(...)
    round(importance(fit), 2)

It is my understanding that, generally, the top 5-10 predictors have the greatest impact on the model. If you notice that, as you increase the number of trees, these top predictors don't change position relative to each other, and the importance metrics stay the same, then you might consider not using as many trees.
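One way to check that stability (an illustrative sketch, reusing the assumed preds and train$y from the earlier snippet): fit two forests of different sizes with importance = TRUE and compare the rankings of the top predictors.

    fit_128 <- randomForest(x = preds, y = train$y, ntree = 128, importance = TRUE)
    fit_256 <- randomForest(x = preds, y = train$y, ntree = 256, importance = TRUE)
    # For regression, the "%IncMSE" column holds the permutation importance.
    imp_128 <- importance(fit_128)
    imp_256 <- importance(fit_256)
    head(rownames(imp_128)[order(-imp_128[, "%IncMSE"])], 10)
    head(rownames(imp_256)[order(-imp_256[, "%IncMSE"])], 10)
    # If the two top-10 lists agree, more trees are unlikely to help.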


The randomForest() function can receive data through either the "formula interface" or the "matrix interface". The matrix interface is known to deliver much better performance.

Formula Interface:

    rf.formula = randomForest(Species ~ ., data = iris)

Matrix Interface:

    rf.matrix = randomForest(y = iris[, 5], x = iris[, 1:4])
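To compare the two for yourself, here is an illustrative check (the timings depend on your data and hardware):

    # Wall-clock comparison of the two interfaces on the same data.
    system.time(randomForest(Species ~ ., data = iris, ntree = 500))
    system.time(randomForest(y = iris[, 5], x = iris[, 1:4], ntree = 500))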
