Parallel random forests with doSMP and foreach significantly increase memory usage (on Windows)

When I run randomForest sequentially it uses about 8 GB of RAM on my system, but run in parallel it uses more than twice that (18 GB). How can I keep memory use down to roughly 8 GB during parallel execution? Here is the code:

install.packages('foreach')
install.packages('doSMP')
install.packages('randomForest')

library('foreach')
library('doSMP')
library('randomForest')

NbrOfCores <- 8
workers <- startWorkers(NbrOfCores)  # number of cores
registerDoSMP(workers)
getDoParName()                       # check name of parallel backend
getDoParVersion()                    # check version of parallel backend
getDoParWorkers()                    # check number of workers

# creating data and setting options for random forests
# if you run this, please adapt it so it won't crash your system!
# this amount of data uses up to 18 GB of RAM
x <- matrix(runif(500000), 100000)
y <- gl(2, 50000)

# options
set.seed(1)
ntree <- 1000
ntree2 <- ntree / NbrOfCores

gc()

# running the serial version of random forests
system.time(
  rf1 <- randomForest(x, y, ntree = ntree))

gc()

# running the parallel version of random forests
system.time(
  rf2 <- foreach(ntree = rep(ntree2, 8), .combine = combine,
                 .packages = "randomForest") %dopar%
    randomForest(x, y, ntree = ntree))
(question score: +7)
3 answers

First of all, SMP will duplicate the input so that each worker process gets its own copy. This could be avoided by using multicore (which forks the parent process on Unix-like systems), but there is another problem: every call to randomForest also makes an internal copy of the input.
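For reference, a minimal sketch of the fork-based alternative, assuming the same x, y, ntree2 and NbrOfCores from the question; doMC/multicore only works on Linux/macOS, not on the Windows setup described above:

library(foreach)
library(doMC)                      # fork-based foreach backend (Unix-like systems only)
library(randomForest)

registerDoMC(NbrOfCores)           # forked workers share the parent's copy of x and y

rf_mc <- foreach(ntree = rep(ntree2, NbrOfCores),
                 .combine = combine,
                 .packages = "randomForest") %dopar%
  randomForest(x, y, ntree = ntree)

Forking avoids the up-front duplication of x and y, but, as noted above, randomForest still makes its own internal copy of the input inside each worker.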

The best you can do to reduce randomForest's memory use is to drop the forest model itself (using keep.forest = FALSE) and do the testing along with the training (using the xtest and possibly ytest arguments).
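A rough sketch of that suggestion (not the exact code from this answer), assuming the x, y, ntree2 and NbrOfCores from the question plus a hypothetical held-out set xte/yte that is not in the original code. Because the forests are discarded, randomForest::combine no longer applies, so the per-worker test votes are aggregated by hand:

vote_list <- foreach(ntree = rep(ntree2, NbrOfCores),
                     .packages = "randomForest") %dopar% {
  rf <- randomForest(x, y,
                     xtest = xte, ytest = yte,   # hypothetical held-out data
                     ntree = ntree,
                     keep.forest = FALSE)        # drop the (large) forest after training
  rf$test$votes                                  # per-class vote fractions on xte
}

# average the vote matrices across workers and take the majority class per test case
avg_votes <- Reduce(`+`, vote_list) / length(vote_list)
pred <- factor(colnames(avg_votes)[max.col(avg_votes)], levels = levels(y))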

(answer score: +3)

Random forest objects can be very large even with moderately sized datasets, so the increase may come from storing the model object itself.
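One quick, if rough, way to check that is to measure the fitted objects directly (rf1 and rf2 are the serial and parallel fits from the question):

print(object.size(rf1), units = "Mb")          # serial fit
print(object.size(rf2), units = "Mb")          # combined parallel fit
print(object.size(rf1$forest), units = "Mb")   # the forest component is usually the dominant part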

Another way to verify this is to compare memory usage across two separate R sessions, one for the serial fit and one for the parallel fit.

Try running another model in parallel that does not produce a large fitted object (for example, lda) and see whether memory use grows in the same way.
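As a minimal sketch of that comparison, assuming the same x, y and registered doSMP backend from the question (MASS::lda produces a much smaller fitted object than randomForest):

library(MASS)

gc()
lda_fits <- foreach(i = 1:NbrOfCores, .packages = "MASS") %dopar%
  lda(x, grouping = y)                           # small fitted objects
gc()

print(object.size(lda_fits[[1]]), units = "Mb") # compare with object.size(rf1)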

(answer score: +1)

I think the following is happening. Because your parent process spawns the child processes, memory is shared, i.e. there is no significant increase in RAM use at that point. However, when the child processes start building their random forests, they create many new intermediate objects that are not in shared memory and are potentially quite large.
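If you want to see how much memory each worker actually ends up allocating, one rough check (assuming the same doSMP setup and data from the question) is to have every worker report its own gc() figures:

worker_mem <- foreach(ntree = rep(ntree2, NbrOfCores),
                      .combine = c,
                      .packages = "randomForest") %dopar% {
  rf <- randomForest(x, y, ntree = ntree)
  sum(gc()[, 2])   # Mb in use inside this worker after it has grown its trees
}
worker_mem          # one figure per worker; compare their total with the serial run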

So my answer is that, unfortunately, it is probably not that simple, at least with the randomForest package, although I would be very interested to hear if someone knows a way around it.

(answer score: 0)
