Avoid duplicating objects with foreach

I have a very large character vector and I want to do parallel computations on it using foreach and doSNOW. I noticed that foreach makes a copy of the vector for each worker process, quickly exhausting system memory. I tried breaking the vector into smaller pieces in a list object, but I still don't see a reduction in memory usage. Does anyone have any thoughts on this? Below is demo code:

    library(foreach)
    library(doSNOW)
    library(snow)

    x <- rep('some string', 200000000)

    # split x into smaller pieces in a list object
    splits <- getsplits(x, mode = 'bysize', size = 1000000)
    tt <- vector('list', length(splits$start))
    for (i in 1:length(tt))
      tt[[i]] <- x[splits$start[i]:splits$end[i]]

    ret <- foreach(i = 1:length(splits$start), .export = c('somefun'),
                   .combine = c) %dopar%
      somefun(tt[[i]])
1 answer

The iteration style you're using works well with the doMC backend, because the workers can share tt efficiently via fork magic. But with doSNOW, tt is automatically exported to all of the workers, using a lot of memory even though each worker only actually needs a small part of it. The suggestion made by @Beasterfield to iterate directly over tt solves that problem, but it's possible to be even more memory efficient by using iterators and an appropriate parallel backend.
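As a sketch of that suggestion (the cluster size, chunking, and somefun here are all illustrative placeholders, not part of the original code), iterating over the list elements themselves means each task only ships one chunk to a worker:

```r
library(foreach)
library(doSNOW)
library(snow)

cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

# somefun stands in for the real per-chunk worker function
somefun <- function(s) toupper(s)

x <- rep('some string', 1000)
# split() is one simple way to build the list of chunks
tt <- split(x, ceiling(seq_along(x) / 100))

# Iterating over tt itself makes foreach send each worker only the
# chunk it needs, instead of exporting the whole of tt to everyone
ret <- foreach(chunk = tt, .combine = c) %dopar% somefun(chunk)

stopCluster(cl)
```

The downside is that the master still holds all of tt in memory at once, which the iterator approach below avoids.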

In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of subvectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW it will put those subvectors into a list in order to call the clusterApplyLB function in snow, since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends don't do that: they send the subvectors to the workers right from the iterator, using almost half as much memory.
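To see what isplitVector produces on its own (a small illustrative vector; nextElem comes from the iterators package):

```r
library(itertools)
library(iterators)

x <- rep('some string', 10)

# An iterator over subvectors of x, each of length chunkSize
# (the final one may be shorter)
it <- isplitVector(x, chunkSize = 4)

a <- nextElem(it)   # elements 1:4
b <- nextElem(it)   # elements 5:8
c2 <- nextElem(it)  # elements 9:10
```

In a foreach loop the backend calls nextElem for you, so you just write foreach(s = isplitVector(x, chunkSize = chunkSize)).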

Here is a complete example using doMPI :

    suppressMessages(library(doMPI))
    library(itertools)

    cl <- startMPIcluster()
    registerDoMPI(cl)

    n <- 20000000
    chunkSize <- 1000000
    x <- rep('some string', n)
    somefun <- function(s) toupper(s)

    ret <- foreach(s = isplitVector(x, chunkSize = chunkSize),
                   .combine = 'c') %dopar% {
      somefun(s)
    }
    print(length(ret))

    closeCluster(cl)
    mpi.quit()

When I run it on my MacBook Pro with 4 GB of memory:

 $ time mpirun -n 5 R --slave -f split.R 

It takes about 16 seconds.

You should be careful about the number of workers that you start on a machine like this, although decreasing the chunkSize value may allow you to start more.

You can reduce memory usage even more if you can use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file called 'strings.txt', you may be able to use s=ireadLines('strings.txt', n=chunkSize) .
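A sketch of that variant, reusing the doMPI setup from the example above ('strings.txt' and somefun are placeholders for your actual file and worker function):

```r
suppressMessages(library(doMPI))
library(itertools)

cl <- startMPIcluster()
registerDoMPI(cl)

chunkSize <- 1000000
somefun <- function(s) toupper(s)

# ireadLines yields up to chunkSize lines at a time, so the full set
# of strings never has to sit in the master's memory at once
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize),
               .combine = c) %dopar% {
  somefun(s)
}
print(length(ret))

closeCluster(cl)
mpi.quit()
```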
