The iteration style you're using works well with the doMC backend because the workers can efficiently share tt thanks to fork magic. But with doSNOW, tt will be automatically exported to the workers, using a lot of memory even though each worker only needs a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt solves that problem, but it's possible to be even more memory efficient by using iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of subvectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW the subvectors are all put into a list in order to call snow's clusterApplyLB function, since clusterApplyLB doesn't support iterators. However, doMPI and doRedis don't do this: they send the subvectors to the workers just-in-time, directly from the iterator, using almost half as much memory.
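As a small, standalone illustration of what isplitVector yields (this snippet is just for demonstration and is separate from the doMPI example below; the use of letters as the input vector is purely illustrative):

library(itertools)   # isplitVector; itertools attaches iterators, which provides nextElem

it <- isplitVector(letters, chunkSize=10)
nextElem(it)   # first subvector
nextElem(it)   # next subvector; pieces are sized as evenly as possible, never larger than chunkSize
# iteration ends with a StopIteration condition once the vector is exhausted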
Here is a complete example using doMPI :
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory using:

$ time mpirun -n 5 R --slave -f split.R

it takes about 16 seconds.
You do have to be careful about how many workers you start on the same machine, since each one needs memory for its own chunk, although decreasing the value of chunkSize may allow you to start more.
You can reduce memory usage even further if you can use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you may be able to use s=ireadLines('strings.txt', n=chunkSize), as sketched below.
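Here is a rough sketch of that file-based variant, assuming a file named 'strings.txt' exists and reusing the somefun and chunkSize values from the example above (the file name is purely illustrative):

suppressMessages(library(doMPI))
library(itertools)   # itertools attaches iterators, which provides ireadLines

cl <- startMPIcluster()
registerDoMPI(cl)

chunkSize <- 1000000
somefun <- function(s) toupper(s)

# ireadLines reads at most chunkSize lines per iteration, so the full set of
# strings never has to be held in memory on the master all at once
ret <- foreach(s=ireadLines('strings.txt', n=chunkSize), .combine='c') %dopar% {
  somefun(s)
}

print(length(ret))
closeCluster(cl)
mpi.quit()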
Steve Weston