I have a fairly large data set (~1.4 million rows) on which I am doing some grouping and summarizing. The whole thing takes a while to run, and my final application depends on running it frequently, so my thought was to use doMC and the .parallel=TRUE flag with plyr, like this (simplified):
    library(plyr)
    require(doMC)
    registerDoMC()

    df <- ddply(df, c("cat1", "cat2"), summarize,
                count = length(cat2), .parallel = TRUE)
If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves a decent amount off the run time. However, if I let it use all 8 cores, I quickly run out of memory, because each forked process seems to clone the entire data set in memory.
My question is: is there a way to use parallel plyr in a more memory-frugal way? I tried converting my data frame to a big.matrix, but that just seemed to force everything back to using a single core:
    library(plyr)
    library(doMC)
    registerDoMC()
    library(bigmemory)

    bm <- as.big.matrix(df)
    df <- mdply(bm, c("cat1", "cat2"), summarize,
                count = length(cat2), .parallel = TRUE)
This is my first foray into multi-core R computing, so if there is a better way to think about it, I am open to suggestions.
UPDATE: As with many things in life, it turns out I was doing something silly elsewhere in my code, and the whole issue of multiprocessing becomes moot in this particular case. However, for big data-folding tasks I will keep data.table in mind; I was able to replicate my aggregation with it in a straightforward way.
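For anyone with a similar problem, here is a minimal sketch of the same count-per-group aggregation done with data.table (assuming the illustrative cat1/cat2 columns from above). Since data.table does the grouping in C on the single existing copy of the data, it tends to be much easier on memory than forking plyr workers:

    library(data.table)

    dt <- as.data.table(df)
    # .N is the number of rows in each (cat1, cat2) group,
    # computed by data.table's grouping engine in a single process
    res <- dt[, .(count = .N), by = .(cat1, cat2)]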
Peter