I have a fairly large data set (~1.4 million rows) on which I am doing some grouping and summarizing. The whole thing takes a while to run, and my final application depends on running it frequently, so my thought was to use doMC and the .parallel=TRUE flag with plyr, like this (simplified):
    library(plyr)
    require(doMC)
    registerDoMC()

    df <- ddply(df, c("cat1", "cat2"), summarize,
                count = length(cat2), .parallel = TRUE)
If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves a decent amount off the run time. However, if I let it use all 8 cores, I quickly run out of memory, because each forked process seems to clone the entire data set in memory.
My question is: is there a way to use parallel plyr in a more memory-frugal way? I tried converting my data frame to a big.matrix, but that just seemed to force everything back to using a single core:
    library(plyr)
    library(doMC)
    registerDoMC()
    library(bigmemory)

    bm <- as.big.matrix(df)
    df <- mdply(bm, c("cat1", "cat2"), summarize,
                count = length(cat2), .parallel = TRUE)
This is my first foray into multi-core R computing, so if there is a better way to think about it, I am open to suggestions.
UPDATE: As with many things in life, it turns out I was doing something silly elsewhere in my code, and the whole issue of multiprocessing becomes moot in this particular case. However, for big data-folding tasks I will keep data.table in mind; I was able to replicate my aggregation with it in a straightforward way.
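For anyone with a similar problem, here is a minimal sketch of the same count-per-group aggregation done with data.table (assuming the illustrative cat1/cat2 columns from above). Since data.table does the grouping in C on the single existing copy of the data, it tends to be much easier on memory than forking plyr workers:

    library(data.table)

    dt <- as.data.table(df)
    # .N is the number of rows in each (cat1, cat2) group,
    # computed by data.table's grouping engine in a single process
    res <- dt[, .(count = .N), by = .(cat1, cat2)]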
Peter