To perform this calculation efficiently in parallel, you need to use chunking, since the individual mean computations don't take much time on their own. When using foreach, I often use functions from the itertools package for chunking. In this case, I use the isplitVector function to create one task per worker. The results are vectors, so they are combined by simply adding them together, which is why the vector r must be initialized to a vector of zeros.
library(foreach)
library(itertools)

# Combine function: element-wise sum of any number of vectors
vadd <- function(a, ...) {
  for (v in list(...)) a <- a + v
  a
}

res <- foreach(ids=isplitVector(unique(td$id), chunks=workers),
               .combine='vadd', .multicombine=TRUE, .inorder=FALSE) %dopar% {
  r <- rep(0, NROW(td))
  for (i in ids) r[td$id == i] <- mean(td$val[td$id != i])
  r
}
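Note that the code above assumes a foreach parallel backend has already been registered and that workers holds the number of workers; neither appears in the snippet itself. A minimal setup sketch, assuming the doParallel backend (other backends such as doMC or doSNOW would work just as well):

library(doParallel)

workers <- 4                 # assumed worker count; adjust to your machine
cl <- makeCluster(workers)   # create a local cluster of that size
registerDoParallel(cl)       # register it as the %dopar% backend
# ... run the foreach loop above ...
stopCluster(cl)              # shut the workers down when finished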
This is a classic example of wrapping the original sequential code in a foreach loop so that each task works on only a subset of the data. Since there is only one result per worker, there is very little post-processing, so it performs quite efficiently.
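For reference, the sequential version being parallelized here is not shown explicitly in the text; reconstructed from the loop body above, it computes, for each id, the mean of val over all the other ids:

# Sequential equivalent of the parallel loop body, run over every id at once
res <- rep(0, NROW(td))
for (i in unique(td$id)) res[td$id == i] <- mean(td$val[td$id != i])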
To see how well it performs, I compared it with the sequential version and with Roland's data.table version using the following data set:
set.seed(107)
n <- 1000000
m <- 10000
td <- data.frame(val=rnorm(n), id=sample(m, n, replace=TRUE))
I include this because performance is very data dependent; you can even get different results using a different random seed.
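Roland's data.table version is not reproduced here, but a sequential data.table approach to the same leave-one-group-out mean, written to mirror the grouped expression used in the parallel data.table version later in this post, might look roughly like this:

library(data.table)

td <- as.data.table(td)
# For each id group, the mean of val over all rows outside that group
td[, means := mean(td$val[-.I]), by=id]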
Here are some test results on my Linux box with a Xeon X5650 processor and 12 GB of RAM:
So, for at least one data set, it is worthwhile to perform this calculation in parallel. It's not a perfect speedup, but it's not bad. To run any of these tests on your own computer, or with a different data set, you can download them from pastebin using the links above.
Update
After working on these tests, I was curious whether using data.table with foreach could produce an even faster version. Here is what I came up with (with some advice from Matthew Dowle):
cmean <- function(v, mine) if (mine) mean(v) else 0

nuniq <- length(unique(td$id))
res <- foreach(grps=isplitIndices(nuniq, chunks=workers),
               .combine='vadd', .multicombine=TRUE, .inorder=FALSE,
               .packages='data.table') %dopar% {
  td[, means := cmean(td$val[-.I], .GRP %in% grps), by=id]
  td$means
}
td is now a data.table object. I used isplitIndices from the itertools package to generate the vectors of group numbers associated with each task. The cmean function is a wrapper around mean that returns zero for the groups that should not be computed in a given task. It uses the same combine function as the non-data.table version, since the task results have the same form.
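As an illustration (the values here are only an example, not taken from the benchmark), isplitIndices splits the sequence 1:n into roughly equal chunks, returning an iterator with one chunk per task:

library(itertools)
library(iterators)   # provides nextElem for stepping through the iterator

it <- isplitIndices(10, chunks=3)
nextElem(it)   # e.g. 1 2 3 4
nextElem(it)   # e.g. 5 6 7
nextElem(it)   # e.g. 8 9 10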
With four workers and the same data set, this version ran in 56.4 seconds, a speedup of 3.7 over the sequential data.table version, making it the clear winner and 6.4 times faster than the sequential for loop. This test can be downloaded from pastebin here.