"foreach" parallel loop returns <NA> s
I am trying to process multiple list items in parallel.
My goal: to run a labeling function on each column based on its values, then return a data frame with the frame name, column name, and processed label.
The workflow works fine with a normal loop. However, when I do the same thing in a foreach loop, the results are returned as <NA> (Note: the following is just an abstraction of the original dataset).
I'm not sure what exactly messed up between them. If you can help me figure this out, that would be awesome :-)
set.seed(12345)
options(stringsAsFactors = F)

# I. Random data generation (Original data is in data frame format)
random.data = list()
random.data[["one"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["two"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["three"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))

# II. Some function applied to each column to label/classify the values
valslabel = function(DataCOlumn) {
  if(mean(DataCOlumn) < 0.5) return("low")
  return("high")
}

# III. Generating the desired output in a regular for loop:
desiredOutput = list()
for(frame.i in seq_along(random.data)) {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0),
                             mappedField = character(0),
                             label = character(0))
  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])
  }
  desiredOutput[[frame.name]] = frame.results
}
print(desiredOutput)
# $one
#   frame.name mappedField label
# 1        one          V1  high
# 2        one          V2  high
# 3        one          V3   low
#
# $two
#   frame.name mappedField label
# 1        two          V1   low
# 2        two          V2  high
# 3        two          V3   low
#
# $three
#   frame.name mappedField label
# 1      three          V1   low
# 2      three          V2  high
# 3      three          V3  high

# IV. Using the "foreach" parallel execution
library(foreach)
library(doParallel)
cl = makeCluster(6)
registerDoParallel(cl)
output = foreach(frame.i = seq_along(random.data), .verbose = T) %dopar% {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0),
                             mappedField = character(0),
                             label = character(0))
  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])
  }
  return(frame.results)
}
print(output)
# [[1]]
#   frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>
#
# [[2]]
#   frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>
#
# [[3]]
#   frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>

Thanks!
The problem is the way you initialize your data frame, combined with the fact that inside the foreach environment the stringsAsFactors option is not set to FALSE. What happens in each foreach iteration looks something like this:
options(stringsAsFactors = TRUE)
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
#Warning message:
#In `[<-.factor`(`*tmp*`, iseq, value = "a") :
#  invalid factor level, NA generated
d
#     x
#1 <NA>

Note that this gives a warning, not an error, so the loop does not stop. If stringsAsFactors is set to FALSE, the problem does not occur in the first place (which is what you had when you were not running in parallel):
options(stringsAsFactors = FALSE)
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
d
#  x
#1 a

In your global environment you had already set options(stringsAsFactors = FALSE), so the sequential loop worked. However, this option is not passed on to the local environment of each parallel worker, so %dopar% runs into the problem above.
You can see this in the output of the following:
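The same mechanism can be sketched without relying on the session-wide option. This is only an illustration under an assumption: stringsAsFactors is passed directly to data.frame() here, because R 4.0.0 and later default to FALSE and the global option may no longer reproduce the behaviour. The names d_factor and d_char are made up for this example.

```r
# Empty column stored as a factor with zero levels (argument passed explicitly
# for illustration; the original example relies on the global option instead)
d_factor <- data.frame(x = character(0), stringsAsFactors = TRUE)

# "a" is not an existing level, so R warns and stores NA instead of failing
suppressWarnings(d_factor[1, "x"] <- "a")

# Empty column stored as plain character: the assignment goes through
d_char <- data.frame(x = character(0), stringsAsFactors = FALSE)
d_char[1, "x"] <- "a"

print(is.factor(d_factor$x))
print(is.na(d_factor$x[1]))
print(d_char$x[1])
```

Because the failure is only a warning, a parallel worker silently carries on and returns the NA-filled frame.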
options(stringsAsFactors = FALSE)
.Options$stringsAsFactors
#[1] FALSE
foreach(i = 1:3) %dopar% .Options$stringsAsFactors
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] TRUE
#
#[[3]]
#[1] TRUE

So the solution is to set the option stringsAsFactors = FALSE inside the foreach.
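Applied to the loop from the question, the fix could look roughly like the sketch below. It uses a smaller stand-in data set and a two-worker cluster (both assumptions chosen to keep the example short), and it additionally passes stringsAsFactors = FALSE to data.frame() as a belt-and-braces measure that is not part of the fix described above.

```r
library(foreach)
library(doParallel)

# Stand-in data with the same shape as in the question (assumed for illustration)
set.seed(12345)
random.data <- list(
  one = as.data.frame(matrix(runif(15), ncol = 3)),
  two = as.data.frame(matrix(runif(15), ncol = 3))
)

valslabel <- function(DataColumn) {
  if (mean(DataColumn) < 0.5) "low" else "high"
}

cl <- makeCluster(2)
registerDoParallel(cl)

output <- foreach(frame.i = seq_along(random.data)) %dopar% {
  # Set the option on the worker itself; the master's options are not copied over
  options(stringsAsFactors = FALSE)

  frame <- random.data[[frame.i]]
  frame.name <- names(random.data)[frame.i]
  frame.results <- data.frame(frame.name = character(0),
                              mappedField = character(0),
                              label = character(0),
                              stringsAsFactors = FALSE)  # explicit, in addition to the option
  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "frame.name"] <- frame.name
    frame.results[col.i, "mappedField"] <- colnames(frame)[col.i]
    frame.results[col.i, "label"] <- valslabel(frame[, col.i])
  }
  frame.results
}

stopCluster(cl)
print(output)
```

Stopping the cluster at the end releases the worker processes; the question's code leaves them running.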
As an aside, it is much better to build your data frame from whole column vectors where possible, rather than row by row. In your example, you can replace
frame.results = data.frame(frame.name = character(0),
                           mappedField = character(0),
                           label = character(0))
for(col.i in 1:ncol(frame)) {
  frame.results[col.i, "frame.name"] = frame.name
  frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
  frame.results[col.i, "label"] = valslabel(frame[,col.i])
}

with
frame.results <- data.frame(
  frame.name = frame.name,
  mappedField = colnames(frame),
  label = valslabel1(colMeans(frame)))

where the valslabel function has been replaced with a vectorized version:
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}
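Putting the vectorized pieces together for a single frame might look like the following sketch. The stand-in frame and the two extra arguments to data.frame() are assumptions added here: stringsAsFactors = FALSE keeps the columns as character regardless of the session option, and row.names = NULL is meant to keep the names carried by colMeans() from ending up as row names.

```r
# Stand-in for one element of random.data (assumed for illustration)
set.seed(12345)
frame <- as.data.frame(matrix(runif(15), ncol = 3))
frame.name <- "one"

# Vectorized labeler: takes a vector of column means, returns one label per mean
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}

# Build all rows at once instead of growing the data frame row by row
frame.results <- data.frame(
  frame.name = frame.name,
  mappedField = colnames(frame),
  label = valslabel1(colMeans(frame)),
  stringsAsFactors = FALSE,
  row.names = NULL
)

print(frame.results)
```

Inside the foreach body this block would replace both the empty data.frame() call and the inner for loop.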