Can readLines run in parallel in R?

Is it possible to read one text file in parallel on a multicore machine with R? For context, the text file is somewhere between 250 and 400 MB of JSON output.

EDIT:

Here is some sample code I played with. To my surprise, parallel processing did not win; the basic lapply was fastest, though this may be due to user error on my part. In addition, when trying to read several large files, my machine choked.

    ## test on first 100 rows of 1 twitter file
    library(rjson)
    library(parallel)
    library(plyr)
    library(foreach)
    library(rbenchmark)

    N <- 100
    mc.cores <- detectCores()

    benchmark(lapply(readLines(FILE, n = N, warn = FALSE), fromJSON),
              llply(readLines(FILE, n = N, warn = FALSE), fromJSON),
              mclapply(readLines(FILE, n = N, warn = FALSE), fromJSON),
              mclapply(readLines(FILE, n = N, warn = FALSE), fromJSON,
                       mc.cores = mc.cores),
              foreach(x = readLines(FILE, n = N, warn = FALSE)) %do% fromJSON(x),
              replications = 100)

Here is the second code example, which parses each tweet and saves it to disk:

    parseData <- function(x) {
      x <- tryCatch(fromJSON(x), error = function(e) return(list()))
      ## need to test whether the record is valid; if so, save out the file
      if (!is.null(x$id_str)) {
        x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
        fname <- paste("rdata/",
                       format(x$created_at, "%m"),
                       format(x$created_at, "%d"),
                       format(x$created_at, "%Y"),
                       "_", x$id_str, sep = "")
        saveRDS(x, fname)
        rm(x, fname)
        gc(verbose = FALSE)
      }
    }

    t3 <- system.time(lapply(readLines(FILES[1], n = -1, warn = FALSE), parseData))
2 answers

The answer depends on what the problem really is: reading the file in parallel or processing the file in parallel.

Parallel reading

You can split the JSON file into several input files and read them in parallel, e.g. using plyr functions in combination with a parallel backend:

 result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE) 

Backend registration can be done with the parallel package, which is now integrated into base R. Alternatively you can use the doSNOW package; see this blog post for details.
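
To make that concrete, here is a minimal sketch of the registration step, assuming the doParallel package as the backend; the readJSON() helper and the file pattern are illustrative assumptions, not part of the original code:

    ## Hypothetical sketch: register a parallel backend, then let plyr
    ## fan the per-file reads out across workers.
    library(plyr)
    library(rjson)
    library(doParallel)

    ## Assumed helper: parse one line-delimited JSON file into a data frame.
    ## (Real tweets are nested, so a proper flattener would be needed here.)
    readJSON <- function(file) {
      recs <- lapply(readLines(file, warn = FALSE), fromJSON)
      ldply(recs, as.data.frame)
    }

    cl <- makeCluster(detectCores())
    registerDoParallel(cl)

    result <- ldply(list.files(pattern = "\\.json$"), readJSON,
                    .parallel = TRUE,
                    .paropts  = list(.packages = c("rjson", "plyr")))

    stopCluster(cl)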

Parallel processing

In this case, it is best to read the entire data set into a character vector, split the data, and then use a parallel backend in combination with, for example, plyr, as sketched below.
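
A minimal sketch of that pattern, assuming a single file tweets.json (the filename is an assumption) with one JSON record per line:

    ## Sequential I/O, parallel parsing. Note: mclapply forks, so on
    ## Windows it runs serially; use parLapply with a cluster instead.
    library(rjson)
    library(parallel)

    lines  <- readLines("tweets.json", warn = FALSE)   # read everything first
    parsed <- mclapply(lines, fromJSON, mc.cores = detectCores())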


Probably not with readLines(), because file-system IO is not parallel by nature. Of course, if you use a parallel NFS or something like HDFS, this restriction will not apply. But assuming you are on a "standard" architecture, it is not feasible to parallelize your readLines() calls.

It would be best to read in the entire file, which at maybe 500 MB should fit in memory, and then parallelize the processing once you have already read the object.
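
One way to do that while keeping overhead low is to dispatch a few large chunks rather than one task per line; a hypothetical sketch (filename assumed), which may also explain why the per-line mclapply benchmark above showed no speedup:

    ## Chunked parallel parsing: a handful of big tasks instead of one
    ## tiny task per line, so scheduling overhead does not dominate.
    library(rjson)
    library(parallel)

    lines   <- readLines("tweets.json", warn = FALSE)
    n.cores <- detectCores()
    chunks  <- split(lines, cut(seq_along(lines), n.cores, labels = FALSE))

    parsed <- mclapply(chunks, function(ch) lapply(ch, fromJSON),
                       mc.cores = n.cores)
    parsed <- unlist(parsed, recursive = FALSE)   # flatten to one list of records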

