Is it possible to read one text file in parallel on a multicore machine with R? For context, the text file is somewhere between 250 and 400 MB of JSON output.
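Conceptually, what I have in mind is something like the sketch below: read the lines sequentially, then hand the per-line JSON parsing to the cores. FILE is a placeholder path, and I'm assuming one JSON object (tweet) per line.

    library(rjson)
    library(parallel)

    ## placeholder path; assumes one JSON object (tweet) per line
    FILE <- "tweets.json"

    ## read the file sequentially, then parse the lines in parallel across cores
    lines  <- readLines(FILE, warn=FALSE)
    parsed <- mclapply(lines, fromJSON, mc.cores=detectCores())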
EDIT:
Here is some sample code I played with. To my surprise, parallel processing did not beat the basic approach, though this may be due to user error on my part. In addition, while trying to read several large files, my machine choked.
    ## test on first 100 rows of 1 twitter file
    library(rjson)
    library(parallel)
    library(foreach)
    library(plyr)
    library(rbenchmark)

    N <- 100
    mc.cores <- detectCores()

    benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
              llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
              mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
              mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON, mc.cores=mc.cores),
              foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
              replications=100)
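One thing I suspect (but have not verified) is that calling mclapply once per line spends more time on fork/serialization overhead than on the parsing itself. A chunked variant like the hypothetical sketch below, with one task per core rather than one per line, might behave differently:

    ## hypothetical chunked variant: one task per core rather than one task per line,
    ## so the parallel overhead is paid once per chunk instead of once per tweet
    lines  <- readLines(FILE, n=N, warn=FALSE)
    chunks <- split(lines, cut(seq_along(lines), mc.cores, labels=FALSE))
    parsed <- unlist(mclapply(chunks, function(chunk) lapply(chunk, fromJSON),
                              mc.cores=mc.cores),
                     recursive=FALSE)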
Here is the second code example, which parses each line and saves out the valid tweets as RDS files:
    parseData <- function(x) {
      x <- tryCatch(fromJSON(x), error=function(e) return(list()))
      ## need to do a test to see if valid data; if so, save out the files
      if (!is.null(x$id_str)) {
        x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
        fname <- paste("rdata/",
                       format(x$created_at, "%m"),
                       format(x$created_at, "%d"),
                       format(x$created_at, "%Y"),
                       "_", x$id_str, sep="")
        saveRDS(x, fname)
        rm(x, fname)
        gc(verbose=FALSE)
      }
    }

    t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
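For completeness, the parallel counterpart I would try next is below. It is only a sketch: it assumes the rdata/ directory exists and that the per-tweet parsing and saving, rather than reading the file, is the slow part.

    ## hypothetical parallel run of parseData: readLines stays sequential,
    ## but the JSON parsing and saveRDS calls are spread across the cores
    t4 <- system.time(
      mclapply(readLines(FILES[1], n=-1, warn=FALSE), parseData,
               mc.cores=detectCores())
    )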