I am trying to use Amazon Elastic MapReduce (EMR) to run a series of simulations over several million cases. This is an Rscript streaming job with no reducer, so I pass the identity reducer in my EMR call: --reducer org.apache.hadoop.mapred.lib.IdentityReducer.
The script works fine when I test it locally from the command line on a Linux box by passing a single line manually, echo "1,2443,2442,1,5" | ./mapper.R, and I get the one line of results that I expect. However, when I ran the simulation on EMR with an input file of about 10,000 cases (lines), I got output for only a dozen or so of the 10k input lines. I have tried several times and cannot understand why. The Hadoop job completes without errors. It seems as though input lines are being skipped, or perhaps something is happening with the identity reducer. The results are correct for the cases that do produce output.
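Hadoop streaming feeds each map task many input lines on stdin, so a one-line echo test does not exercise that case. A minimal sketch of a local check that pipes a small multi-line sample through the mapper and compares line counts; check_mapper is a hypothetical helper written for this illustration (not part of EMR or Hadoop), and running it against mapper.R assumes the script is executable and that ./inputdata exists locally:

```shell
#!/bin/sh
# Hypothetical helper: pipe a sample file through a mapper command and
# report how many lines went in and how many lines came out.
check_mapper() {
    mapper="$1"    # command to test, e.g. ./mapper.R
    sample="$2"    # file holding a few input lines
    in_lines=$(wc -l < "$sample" | tr -d ' ')
    out_lines=$($mapper < "$sample" | wc -l | tr -d ' ')
    echo "input=$in_lines output=$out_lines"
}

# Build a three-line sample in the input format shown in the question.
sample=$(mktemp)
printf '%s\n' '1,2443,2442,1,5' '2,2743,4712,99,8' '3,2443,861,282,3177' > "$sample"

# Usage (assumption: mapper.R and ./inputdata are in the current directory):
# check_mapper ./mapper.R "$sample"
```

If the reported output count is lower than the input count, the mapper is not consuming every line it receives on stdin, which would point at the script rather than at the identity reducer.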
My input file is a csv with the following data format, a series of five integers separated by commas:
1,2443,2442,1,5
2,2743,4712,99,8
3,2443,861,282,3177
etc...
Here is my R script for mapper.R
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
# function to read in the relevant data from needed data files
get.data <- function(casename) {
    list <- lapply(casename, function(x) {
        read.csv(file = paste("./inputdata/", x, ".csv", sep = ""),
                 header = TRUE,
                 stringsAsFactors = FALSE)})
    return(data.frame(list))
}
con <- file("stdin")
line <- readLines(con, n = 1, warn = FALSE)
line <- trimWhiteSpace(line)
values <- unlist(strsplit(line, ","))
lv <- length(values)
cases <- as.numeric(values[2:lv])
simid <- paste("sim", values[1], ":", sep = "")
l <- length(cases)
names.vector <- paste("case", cases, sep = ".")
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)
d <- cbind(metadata[,1:3], get.data(names.vector))
cat(output, "\n")
close(con)
Here is my EMR call:
ruby elastic-mapreduce --create --stream --input s3n://bucket/project/input.txt --output s3n://bucket/project/output --mapper s3n://bucket/project/mapper.R --reducer org.apache.hadoop.mapred.lib.IdentityReducer --cache-archive s3n://bucket/project/inputdata.tar.gz#inputdata --name Simulation --num-instances 2
- , , , / R script.
- script R, . , EMR. JD Long's R/EMR script.