How to specify MapReduce configuration and Java options with a custom JAR in the CLI using Amazon EMR?

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc. when starting a streaming job using a custom JAR.

When using an external scripting language such as Ruby or Python, we can specify these configurations at startup as follows:

ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output

I tried the following methods, but none of them worked:

  • ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

  • ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -jobconf,mapred.min.split.size=52880 -jobconf,mapred.task.timeout=0

I would also like to learn how to pass Java options to a streaming job using a custom JAR in EMR. When running locally on Hadoop, we can pass them as follows:

bin/hadoop jar job.jar input_path output_path -D <some_java_parameter>=<some_value>

2 answers

I believe that if you want to set these on a per-job basis, then you need to:

A) For custom JARs, pass them to your JAR as arguments and process them yourself. I believe this can be automated as follows:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GenericOptionsParser pulls out generic options such as -D key=value,
        // applies them to conf, and returns the remaining application arguments.
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        // ....
    }

Then create the job like this (I have not checked whether this works):

    > elastic-mapreduce --jar s3://mybucket/mycode.jar \
        --args "-D,mapred.reduce.tasks=0" \
        --arg s3://mybucket/input \
        --arg s3://mybucket/output

GenericOptionsParser should automatically pick up the -D and -jobconf options and transfer them into the Hadoop job configuration. More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html
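For completeness, Hadoop also ships a ToolRunner utility that invokes GenericOptionsParser for you. Here is a minimal sketch of that pattern (the class name MyJob and the job name are illustrative, not from the original post):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already reflects any -D key=value options from the command line
            Job job = new Job(getConf(), "myjob");
            job.setJarByClass(MyJob.class);
            // ... configure mapper/reducer and input/output paths from args ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips the generic options and applies them to the Configuration
            System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
        }
    }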

B) For streaming jobs, you also just pass the configuration change on the command line:

    > elastic-mapreduce --jobflow j-ABABABABA \
        --stream --jobconf mapred.task.timeout=600000 \
        --mapper s3://mybucket/mymapper.sh \
        --reducer s3://mybucket/myreducer.sh \
        --input s3://mybucket/input \
        --output s3://mybucket/output \
        --jobconf mapred.reduce.tasks=0

More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help


In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:

Bootstrap actions let you pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]

The section Running custom bootstrap actions from the CLI provides a usage example:

    $ ./elastic-mapreduce --create --stream --alive \
        --input s3n://elasticmapreduce/samples/wordcount/input \
        --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
        --output s3n://myawsbucket \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh

In particular, there are dedicated bootstrap actions for configuring Hadoop and Java:

Hadoop (cluster)

You can specify Hadoop settings via the bootstrap action Configure Hadoop, which lets you set Hadoop settings for the entire cluster, for example:

    $ ./elastic-mapreduce --create \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"
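For reference, the file passed via --site-config-file uses the standard Hadoop site-configuration XML format. A sketch of what s3://myawsbucket/config.xml might contain, reusing the settings from the question:

    <?xml version="1.0"?>
    <configuration>
      <!-- values mirror the settings asked about in the question -->
      <property>
        <name>mapred.task.timeout</name>
        <value>0</value>
      </property>
      <property>
        <name>mapred.min.split.size</name>
        <value>52880</value>
      </property>
    </configuration>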

Java (JVM)

You can specify custom JVM settings via the bootstrap action Configure Daemons:

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use it to modify advanced JVM options, such as garbage collection behavior.

The following example sets the heap size to 2048 and configures a Java option for the namenode:

    $ ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
        --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
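Multiple bootstrap actions can be specified on one cluster, so the Hadoop and JVM settings above can be combined. A sketch, not from the original answer, assuming the configure-hadoop script accepts repeated -s,key=value pairs as in the example above:

    $ ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "-s,mapred.task.timeout=0,-s,mapred.min.split.size=52880" \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
        --args --namenode-heap-size=2048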
