Add streaming step to MR job in boto3 running on AWS EMR 5.0

I am trying to move several MR jobs that I wrote in Python from AWS EMR 2.4 to AWS EMR 5.0. So far I have been using boto 2.4, but it does not support EMR 5.0, so I am trying to switch to boto3. With boto 2.4 I used the StreamingStep module to specify the input and output locations as well as the locations of my mapper and reducer files, and I did not need to create or upload any jar to run my tasks. However, I cannot find an equivalent for this module anywhere in the boto3 documentation. How can I add a streaming step to my MR job in boto3 so that I don't have to download a jar file to run it?
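Roughly, what I have been doing with boto 2 looks like this (bucket names and script paths below are placeholders, and the arguments are from memory):

    # Rough sketch of my current boto 2 approach; buckets and paths are placeholders.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()  # credentials come from the environment / boto config
    step = StreamingStep(
        name='My wordcount example',
        mapper='s3://mybucket/wordSplitter.py',
        reducer='aggregate',
        input='s3://mybucket/input/',
        output='s3://mybucket/output/',
    )
    conn.run_jobflow(name='myjob', steps=[step])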

python amazon-web-services emr boto3
1 answer

Unfortunately, boto3 and the EMR API are rather poorly documented. A minimal word-count example looks like this:

    import boto3

    emr = boto3.client('emr')

    resp = emr.run_job_flow(
        Name='myjob',
        ReleaseLabel='emr-5.0.0',
        Instances={
            'InstanceGroups': [
                {'Name': 'master',
                 'InstanceRole': 'MASTER',
                 'InstanceType': 'c1.medium',
                 'InstanceCount': 1,
                 'Configurations': [
                     {'Classification': 'yarn-site',
                      'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
                {'Name': 'core',
                 'InstanceRole': 'CORE',
                 'InstanceType': 'c1.medium',
                 'InstanceCount': 1,
                 'Configurations': [
                     {'Classification': 'yarn-site',
                      'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
            ]},
        Steps=[
            {'Name': 'My word count example',
             'HadoopJarStep': {
                 'Jar': 'command-runner.jar',
                 'Args': [
                     'hadoop-streaming',
                     '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                     '-mapper', 'python2.7 wordSplitter.py',
                     '-input', 's3://mybucket/input/',
                     '-output', 's3://mybucket/output/',
                     '-reducer', 'aggregate']}
             }
        ],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
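If you already have a cluster running, the same HadoopJarStep structure can be attached to it with add_job_flow_steps; a sketch, where the cluster id and S3 paths are placeholders:

    # Sketch: attach a streaming step to an already-running cluster.
    # 'j-XXXXXXXXXXXXX' and the S3 paths are placeholders.
    emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',
        Steps=[
            {'Name': 'Another streaming step',
             'ActionOnFailure': 'CONTINUE',
             'HadoopJarStep': {
                 'Jar': 'command-runner.jar',
                 'Args': [
                     'hadoop-streaming',
                     '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                     '-mapper', 'python2.7 wordSplitter.py',
                     '-input', 's3://mybucket/input2/',
                     '-output', 's3://mybucket/output2/',
                     '-reducer', 'aggregate']}
             }
        ],
    )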

I don't remember how this behaved with boto 2, but I had trouble getting even a simple streaming job to run without disabling vmem-check-enabled.
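If you would rather not repeat that override for every instance group, on emr-4.x and later the same classification can also be passed once through the top-level Configurations argument of run_job_flow; a small sketch:

    # Reusable yarn-site override; pass it as Configurations=yarn_site to
    # run_job_flow (cluster level) instead of repeating it per instance group.
    yarn_site = [
        {'Classification': 'yarn-site',
         'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}},
    ]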

Also, if your script lives somewhere in S3, ship it to the cluster with -files (appending #filename to the argument makes the downloaded file available as filename on the cluster nodes).
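For example, if both your mapper and reducer are your own scripts sitting in S3 (the bucket and file names here are made up), the step arguments could look like this:

    # Hypothetical step args with both mapper and reducer scripts shipped from S3;
    # -files takes a comma-separated list, and the '#name' fragment controls the
    # local file name on the cluster nodes.
    step_args = [
        'hadoop-streaming',
        '-files', 's3://mybucket/mapper.py#mapper.py,s3://mybucket/reducer.py#reducer.py',
        '-mapper', 'python2.7 mapper.py',
        '-input', 's3://mybucket/input/',
        '-output', 's3://mybucket/output/',
        '-reducer', 'python2.7 reducer.py',
    ]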
