How to run Python Spark code on Amazon AWS?

I have written Python code in Spark and I want to run it on Amazon Elastic MapReduce (EMR).

My code works fine on my local machine, but I'm a bit confused about how to run it on Amazon AWS.

In particular, how do I get my Python code onto the master node? Do I need to copy the Python code into my S3 bucket and execute it from there? Or should I SSH into the master and scp my Python code into the Spark folder on the master?

For now, I tried running the code locally from my terminal and connecting to the cluster address (I pieced this together from the output of spark-submit's --help flag, so I may be skipping a few steps here):

./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
    --master spark://hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
    mypythoncode.py

I tried it both with and without my permissions file, i.e.

 -i permissionsfile.pem 

However, it fails, and the stack trace shows something along the lines of:

 Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
     at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
     at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     ......

Is my approach right, and I just need to solve the access problems to get it working, or am I heading in the wrong direction?

What is the right way to do this?

I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon EMR.

If it helps, the dataset I'm working with is one of the Amazon public datasets.

+8
amazon-s3 amazon-web-services apache-spark pyspark
2 answers
  • Go to EMR and create a new cluster [recommendation: start with just 1 node for testing purposes].
  • Check the box to install Spark; you can uncheck the boxes for the other applications if you don't need them.
  • Optionally configure the cluster further by selecting a VPC and a security key (SSH key, aka .pem key).
  • Wait for the cluster to spin up. Once its status says "Waiting", you can move on.
  • [Job submission via the GUI] In the GUI you can add a Step, select a Spark job, upload your Spark script to S3, and then point the step at that newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment and then click "view logs" on that step's row in the list of steps. Keep tweaking your script until it works. (A rough AWS CLI equivalent of these console steps is sketched right after this list.)
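
A rough AWS CLI equivalent of those console steps, as a sketch only (the bucket name, key pair name, cluster id, release label, and instance type below are placeholders/assumptions, not values from the question):

    # Upload the PySpark script to S3 (hypothetical bucket and key)
    aws s3 cp mypythoncode.py s3://my-bucket/scripts/mypythoncode.py

    # Create a small test cluster with Spark installed
    # (release label, instance type and key pair are assumptions; adjust for your account)
    aws emr create-cluster \
        --name "spark-test" \
        --release-label emr-5.36.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 1 \
        --use-default-roles \
        --ec2-attributes KeyName=my-ssh-key

    # Once the cluster reaches the "Waiting" state, add a Spark step that runs the S3-hosted script
    aws emr add-steps \
        --cluster-id j-XXXXXXXXXXXXX \
        --steps Type=Spark,Name=MyPySparkStep,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/scripts/mypythoncode.py]

As with the GUI route, if the step fails you can inspect its logs from the Steps tab (or in the S3 log bucket, if you configured one), fix your script, and resubmit.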

    [Job submission via the command line] SSH into the master (driver) node, following the SSH instructions at the top of the cluster page. Once inside, use a command-line text editor to create a new file and paste the contents of your script into it. Then spark-submit your new_NewFile.py. If it fails, you'll see the error output directly in the console. Tweak your script and run it again; repeat until it works as expected. (A minimal sketch of this workflow follows below.)
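
A minimal sketch of that command-line route (the .pem file, public DNS, and file names are the placeholders already used in the question):

    # SSH into the master node with your key pair (hostname is a placeholder)
    ssh -i permissionsfile.pem hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com

    # ...or copy the script up from your local machine instead of pasting it into an editor
    scp -i permissionsfile.pem mypythoncode.py \
        hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com:/home/hadoop/

    # On the master node, submit the job; with Spark installed, spark-submit is on the PATH
    spark-submit new_NewFile.py    # or mypythoncode.py, if you copied it with scp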

Note: submitting jobs from your local computer to the remote machine is awkward, because you can easily end up making your local Spark instance responsible for expensive computation and for shipping data over the network. That's why you want to submit AWS EMR jobs from within EMR.

+6

There are two typical ways to run a job on an Amazon EMR cluster (whether it be Spark or other job types):

  • Add the job as a Step on the cluster, either when the cluster is created or afterwards (from the EMR console or the AWS CLI), pointing it at your script in S3.
  • SSH into the master node and run the job there directly (for Spark, with spark-submit).

In addition, if you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.

The error you are seeing indicates that your files are being accessed via the s3n: protocol, which requires AWS credentials to be supplied. If the files were accessed via s3: instead, I suspect the credentials would be picked up from the IAM role that is automatically assigned to the nodes in the cluster, and the error would go away.
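
To make that concrete, here are two hedged options, sketched as shell commands (the bucket path and credential values are placeholders; the property names are the ones quoted in the stack trace):

    # Option 1: keep s3n:// paths but pass the credentials explicitly via spark-submit
    # (placeholder values; never hard-code real keys in scripts you share)
    spark-submit \
        --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
        --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
        mypythoncode.py

    # Option 2 (simpler on EMR): reference the data with s3:// inside the PySpark code,
    # e.g. sc.textFile("s3://some-public-dataset-bucket/path/"), and let the cluster's
    # IAM role supply the credentials; no keys on the command line at all
    spark-submit mypythoncode.py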

+2
