I have written Python code for Spark and I want to run it on Amazon Elastic MapReduce (EMR).
My code works fine on my local machine, but I'm a bit confused about how to run it on AWS.
In particular, how do I pass my Python code to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I SSH into the master and scp my Python code into the Spark folder on the master?
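To make the two options concrete, this is roughly what I imagine each would look like; the bucket name, cluster id, and paths are placeholders and I have not verified either command:

# Option 1 (guess): upload the script to S3 and submit it as an EMR step
aws s3 cp mypythoncode.py s3://my-bucket/mypythoncode.py
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=MyPythonJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/mypythoncode.py]

# Option 2 (guess): copy the script to the master node and run spark-submit there
scp -i permissionsfile.pem mypythoncode.py hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com:~/
ssh -i permissionsfile.pem hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com
spark-submit mypythoncode.py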
At the moment, I have tried running the code locally from my terminal and connecting to the cluster address (I worked this out from the output of spark-submit's --help flag, so I may have skipped a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
  --master spark://hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
  mypythoncode.py
I tried it with and without my permissions (.pem) file, i.e.
-i permissionsfile.pem
However, it fails, and the stack trace shows something along the lines of:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ...... ......
Is my approach right and I just need to solve the access problem to get going, or am I heading in the wrong direction?
What is the right way to do this?
I searched a lot on YouTube but couldn't find any tutorials on how to launch Spark on Amazon EMR.
If that helps, the dataset I'm working on is part of the Amazon public dataset.
amazon-s3 amazon-web-services apache-spark pyspark
Piyush