How do I install my code and dependencies on an AWS Spark cluster?

I can create a Spark cluster on AWS as described here.

However, my own Python code and its pip dependencies need to be present on the master and worker nodes. This is a lot of code, and the pip installation also compiles some native libraries, so I can't just have Spark distribute it at runtime using methods such as passing pyFiles to the SparkContext or the --py-files argument of spark-submit.
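For reference, this is roughly what that runtime-distribution route looks like (a minimal sketch; `mycode.zip`, `my_job.py`, and the package name are placeholders). It only ships pure-Python sources to the executors, which is why it does not help with pip packages that compile native extensions:

```bash
# Package the pure-Python part of the project...
zip -r mycode.zip mypackage/

# ...and ship it with the job (equivalent to passing pyFiles to SparkContext).
# This distributes Python source files only; native libraries built during
# `pip install` are not covered.
spark-submit \
  --master yarn \
  --py-files mycode.zip \
  my_job.py
```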

Of course, I could run a bash script right after running `aws emr create-cluster`, but I am wondering whether there is a more automatic way, so that I can avoid maintaining a large installation bash script.
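Concretely, the manual route I have in mind looks something like this (a sketch only; the release label, instance type, key name, node hostnames, and `install_deps.sh` are placeholder assumptions):

```bash
# Create the cluster (flag values are illustrative).
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key

# install_deps.sh (hypothetical) would do the actual work on each node, e.g.
# install build tools, run `pip install -r requirements.txt`, and install the
# project itself, compiling the native extensions locally.

# Push and run the script on the master and every worker node.
NODE_HOSTS=(ec2-master.example.com ec2-worker-1.example.com ec2-worker-2.example.com)
for host in "${NODE_HOSTS[@]}"; do
  scp -i my-key.pem install_deps.sh "hadoop@${host}:"
  ssh -i my-key.pem "hadoop@${host}" 'bash install_deps.sh'
done
```

That works, but it is exactly the kind of large hand-maintained script I would like to avoid.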

So what is the best way to configure the cluster so that it includes my code and dependencies?
