Amazon EMR: PySpark not found

I created an Amazon EMR cluster with Spark already installed. When I ssh into the cluster and start pyspark from the terminal, it opens the PySpark shell without any problem.

I uploaded a script to the cluster using scp, but when I try to run it with python FileName.py, I get an import error:

    from pyspark import SparkContext
    ImportError: No module named pyspark

How do I fix this?

+5
3 answers

You probably need to add the PySpark libraries to your Python path. I usually use the following function:

    import os
    import sys

    def configure_spark(spark_home=None, pyspark_python=None):
        spark_home = spark_home or "/path/to/default/spark/home"
        os.environ['SPARK_HOME'] = spark_home

        # Add the PySpark directories to the Python path:
        sys.path.insert(1, os.path.join(spark_home, 'python'))
        sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
        sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

        # If a Python binary for PySpark isn't specified, use the currently running one:
        pyspark_python = pyspark_python or sys.executable
        os.environ['PYSPARK_PYTHON'] = pyspark_python

Then you can call the function before importing pyspark:

    configure_spark('/path/to/spark/home')
    from pyspark import SparkContext

The Spark home on the EMR node should be something like /home/hadoop/spark. See https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 for details.
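Putting it together, a script run with plain python might look like the sketch below. It assumes the configure_spark function above is defined in (or imported into) the same file, and that /home/hadoop/spark is the correct Spark home on your node; the script name and the tiny job are only illustrative.

    # example_job.py (hypothetical name): call configure_spark() before the
    # pyspark import so that `python example_job.py` can find the module.
    configure_spark('/home/hadoop/spark')   # SPARK_HOME on this EMR release

    from pyspark import SparkContext

    sc = SparkContext(appName="ConfigureSparkExample")
    print(sc.parallelize([1, 2, 3]).count())  # prints 3 if everything is wired up
    sc.stop()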

+4

For EMR 4.3, I added the following lines to ~/.bashrc:

    export SPARK_HOME=/usr/lib/spark
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Run source ~/.bashrc and you should be good.
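As a quick sanity check (a sketch, not required), run something like the following with the same python you use for FileName.py; if the PYTHONPATH change took effect it prints where pyspark was loaded from instead of raising ImportError.

    # check_pyspark.py: verifies that pyspark is now importable.
    import pyspark
    print(pyspark.__file__)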

+1

You can execute the file directly, as is, from the command line with the following command:

 spark-submit FileName.py 
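This works because spark-submit sets up the PySpark environment itself, so no PYTHONPATH changes are needed. The question doesn't show FileName.py, but any script of roughly this shape (a sketch for illustration) runs unchanged under spark-submit:

    # FileName.py (contents assumed): only needs a SparkContext.
    from pyspark import SparkContext

    sc = SparkContext(appName="SparkSubmitExample")
    print(sc.parallelize(range(10)).sum())  # prints 45
    sc.stop()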
-1
