No module named numpy when running spark-submit

I'm submitting a Python file that imports numpy with spark-submit, but I'm getting a "No module named numpy" error.

 $ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
 Traceback (most recent call last):
   File "/usr/local/www/my_numpy_als.py", line 13, in <module>
     from pyspark.mllib.recommendation import ALS
   File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
     import numpy
 ImportError: No module named numpy

My first thought was to build an egg that bundles numpy along with the other required files, but it's hard for me to figure out how to build such an egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own copy of numpy.

Any idea on what needs to be done here?

3 answers

It seems like Spark is using a version of Python that doesn't have numpy installed. This may be because you are working in a virtual environment.

Try the following:

 # The following specifies a Python version for PySpark. Here we use the
 # currently running Python interpreter. This is handy when working in a
 # virtualenv, for example, because otherwise Spark would pick the default
 # system Python version.
 import os
 import sys

 os.environ['PYSPARK_PYTHON'] = sys.executable
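A minimal sketch of how this could look at the top of the submitted script (the application name is made up for illustration). The environment variable should be set before the SparkContext is created, so the executors launch with the same interpreter, and the same numpy, as the driver:

 import os
 import sys

 # Point PySpark at the interpreter running this script (assumed to have
 # numpy installed) before any Spark objects are created.
 os.environ['PYSPARK_PYTHON'] = sys.executable

 from pyspark import SparkContext
 from pyspark.mllib.recommendation import ALS

 sc = SparkContext(appName="numpy-als-example")  # hypothetical app name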

Sometimes, when you import certain libraries, your namespace gets polluted with numpy functions. Built-ins such as min , max , and sum are especially susceptible to being shadowed this way. If in doubt, look for calls to these functions and replace them with __builtin__.sum , etc. Doing this is sometimes faster than tracking down the source of the pollution.
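For example, a minimal sketch of the shadowing problem and the explicit workaround (Python 2 syntax, since the answer refers to __builtin__ ; the values are only for illustration):

 import __builtin__
 from numpy import *                  # star-import rebinds sum, min, max to the numpy versions

 print sum is __builtin__.sum         # False: the name 'sum' now refers to numpy.sum
 print __builtin__.sum([1, 2, 3])     # explicitly call the original built-in: prints 3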


I succeeded by installing numpy on all EMR nodes, using a small bootstrap script that contains the following (among other things).

 #!/bin/bash -xe
 sudo yum install python-numpy python-scipy -y

Then configure the bootstrap script to run when your cluster starts up by adding the following parameter to the aws emr command (the example below also passes an argument to the bootstrap script):

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This can be used when automatically configuring a cluster with DataPipeline.

