I run Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the production nodes. Therefore, I shipped numpy along with my program, but I get the following error:
Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray
The script is actually quite simple:
from pyspark import SparkConf, SparkContext

sc = SparkContext()
sc.addPyFile('numpy.zip')

import numpy
a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()
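In case it helps, a small check like the following (the probe function is just an illustrative name, not part of my actual script) should show whether the import fails on the executors rather than on the driver:

# Hypothetical debugging snippet: run a tiny job so the executors
# (rather than the driver) report whether they can import numpy
# and, if so, from where.
def probe(_):
    try:
        import numpy
        return numpy.__file__
    except ImportError as e:
        return 'ImportError: %s' % e

print(sc.parallelize(range(2), 2).map(probe).collect())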
I understand that the error occurs because numpy dynamically loads its multiarray.so dependency, and even though my numpy.zip file contains multiarray.so, the import still fails because this kind of dynamic loading does not work when the package is shipped to Apache Spark as a zip file. Why is that? And how do I create a standalone numpy module with static binding?
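If my understanding is right, the minimal sketch below (plain Python, no Spark, assuming numpy.zip has a numpy/ package at its top level) should reproduce the same failure, since zipimport can load pure-Python modules from an archive but not compiled extension modules such as multiarray.so:

# Minimal sketch of my understanding (assumes numpy.zip contains a top-level
# numpy/ package): putting a zip archive on sys.path is roughly what
# sc.addPyFile('numpy.zip') does on the executors.
import sys
sys.path.insert(0, 'numpy.zip')

try:
    import numpy  # numpy/__init__.py is found inside the zip...
except ImportError as e:
    # ...but the import chain breaks when the compiled multiarray
    # module cannot be loaded from the archive.
    print('ImportError: %s' % e)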
Thanks.