NumPy and static linking

I run Spark programs on a large cluster (for which I do not have administrative privileges). NumPy is not installed on the production nodes, so I bundled NumPy with my program, but I get the following error:

    Traceback (most recent call last):
      File "/home/user/spark-script.py", line 12, in <module>
        import numpy
      File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
      File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
      File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
      File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
      File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
    ImportError: cannot import name multiarray

The script is actually quite simple:

    from pyspark import SparkConf, SparkContext

    sc = SparkContext()
    sc.addPyFile('numpy.zip')

    import numpy

    a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
    print a.collect()

I understand that the error occurs because NumPy dynamically loads its multiarray.so dependency, and even though my numpy.zip contains multiarray.so, dynamic loading does not work when the module is shipped this way through Apache Spark. Why is that? And how do you build a standalone NumPy module with static linking?

Thanks.

+6
python numpy apache-spark pyspark
1 answer

There are at least two problems with your approach, and both can be reduced to the simple fact that NumPy is a heavyweight dependency.

  • NumPy (the Debian package, at least) depends on several native libraries, including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy a NumPy installation to another machine and expect it to work (and frankly, you should not do anything like this even if you could). In theory you could try to build it with static linking and ship it with all of its dependencies bundled, but that runs into the second problem.

  • NumPy is fairly large on its own. While 20 MB does not sound particularly impressive, and with all its dependencies it should not exceed 40 MB, it has to be shipped to the workers every time you start a job. The more workers you have, the worse it gets. If you decide you also need SciPy or a SciKit, it can get much worse.

Arguably, this makes NumPy a poor candidate for shipping with the addPyFile method.

If you did not have direct access to the workers but all the dependencies were present there, including the header files and a static library, you could simply try to install NumPy into user space from the task itself (this assumes that pip is installed as well) with something like this:

    try:
        import numpy as np
    except ImportError:
        import pip
        pip.main(["install", "--user", "numpy"])
        import numpy as np
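
Note that pip 10 and later no longer expose pip.main, so on up-to-date installations the in-task install is more robustly done by invoking pip as a subprocess. A minimal sketch of that variant (same idea as above; nothing Spark-specific is assumed):

    import subprocess
    import sys

    try:
        import numpy as np
    except ImportError:
        # Run pip through the current interpreter; this avoids relying on
        # pip.main(), which newer pip versions no longer provide.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", "numpy"])
        import numpy as np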

You will find other variations of this method in How to install and import Python modules at runtime?

Since you do have access to the workers, a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda, which can package non-Python dependencies as well and does not depend on system-wide libraries. You can easily automate this task with tools like Ansible or Fabric; it does not require administrative privileges, and all you really need is bash and some way to fetch the basic installers (wget, curl, rsync, scp).
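
If you go that route, one way to point PySpark at the environment's interpreter is to set PYSPARK_PYTHON before the SparkContext is created. A minimal sketch, assuming a conda environment with NumPy already exists at the same (illustrative) path on every worker; depending on your Spark version and deploy mode you may prefer to export the variable in the shell or in spark-env.sh instead:

    import os

    # Illustrative path: a conda environment with NumPy installed, created
    # beforehand at the same location on every worker node.
    os.environ["PYSPARK_PYTHON"] = "/home/user/anaconda2/envs/numpy-env/bin/python"

    from pyspark import SparkContext

    sc = SparkContext()  # workers now launch Python from that environment

    import numpy

    a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
    print a.collect()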

See also: shipping Python modules in PySpark to other nodes?

+6
