ImportError: No module named numpy on Spark workers

Running pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell works fine, but it fails inside kmeans. My impression is that the executors somehow do not have numpy installed. I have not found a single good solution anywhere for making numpy available to the workers. I tried setting PYSPARK_PYTHON, but that didn't work either.
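For context, the PYSPARK_PYTHON attempt presumably looked something like the following (a hedged sketch, not taken from the original post; the interpreter path is an assumption):

    # Hypothetical illustration: point the workers at a Python that has numpy,
    # then start the shell as before. If this path does not exist on the YARN
    # nodes, the executors still run a Python without numpy and the import fails.
    export PYSPARK_PYTHON=/usr/bin/python2.7   # assumed path
    bin/pyspark --master yarn-client --num-executors 60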

 import numpy
 features = numpy.load(open("combined_features.npz"))
 features = features['arr_0']
 features.shape
 features_rdd = sc.parallelize(features, 5000)
 from pyspark.mllib.clustering import KMeans, KMeansModel
 from numpy import array
 from math import sqrt
 clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")

Stack trace

  org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
      command = pickleSer._read_with_length(infile)
    File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
      return self.loads(obj)
    File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
      return pickle.loads(obj)
    File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
  ImportError: No module named numpy
          at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
          at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
          at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
          at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
          at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
          at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
          at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
          at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
          at org.apache.spark.scheduler.Task.run(Task.scala:88)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
+12
python numpy apache-spark pyspark
9 answers

To use Spark in YARN client mode, you need to install any dependencies on the machines where YARN starts the executors. That is the only surefire way to make this work.
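A minimal sketch of what that can look like, assuming passwordless ssh/sudo and a nodes.txt file listing the YARN NodeManager hosts (the file name and the use of pip are assumptions):

    # Install numpy on every node that can run executors.
    while read host; do
        ssh "$host" 'sudo pip install numpy'
    done < nodes.txt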

Using Spark in YARN cluster mode is a different story. You can distribute Python dependencies with spark-submit:

 spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py 

However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed, you cannot distribute numpy in this manner.
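A quick, hedged way to see why (the install path below is an assumption and varies by system):

    # numpy ships compiled extension modules, so zipping its Python files alone is not enough.
    find /usr/lib/python2.7/dist-packages/numpy -name '*.so' | head
    # typically lists files such as multiarray.so and umath.so, the C code doing the heavy lifting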

+15

http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html

You can also check out this article; it describes your problem pretty well.
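In broad strokes, that class of fix comes down to pointing every node at a Python environment that already has numpy, for example via spark-env.sh. The sketch below is under assumptions (the interpreter and config paths are placeholders, not quoted from the article):

    # On every node, make Spark use an interpreter that has numpy installed.
    echo 'export PYSPARK_PYTHON=/opt/anaconda2/bin/python' >> /etc/spark/conf/spark-env.sh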

+4

numpy is not installed on the worker (virtual) machines. If you use Anaconda, it is very convenient to ship such Python dependencies when deploying the application in cluster mode (so there is no need to install numpy or other modules on each machine; they just have to be in your Anaconda environment). First zip up your Anaconda installation and put the zip file on the cluster, then you can submit the job with a script like the following.

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --archives hdfs://host/path/to/anaconda.zip#python-env \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
    app_main.py

YARN will copy anaconda.zip from the HDFS path to every worker and use python-env/anaconda/bin/python to execute the tasks.
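Depending on the Spark version, the executors themselves may also need to be pointed at the shipped interpreter. A hedged variant of the same command (the extra --conf uses Spark's documented spark.executorEnv.* mechanism; everything else mirrors the command above):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --archives hdfs://host/path/to/anaconda.zip#python-env \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
      app_main.py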

See Starting PySpark with Virtualenv for more information.

+1

I had a similar problem, but I don't think you need to set PYSPARK_PYTHON. Instead, just install numpy on the worker machines (with apt-get or yum). The error also tells you on which machine the import was missing.
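For example (package names differ between distributions; these are the common ones):

    sudo apt-get install python-numpy   # Debian/Ubuntu
    sudo yum install numpy              # RHEL/CentOS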

0
 sudo pip install numpy 

Reinstalling numpy with sudo seems to fix it so that the module can be found.

0

I had the same problem. Try installing numpy with pip3 if you are using Python 3:

pip3 install numpy

0

It is as simple as the error says: "ImportError: No module named numpy".

Install pip and then numpy:

  1. cd to the pyspark directory. You will find the path in the error message; the path shown below is from my sandbox:

     /usr/hdp/current/spark-client/python/lib/pyspark 
  2. Install pip

     yum install python-pip 
  3. Install numpy

     pip install numpy 
  4. Upgrade pip (this step may not be necessary).

     pip install --upgrade pip 
0

Be aware that numpy must be installed on each and every worker, and even on the master itself (depending on your component placement).

Also make sure that pip install numpy is run from the root account (sudo is not enough), after setting the umask to 022 (umask 022), so that the permissions cascade down to the Spark (or Zeppelin) user.
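A minimal sketch of that sequence (run as root; the site-packages path is an assumption):

    umask 022
    pip install numpy
    # Verify the installed files are world-readable so the Spark/Zeppelin user can import them.
    ls -ld /usr/lib/python2.7/site-packages/numpy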

0

What solved it for me (on Mac) was this guide, which also explains how to run Python through Jupyter Notebooks: https://medium.com/@yajieli/install-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735

In a nutshell (assuming you installed Spark with brew install apache-spark):

  1. Find SPARK_PATH using brew info apache-spark
  2. Add these lines to your ~/.bash_profile
 # Spark and Python ######
 export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
 export PYSPARK_DRIVER_PYTHON="jupyter"
 export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
 # For Python 3, you have to add the line below or you will get an error
 export PYSPARK_PYTHON=python3
 alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
 ######
  3. You can now open a Jupyter Notebook simply by running pyspark

And just remember that you do not need to create the SparkContext yourself; simply call:

 sc = SparkContext.getOrCreate() 
0
