Running PySpark in an IDE like Spyder?

I can start PySpark from the command line, and everything works fine.

 ~/spark-1.0.0-bin-hadoop1/bin$ ./pyspark

 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
       /_/

 Using Python version 2.7.6 (default, May 27, 2014 2:50:58 PM)

However, when I try to do this in a Python IDE:

 import pyspark

 ImportError: No module named pyspark

How can I import it the way I import other Python libraries, such as numpy or scikit-learn?

Working in the terminal is fine; I just want to work in the IDE.

+10
apache-spark
Jun 16 '14 at 18:24
4 answers

I wrote the launcher script below a while ago for exactly this purpose. I wanted to be able to interact with the pyspark shell from the bpython(1) interpreter and the WING IDE (or any IDE, for that matter), because their code completion and other features provide a complete development experience. Learning the Spark core API by simply typing "pyspark" is not enough. So I wrote this. It was written in a Cloudera CDH5 environment, but with a little tweaking you can make it work in your environment (even a manually installed one).

How to use:

 NOTE: You can place all of the following in your .profile (or equivalent).
 (1) linux$ export MASTER='yarn-client | local[NN] | spark://host:port'
 (2) linux$ export SPARK_HOME=/usr/lib/spark         # Yours will vary.
 (3) linux$ export JAVA_HOME=/usr/java/latest        # Yours will vary.
 (4) linux$ export NAMENODE='vps00'                  # Yours will vary.
 (5) linux$ export PYSTARTUP=${PYTHONSTARTUP}        # See in-line comments for why this alias to PYTHONSTARTUP is needed.
 (6) linux$ export HADOOP_CONF_DIR=/etc/hadoop/conf  # Yours will vary. This one may not be necessary to set. Try and see.
 (7) linux$ export HADOOP_HOME=/usr/lib/hadoop       # Yours will vary. This one may not be necessary to set. Try and see.
 (8) bpython -i /path/to/script/below                # The moment of truth. Note that this is 'bpython' (not plain
                                                     # 'python', which would not give the code completion you desire).
 >>> sc
 <pyspark.context.SparkContext object at 0x2798110>
 >>>

Now to use it with an IDE, you simply work out how to specify the equivalent of a PYTHONSTARTUP script for that IDE, and set it to '/path/to/script/below'. For example, as described in the in-line comments below, for the WING IDE you simply set the key/value pair 'PYTHONSTARTUP=/path/to/script/below' inside the project properties section.

See the in-line comments for more information.

 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
 #
 # ===========================================================================
 # Author: Noel Milton Vega (PRISMALYTICS, LLC.)
 # ===========================================================================
 # Start-up script for 'python(1)', 'bpython(1)', and Python IDE interpreters
 # when you want a 'client-mode' SPARK Shell (ie interactive SPARK shell)
 # environment either LOCALLY, on a SPARK Standalone Cluster, or on a SPARK
 # YARN cluster. The code-sense/intelligence of bpython(1) and IDEs, in
 # particular, will aid in learning the SPARK core API.
 #
 # This script basically (1) first sets up an environment to launch a SPARK
 # Shell, then (2) launches the SPARK Shell using the 'shell.py' python script
 # provided in the distribution's SPARK_HOME; and finally (3) imports our
 # favorite Python modules (for convenience; eg numpy, scipy; etc.).
 #
 # IMPORTANT:
 # DON'T RUN THIS SCRIPT DIRECTLY. It is meant to be read in by interpreters
 # (similar, in that respect, to a PYTHONSTARTUP script).
 #
 # Thus, there are two ways to use this file:
 #   # We can't refer to PYTHONSTARTUP inside this file b/c that causes a recursion loop
 #   # when calling this from within IDEs. So in step (0) we alias PYTHONSTARTUP to
 #   # PYSTARTUP at the O/S level, and use that alias here (since no conflict with that).
 # (0): user$ export PYSTARTUP=${PYTHONSTARTUP}  # We can't use PYTHONSTARTUP in this file
 # (1): user$ export MASTER='yarn-client | local[NN] | spark://host:port'
 #      user$ bpython|python -i /path/to/this/file
 #
 # (2): From within your favorite IDE, specify it as your python startup
 #      script. For example, from within a WINGIDE project, set the following
 #      variables within a WING Project: 'Project -> Project Properties':
 #         'PYTHONSTARTUP=/path/to/this/very/file'
 #         'MASTER=yarn-client | local[NN] | spark://host:port'
 # ===========================================================================
 import sys, os, glob, subprocess, random
 namenode = os.getenv('NAMENODE')
 SPARK_HOME = os.getenv('SPARK_HOME')
 # ===========================================================================


 # =================================================================================
 # This function emulates the action of "source" or '.' that exists in bash(1),
 # and can be used to set PYTHON environment variables (in Python's globals dict).
 # =================================================================================
 def source(script, update=True):
     proc = subprocess.Popen(". %s; env -0" % script, stdout=subprocess.PIPE, shell=True)
     output = proc.communicate()[0]
     env = dict((line.split("=", 1) for line in output.split('\x00') if line))
     if update:
         os.environ.update(env)
     return env
 # ================================================================================


 # ================================================================================
 # Here, we get the name of our current SPARK Assembly JAR file name (locally). We
 # use that to create a HDFS URL that points to its location in HDFS when using
 # YARN (ie when 'export MASTER=yarn-client'; we ignore it otherwise).
 # ================================================================================
 # Remember to always upload/update your distribution's current SPARK Assembly JAR
 # to HDFS like this:
 #   $ hdfs dfs -mkdir -p /user/spark/share/lib                   # Only necessary to do once!
 #   $ hdfs dfs -rm "/user/spark/share/lib/spark-assembly-*.jar"  # Remove old version.
 #   $ hdfs dfs -put ${SPARK_HOME}/assembly/lib/spark-assembly-[0-9]*.jar /user/spark/share/lib/
 # ================================================================================
 SPARK_JAR_LOCATION = glob.glob(SPARK_HOME + '/lib/' + 'spark-assembly-[0-9]*.jar')[0].split("/")[-1]
 SPARK_JAR_LOCATION = 'hdfs://' + namenode + ':8020/user/spark/share/lib/' + SPARK_JAR_LOCATION
 # ================================================================================


 # ================================================================================
 # Update Python's globals environment variable dict with necessary environment
 # variables that the SPARK Shell will be looking for. Some we set explicitly via
 # an in-line dictionary, as shown below. And the rest are set by 'source'ing the
 # global SPARK environment file (although we could have included those explicitly
 # here too, if we preferred not to touch that system-wide file -- and leave it as FCS).
 # ================================================================================
 spark_jar_opt = None
 MASTER = os.getenv('MASTER') if os.getenv('MASTER') else 'local[8]'
 if MASTER.startswith('yarn-'):
     spark_jar_opt = ' -Dspark.yarn.jar=' + SPARK_JAR_LOCATION
 elif MASTER.startswith('spark://'):
     pass
 else:
     HADOOP_HOME = ''
 # ================================================================================


 # ================================================================================
 # Build '--driver-java-options' options for spark-shell, pyspark, or spark-submit.
 # Many of these are set in '/etc/spark/conf/spark-defaults.conf' (and thus
 # commented out here, but left here for reference completeness).
 # ================================================================================
 # Default UI port is 4040. The next statement allows us to run multiple SPARK shells.
 DRIVER_JAVA_OPTIONS = '-Dspark.ui.port=' + str(random.randint(1025, 65535))
 DRIVER_JAVA_OPTIONS += spark_jar_opt if spark_jar_opt else ''
 # ================================================================================


 # ================================================================================
 # Build PYSPARK_SUBMIT_ARGS (ie the same ones shown in 'pyspark --help'), and
 # apply them to the O/S environment.
 # ================================================================================
 DRIVER_JAVA_OPTIONS = "'" + DRIVER_JAVA_OPTIONS + "'"
 PYSPARK_SUBMIT_ARGS = ' --master ' + MASTER  # Remember to set MASTER on UNIX CLI or in the IDE!
 PYSPARK_SUBMIT_ARGS += ' --driver-java-options ' + DRIVER_JAVA_OPTIONS  # Built above.
 # ================================================================================
 os.environ.update(source('/etc/spark/conf/spark-env.sh', update=False))
 os.environ.update({'PYSPARK_SUBMIT_ARGS': PYSPARK_SUBMIT_ARGS})
 # ================================================================================


 # ================================================================================
 # Next, adjust 'sys.path' so the SPARK Shell has the python modules it needs.
 # ================================================================================
 SPARK_PYTHON_DIR = SPARK_HOME + '/python'
 PY4J = glob.glob(SPARK_PYTHON_DIR + '/lib/' + 'py4j-*-src.zip')[0].split("/")[-1]
 sys.path = [SPARK_PYTHON_DIR, SPARK_PYTHON_DIR + '/lib/' + PY4J] + sys.path
 # ================================================================================


 # ================================================================================
 # With our environment set, we start the SPARK Shell; and then to that, we add
 # our favorite Python imports (eg numpy, scipy; etc).
 # ================================================================================
 print('PYSPARK_SUBMIT_ARGS:' + PYSPARK_SUBMIT_ARGS)        # For visual debug.
 execfile(SPARK_HOME + '/python/pyspark/shell.py', globals())  # Start the SPARK Shell.
 execfile(os.getenv('PYSTARTUP'))                            # Next, load our favorite Python modules.
 # ================================================================================
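Once this file has been read in and the banner appears, 'sc' is already bound to a live SparkContext, so a quick sanity check from the interpreter might look like this (a minimal sketch that uses nothing beyond the 'sc' created by shell.py):

 >>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
 50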

Enjoy and good luck! :)

+4
Feb 07 '15 at 8:37

Thanks to Ophir YokTon's post above, I was finally able to do this with Spark 1.4.1 and Spyder 2.3.4.

Here is a summary of all my steps, which I hope will help people in similar situations.

  1. Add the PYTHONPATH variable in .bashrc (of course, you can put it in another relevant profile file):
 export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
 export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
  2. Make the change effective:
 source .bashrc 
  3. Create a copy of spyder as spyder.py in your Spyder bin directory:
 cp spyder spyder.py 
  4. Launch the Spyder IDE with the following command:
 spark-submit spyder.py 

I implemented the sample "simple application" from the Apache Spark quick start, and it now passes testing in the Spyder environment; see the screenshot at http://i.stack.imgur.com/xTv6s.gif . A sketch of that application is shown below.
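For reference, here is a minimal sketch of that quick-start style "simple application" as it can be run from inside Spyder launched this way (the README.md path is just a placeholder; point it at any text file on your machine):

 from pyspark import SparkContext

 logFile = "/path/to/spark/README.md"   # placeholder: any text file will do
 sc = SparkContext("local", "Simple App")
 logData = sc.textFile(logFile).cache()

 numAs = logData.filter(lambda s: 'a' in s).count()
 numBs = logData.filter(lambda s: 'b' in s).count()

 print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
 sc.stop()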

+2
Jul 27 '15 at 9:37

pyspark is probably not on your Python path. Go to the folder where the pyspark package is located, and add that folder to your Python path; a runtime sketch of this is shown below.
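For example, a minimal sketch of doing this at runtime, at the top of a script or in the IDE console (the SPARK_HOME default shown is only an assumed example path; adjust it to your installation):

 import sys, os, glob

 SPARK_HOME = os.environ.get('SPARK_HOME', '/home/me/spark-1.0.0-bin-hadoop1')  # assumed install path
 sys.path.insert(0, os.path.join(SPARK_HOME, 'python'))
 # pyspark also needs the bundled py4j zip on the path:
 sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, 'python/lib/py4j-*-src.zip'))[0])

 import pyspark   # should now succeed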

0
Dec 22 '14 at 1:47
  • If you just want to import the module, add the pyspark folder to your Python path

  • If you want to run full scripts from the IDE, you can create a "tool" that uses spark-submit to execute your script from the IDE (instead of running it directly)

  • In particular, for Spyder (or any other Python IDE), you can launch the IDE itself from spark-submit

Example:

 spark-submit.cmd c:\Python27\Scripts\spyder.py 
  • Note that I had to rename spyder to spyder.py - it seems that spark-submit relies on the file extension to distinguish between Python, Java, and Scala
  • Add any required parameters to spark-submit (an illustrative example follows below)
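For instance, extra options can go in front of the script; --master and --driver-memory are standard spark-submit flags, and the values shown here are only illustrative:

 spark-submit.cmd --master local[4] --driver-memory 2g c:\Python27\Scripts\spyder.py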
0
Jun 03 '15 at 13:40


