Running PySpark in an IDE like Spyder?

I can start PySpark from the command line, and everything works fine.

 ~/spark-1.0.0-bin-hadoop1/bin$ ./pyspark

 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 1.0.0
       /_/

 Using Python version 2.7.6 (default, May 27, 2014 2:50:58 PM)

However, when I try to do this in a Python IDE:

 import pyspark

 ImportError: No module named pyspark

How can I import it the way I import other Python libraries, such as numpy or scikit-learn?

Working in the terminal is fine; I just want to work in the IDE.

+10
apache-spark
Jun 16 '14 at 18:24
4 answers

I wrote the launcher script below a while ago for exactly this purpose. I wanted to be able to interact with the pyspark shell from the bpython(1) interpreter and the WING IDE (or any IDE, for that matter), because their code completion and other features provide a complete development experience. Learning the Spark core API by simply typing "pyspark" is not enough. So I wrote this. It was written in a Cloudera CDH5 environment, but with a little tweaking you can make it work in your environment (even a manually installed one).

How to use:

 NOTE: You can place all of the following in your .profile (or equivalent).
 (1) linux$ export MASTER='yarn-client | local[NN] | spark://host:port'
 (2) linux$ export SPARK_HOME=/usr/lib/spark         # Yours will vary.
 (3) linux$ export JAVA_HOME=/usr/java/latest        # Yours will vary.
 (4) linux$ export NAMENODE='vps00'                  # Yours will vary.
 (5) linux$ export PYSTARTUP=${PYTHONSTARTUP}        # See in-line comments for why this alias to PYTHONSTARTUP is needed.
 (6) linux$ export HADOOP_CONF_DIR=/etc/hadoop/conf  # Yours will vary. This one may not be necessary to set. Try and see.
 (7) linux$ export HADOOP_HOME=/usr/lib/hadoop       # Yours will vary. This one may not be necessary to set. Try and see.
 (8) bpython -i /path/to/script/below                # The moment of truth. Note that this is 'bpython' (not plain
                                                     # 'python', which would not give the code completion you desire).
 >>> sc
 <pyspark.context.SparkContext object at 0x2798110>
 >>>

Now to use it with an IDE, you simply work out how to specify the equivalent of a PYTHONSTARTUP script for that IDE, and set it to '/path/to/script/below'. For example, as described in the in-line comments below, for the WING IDE you simply set the key/value pair 'PYTHONSTARTUP=/path/to/script/below' inside the project properties section.

See the in-line comments for more information.

 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
 #
 # ===========================================================================
 # Author: Noel Milton Vega (PRISMALYTICS, LLC.)
 # ===========================================================================
 # Start-up script for 'python(1)', 'bpython(1)', and Python IDE interpreters
 # when you want a 'client-mode' SPARK Shell (ie interactive SPARK shell)
 # environment either LOCALLY, on a SPARK Standalone Cluster, or on a SPARK
 # YARN cluster. The code-sense/intelligence of bpython(1) and IDEs, in
 # particular, will aid in learning the SPARK core API.
 #
 # This script basically (1) first sets up an environment to launch a SPARK
 # Shell, then (2) launches the SPARK Shell using the 'shell.py' python script
 # provided in the distribution's SPARK_HOME; and finally (3) imports our
 # favorite Python modules (for convenience; eg numpy, scipy; etc.).
 #
 # IMPORTANT:
 # DON'T RUN THIS SCRIPT DIRECTLY. It is meant to be read in by interpreters
 # (similar, in that respect, to a PYTHONSTARTUP script).
 #
 # Thus, there are two ways to use this file:
 #   # We can't refer to PYTHONSTARTUP inside this file b/c that causes a recursion loop
 #   # when calling this from within IDEs. So in step (0) we alias PYTHONSTARTUP to
 #   # PYSTARTUP at the O/S level, and use that alias here (since no conflict with that).
 # (0): user$ export PYSTARTUP=${PYTHONSTARTUP}  # We can't use PYTHONSTARTUP in this file
 # (1): user$ export MASTER='yarn-client | local[NN] | spark://host:port'
 #      user$ bpython|python -i /path/to/this/file
 #
 # (2): From within your favorite IDE, specify it as your python startup
 #      script. For example, from within a WINGIDE project, set the following
 #      variables within a WING Project: 'Project -> Project Properties':
 #         'PYTHONSTARTUP=/path/to/this/very/file'
 #         'MASTER=yarn-client | local[NN] | spark://host:port'
 # ===========================================================================
 import sys, os, glob, subprocess, random
 namenode = os.getenv('NAMENODE')
 SPARK_HOME = os.getenv('SPARK_HOME')
 # ===========================================================================


 # =================================================================================
 # This function emulates the action of "source" or '.' that exists in bash(1),
 # and can be used to set PYTHON environment variables (in Python's globals dict).
 # =================================================================================
 def source(script, update=True):
     proc = subprocess.Popen(". %s; env -0" % script, stdout=subprocess.PIPE, shell=True)
     output = proc.communicate()[0]
     env = dict((line.split("=", 1) for line in output.split('\x00') if line))
     if update:
         os.environ.update(env)
     return env
 # ================================================================================


 # ================================================================================
 # Here, we get the name of our current SPARK Assembly JAR file name (locally). We
 # use that to create a HDFS URL that points to its location in HDFS when using
 # YARN (ie when 'export MASTER=yarn-client'; we ignore it otherwise).
 # ================================================================================
 # Remember to always upload/update your distribution's current SPARK Assembly JAR
 # to HDFS like this:
 #   $ hdfs dfs -mkdir -p /user/spark/share/lib                   # Only necessary to do once!
 #   $ hdfs dfs -rm "/user/spark/share/lib/spark-assembly-*.jar"  # Remove old version.
 #   $ hdfs dfs -put ${SPARK_HOME}/assembly/lib/spark-assembly-[0-9]*.jar /user/spark/share/lib/
 # ================================================================================
 SPARK_JAR_LOCATION = glob.glob(SPARK_HOME + '/lib/' + 'spark-assembly-[0-9]*.jar')[0].split("/")[-1]
 SPARK_JAR_LOCATION = 'hdfs://' + namenode + ':8020/user/spark/share/lib/' + SPARK_JAR_LOCATION
 # ================================================================================


 # ================================================================================
 # Update Python's globals environment variable dict with necessary environment
 # variables that the SPARK Shell will be looking for. Some we set explicitly via
 # an in-line dictionary, as shown below. And the rest are set by 'source'ing the
 # global SPARK environment file (although we could have included those explicitly
 # here too, if we preferred not to touch that system-wide file -- and leave it as FCS).
 # ================================================================================
 spark_jar_opt = None
 MASTER = os.getenv('MASTER') if os.getenv('MASTER') else 'local[8]'
 if MASTER.startswith('yarn-'):
     spark_jar_opt = ' -Dspark.yarn.jar=' + SPARK_JAR_LOCATION
 elif MASTER.startswith('spark://'):
     pass
 else:
     HADOOP_HOME = ''
 # ================================================================================


 # ================================================================================
 # Build '--driver-java-options' options for spark-shell, pyspark, or spark-submit.
 # Many of these are set in '/etc/spark/conf/spark-defaults.conf' (and thus
 # commented out here, but left here for reference completeness).
 # ================================================================================
 # Default UI port is 4040. The next statement allows us to run multiple SPARK shells.
 DRIVER_JAVA_OPTIONS = '-Dspark.ui.port=' + str(random.randint(1025, 65535))
 DRIVER_JAVA_OPTIONS += spark_jar_opt if spark_jar_opt else ''
 # ================================================================================


 # ================================================================================
 # Build PYSPARK_SUBMIT_ARGS (ie the same ones shown in 'pyspark --help'), and
 # apply them to the O/S environment.
 # ================================================================================
 DRIVER_JAVA_OPTIONS = "'" + DRIVER_JAVA_OPTIONS + "'"
 PYSPARK_SUBMIT_ARGS = ' --master ' + MASTER  # Remember to set MASTER on UNIX CLI or in the IDE!
 PYSPARK_SUBMIT_ARGS += ' --driver-java-options ' + DRIVER_JAVA_OPTIONS  # Built above.
 # ================================================================================
 os.environ.update(source('/etc/spark/conf/spark-env.sh', update=False))
 os.environ.update({'PYSPARK_SUBMIT_ARGS': PYSPARK_SUBMIT_ARGS})
 # ================================================================================


 # ================================================================================
 # Next, adjust 'sys.path' so the SPARK Shell has the python modules it needs.
 # ================================================================================
 SPARK_PYTHON_DIR = SPARK_HOME + '/python'
 PY4J = glob.glob(SPARK_PYTHON_DIR + '/lib/' + 'py4j-*-src.zip')[0].split("/")[-1]
 sys.path = [SPARK_PYTHON_DIR, SPARK_PYTHON_DIR + '/lib/' + PY4J] + sys.path
 # ================================================================================


 # ================================================================================
 # With our environment set, we start the SPARK Shell; and then to that, we add
 # our favorite Python imports (eg numpy, scipy; etc).
 # ================================================================================
 print('PYSPARK_SUBMIT_ARGS:' + PYSPARK_SUBMIT_ARGS)        # For visual debug.
 execfile(SPARK_HOME + '/python/pyspark/shell.py', globals())  # Start the SPARK Shell.
 execfile(os.getenv('PYSTARTUP'))                            # Next, load our favorite Python modules.
 # ================================================================================
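Once this file has been read in and the banner appears, 'sc' is already bound to a live SparkContext, so a quick sanity check from the interpreter might look like this (a minimal sketch that uses nothing beyond the 'sc' created by shell.py):

 >>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
 50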

Enjoy and good luck! :)

+4
Feb 07 '15 at 8:37

Thanks to Ophir YokTon's post above, I was finally able to do this with Spark 1.4.1 and Spyder 2.3.4.

Here is a summary of all my steps, which I hope will help people in similar situations.

  1. Add the PYTHONPATH variable in .bashrc (of course, you can put it in another relevant profile file):
 export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
 export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
  2. Make the change effective:
 source .bashrc 
  3. Create a copy of spyder as spyder.py in your Spyder bin directory:
 cp spyder spyder.py 
  4. Launch the Spyder IDE with the following command:
 spark-submit spyder.py 

I implemented the sample "simple application" from the Apache Spark quick start, and it now passes testing in the Spyder environment; see the screenshot at http://i.stack.imgur.com/xTv6s.gif . A sketch of that application is shown below.
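For reference, here is a minimal sketch of that quick-start style "simple application" as it can be run from inside Spyder launched this way (the README.md path is just a placeholder; point it at any text file on your machine):

 from pyspark import SparkContext

 logFile = "/path/to/spark/README.md"   # placeholder: any text file will do
 sc = SparkContext("local", "Simple App")
 logData = sc.textFile(logFile).cache()

 numAs = logData.filter(lambda s: 'a' in s).count()
 numBs = logData.filter(lambda s: 'b' in s).count()

 print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
 sc.stop()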

+2
Jul 27 '15 at 9:37

pyspark is probably not on your Python path. Go to the folder where the pyspark package is located, and add that folder to your Python path; a runtime sketch of this is shown below.
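For example, a minimal sketch of doing this at runtime, at the top of a script or in the IDE console (the SPARK_HOME default shown is only an assumed example path; adjust it to your installation):

 import sys, os, glob

 SPARK_HOME = os.environ.get('SPARK_HOME', '/home/me/spark-1.0.0-bin-hadoop1')  # assumed install path
 sys.path.insert(0, os.path.join(SPARK_HOME, 'python'))
 # pyspark also needs the bundled py4j zip on the path:
 sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, 'python/lib/py4j-*-src.zip'))[0])

 import pyspark   # should now succeed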

0
Dec 22 '14 at 1:47
  • If you just want to import the module, add the pyspark folder to your Python path

  • If you want to run full scripts from the IDE, you can create a "tool" that uses spark-submit to execute your script from the IDE (instead of running it directly)

  • In particular, for Spyder (or any other Python IDE), you can launch the IDE itself from spark-submit

Example:

 spark-submit.cmd c:\Python27\Scripts\spyder.py 
  • Note that I had to rename spyder to spyder.py - it seems that spark-submit relies on the file extension to distinguish between Python, Java, and Scala
  • Add any required parameters to spark-submit (an illustrative example follows below)
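For instance, extra options can go in front of the script; --master and --driver-memory are standard spark-submit flags, and the values shown here are only illustrative:

 spark-submit.cmd --master local[4] --driver-memory 2g c:\Python27\Scripts\spyder.py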
0
Jun 03 '15 at 13:40


