Create PySpark Profile for IPython

I followed this link http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/ to create a PySpark profile for IPython.

00-pyspark-setup.py

    # Configure the necessary Spark environment
    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, spark_home + "\python")

    # Add the py4j to the path.
    # You may need to change the version number to match your install
    sys.path.insert(0, os.path.join(spark_home, '\python\lib\py4j-0.8.2.1-src.zip'))

    # Initialize PySpark to predefine the SparkContext variable 'sc'
    execfile(os.path.join(spark_home, '\python\pyspark\shell.py'))

My problem is that when I type sc in the IPython notebook, I get '' (an empty string), whereas I should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.

Any idea on how to resolve it?

+7
python apache-spark
4 answers

I tried to do the same but ran into problems. I now use findspark ( https://github.com/minrk/findspark ) instead. You can install it with pip (see https://pypi.python.org/pypi/findspark/ ):

 $ pip install findspark 

And then, inside the notebook:

    import findspark
    findspark.init()

    import pyspark
    sc = pyspark.SparkContext(appName="myAppName")

If you want to avoid repeating this boilerplate in every notebook, you can put the above four lines in 00-pyspark-setup.py instead; a sketch of such a startup file follows.
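A minimal sketch of that startup file, assuming SPARK_HOME is set (or that findspark can locate Spark on its own) and with appName as a placeholder:

    # 00-pyspark-setup.py -- minimal sketch using findspark
    import findspark
    findspark.init()  # adds Spark's python/ directory and py4j to sys.path

    import pyspark

    # Predefine the SparkContext variable 'sc' for every notebook in the profile
    sc = pyspark.SparkContext(appName="myAppName")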

(At the moment I am on Spark 1.4.1 and findspark 0.0.5.)

+7

Try setting the SPARK_LOCAL_IP variable to the correct value, for example:

 export SPARK_LOCAL_IP="$(hostname -f)" 

before starting ipython notebook --profile=pyspark .
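If you prefer to keep everything in the profile's startup script rather than in your shell, the same idea can be written in Python; this is just a sketch and assumes it runs before pyspark's shell.py is executed:

    # Sketch: set SPARK_LOCAL_IP from Python, the equivalent of
    # export SPARK_LOCAL_IP="$(hostname -f)"
    import os
    import socket

    os.environ.setdefault('SPARK_LOCAL_IP', socket.getfqdn())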

If this does not help, try debugging the environment by running the setup script directly:

 python 00-pyspark-setup.py 

That way you may see the actual error messages and can debug from there.

0

Are you on Windows? I was dealing with the same thing, and a few changes helped. In 00-pyspark-setup.py, change this part (adjust the path to your own Spark folder):

    # Configure the environment
    if 'SPARK_HOME' not in os.environ:
        print 'environment spark not set'
        os.environ['SPARK_HOME'] = 'C:/spark-1.4.1-bin-hadoop2.6'

Make sure you have added SPARK_HOME as an environment variable; if you have not, the snippet above sets it manually.

The next thing I noticed is that if you are using IPython 4 (the latest), the profile configuration files do not work the way the tutorials describe. You can check whether your configuration files are actually being executed by adding a print statement to them, or by deliberately breaking them so that an error is raised.
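As a sketch of that check, you could temporarily add a line like the following near the top of 00-pyspark-setup.py (the message text is arbitrary); if nothing is printed when the notebook starts, the profile's startup scripts are not being run:

    # Temporary debug line -- remove once you have confirmed the profile works
    print 'running 00-pyspark-setup.py from the pyspark profile'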

I am using an older version of IPython (3) and start it with:

 ipython notebook --profile=pyspark 
0

Change 00-pyspark-setup.py to:

    # Configure the necessary Spark environment
    import os
    import sys

    # Spark home
    spark_home = os.environ.get("SPARK_HOME")

    ######## CODE ADDED ########
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"
    ######## END OF ADDED CODE ########

    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

    # Initialize PySpark to predefine the SparkContext variable 'sc'
    execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Basically, the added code sets the PYSPARK_SUBMIT_ARGS environment variable to --master local[2] pyspark-shell, which works for local standalone mode in Spark 1.6.

Now start ipython notebook again. Evaluate os.environ["PYSPARK_SUBMIT_ARGS"] to check that its value is set correctly. If it is, typing sc should give you the expected result, for example <pyspark.context.SparkContext at 0x1097e8e90>.
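A quick sketch of that check in a notebook cell (assuming the setup script ran and created sc):

    import os

    # Confirm the submit args picked up by the PySpark shell
    print os.environ.get("PYSPARK_SUBMIT_ARGS")

    # Should show something like <pyspark.context.SparkContext object at 0x...>
    print sc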

0
