First, thanks for your question; I used this answer to solve the same problem. I want to address your comment of August 11, 2015.
You can look at what is happening in the shell behind the notebook for an explanation.
On the first call:
sc = SparkContext()
you will see Spark initializing, just as when you start Spark from the shell, and sc gets initialized (it is created by default when you start the Spark shell).
If you call it again, you get the error you mention in your comment. So I would say that you had already initialized the Spark context sc the first time you ran that statement (or perhaps you ran that statement twice).
The reason it works when you remove the SparkContext definition is that sc is already defined, but the next time you start your IPython notebook I think you will have to run sc = SparkContext() once.
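As a minimal sketch of how you could guard against this in a notebook (assuming a plain SparkContext() with default settings; the exact error text varies by Spark version):

from pyspark import SparkContext

# Only create the context if this kernel session does not already have one;
# calling SparkContext() a second time raises an error along the lines of
# "Cannot run multiple SparkContexts at once".
try:
    sc
except NameError:
    sc = SparkContext()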
To make this clearer, I would organize the code in the IPython Notebook as follows:
One cell that you run once each time the kernel is restarted, to set up the Python environment and initialize the Spark context:
import os.path
import csv
from pyspark import SparkContext

sc = SparkContext()
A second cell that you run as many times as you need while testing:
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
train_data = sc.textFile(filename, numpart)
But if you prefer to use a single cell, you can instead call the stop() method on the SparkContext object at the end of your code:
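A rough single-cell sketch of that idea, reusing the same hypothetical file layout as above (the count() call is just an example action to force the RDD to be evaluated):

import os.path
from pyspark import SparkContext

sc = SparkContext()

basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
train_data = sc.textFile(filename, 2)
print(train_data.count())

# Stop the context so the cell can be re-run without
# hitting the "multiple SparkContexts" error.
sc.stop()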
I really recommend O'Reilly's book, Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. In particular, chapter 2 covers this kind of question about core Spark concepts.
Perhaps you have figured this out by now, and apologies if so, but it may help someone else!