First, thanks for your question; I used this answer to solve the same problem. I want to address your comment of August 11, 2015.
You can look at what is happening in the shell behind the notebook for an explanation.
On the first call:
sc = SparkContext()
you will see Spark initializing, just as when you start Spark from the shell, and sc gets initialized (it is created by default when you start the Spark shell).
If you call it again, you get the error you mention in your comment. So I would say that you had already initialized the Spark context sc the first time you ran that statement (or perhaps you ran that statement twice).
The reason it works when you remove the SparkContext definition is that sc is already defined, but the next time you start your IPython notebook I think you will have to run sc = SparkContext() once.
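As a minimal sketch of how you could guard against this in a notebook (assuming a plain SparkContext() with default settings; the exact error text varies by Spark version):

from pyspark import SparkContext

# Only create the context if this kernel session does not already have one;
# calling SparkContext() a second time raises an error along the lines of
# "Cannot run multiple SparkContexts at once".
try:
    sc
except NameError:
    sc = SparkContext()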
To make this clearer, I would organize the code in the IPython Notebook as follows:
One cell that you run once each time the kernel is restarted, to set up the Python environment and initialize the Spark context:
import os.path
import csv
from pyspark import SparkContext

sc = SparkContext()
A second cell that you run as many times as you need while testing:
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
train_data = sc.textFile(filename, numpart)
But if you prefer to use a single cell, you can instead call the stop() method on the SparkContext object at the end of your code:
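A rough single-cell sketch of that idea, reusing the same hypothetical file layout as above (the count() call is just an example action to force the RDD to be evaluated):

import os.path
from pyspark import SparkContext

sc = SparkContext()

basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
train_data = sc.textFile(filename, 2)
print(train_data.count())

# Stop the context so the cell can be re-run without
# hitting the "multiple SparkContexts" error.
sc.stop()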
I really recommend O'Reilly's book, Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. In particular, chapter 2 covers this kind of question about core Spark concepts.
Perhaps you have figured this out by now, and apologies if so, but it may help someone else!