Creating a DataFrame from a list of tuples using pyspark

I work with data extracted from SFDC using the simple-salesforce package. I am using Python 3 for scripting and Spark 1.5.2.
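
For context, the extraction itself looks roughly like this (a simplified sketch; the credentials, object, and field names below are placeholders, not the ones from my real script):

    from simple_salesforce import Salesforce

    # Placeholder connection details and SOQL query
    sf = Salesforce(username='user@example.com', password='***', security_token='***')
    result = sf.query("SELECT Id, PackSize__c, Name FROM Product__c")

    # result['records'] is a list of dicts, one per Salesforce record
    records = result['records']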

I created an RDD containing the following data:

    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
    ...

This data is in an RDD called v_rdd

My schema looks like this:

 StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true))) 
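
That printed form is just how Spark displays the schema; in Python it corresponds to a StructType built like this:

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField('Id', StringType(), True),
        StructField('PackSize', StringType(), True),
        StructField('Name', StringType(), True)
    ])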

I am trying to create a DataFrame from this RDD:

 sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema) 

I print the schema and show the DataFrame:

    sqlDataFrame.printSchema()
    sqlDataFrame.show()

And get the following:

    +--------------------+--------------------+--------------------+
    |                  Id|            PackSize|                Name|
    +--------------------+--------------------+--------------------+
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    +--------------------+--------------------+--------------------+

I expect to see actual data, for example:

    +----------------+--------+----+
    |              Id|PackSize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+

Could you help me figure out what I am doing wrong here?

My Python script is long and I'm not sure it would be convenient for people to sift through, so I've posted only the parts I'm having a problem with.

Thanks a ton in advance!

1 answer

Hey, next time it would be easier if you provided a working example.

The way your RDD is structured is not suitable for creating a DataFrame directly. This is how you create a DataFrame according to the Spark documentation:

    >>> l = [('Alice', 1)]
    >>> sqlContext.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]

So, applied to your example, you can produce the desired result as follows:

    from pyspark.sql.types import StructType, StructField, StringType

    # Your data at the moment
    data = sc.parallelize([
        [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
        [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
        [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
    ])

    # Convert each record to a plain tuple of values
    data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))

    # Define the schema
    schema = StructType([
        StructField("Id", StringType(), True),
        StructField("Packsize", StringType(), True),
        StructField("Name", StringType(), True)
    ])

    # Create the DataFrame
    DF = sqlContext.createDataFrame(data_converted, schema)

    # Output
    DF.show()

    +----------------+--------+----+
    |              Id|Packsize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+
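
If you can't rely on the (key, value) pairs always arriving in the same order, a slightly more defensive variant (just a sketch along the same lines, not something from your script) is to look the values up by field name instead of by position:

    # Turn each record (a list of (field, value) tuples) into a dict,
    # then pick the values by name so the column order can't get mixed up.
    # PackSize is cast to str to match the StringType field in the schema.
    data_by_name = data.map(lambda rec: dict(rec)) \
                       .map(lambda d: (d['Id'], str(d['PackSize']), d['Name']))

    DF = sqlContext.createDataFrame(data_by_name, schema)
    DF.show()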

Hope this helps

