Creating a DataFrame from a list of tuples using pyspark

I work with data extracted from SFDC using the simple-salesforce package. I am using Python 3 for scripting and Spark 1.5.2.
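
For context, the extraction itself looks roughly like this (a simplified sketch; the credentials, object, and field names below are placeholders, not the ones from my real script):

    from simple_salesforce import Salesforce

    # Placeholder connection details and SOQL query
    sf = Salesforce(username='user@example.com', password='***', security_token='***')
    result = sf.query("SELECT Id, PackSize__c, Name FROM Product__c")

    # result['records'] is a list of dicts, one per Salesforce record
    records = result['records']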

I created an RDD containing the following data:

    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
    ...

This data is in an RDD called v_rdd

My schema looks like this:

 StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true))) 
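
That printed form is just how Spark displays the schema; in Python it corresponds to a StructType built like this:

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField('Id', StringType(), True),
        StructField('PackSize', StringType(), True),
        StructField('Name', StringType(), True)
    ])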

I am trying to create a DataFrame from this RDD:

 sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema) 

I print the schema and show the DataFrame:

    sqlDataFrame.printSchema()
    sqlDataFrame.show()

And get the following:

    +--------------------+--------------------+--------------------+
    |                  Id|            PackSize|                Name|
    +--------------------+--------------------+--------------------+
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    |[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
    +--------------------+--------------------+--------------------+

I expect to see actual data, for example:

    +----------------+--------+----+
    |              Id|PackSize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+

Could you help me figure out what I am doing wrong here?

My Python script is long and I'm not sure it would be convenient for people to sift through, so I've posted only the parts I'm having a problem with.

Thanks a ton in advance!

1 answer

Hey, next time it would be easier if you provided a working example.

The way your RDD is structured is not suitable for creating a DataFrame directly. This is how you create a DataFrame according to the Spark documentation:

    >>> l = [('Alice', 1)]
    >>> sqlContext.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]

So, applied to your example, you can produce the desired result as follows:

    from pyspark.sql.types import StructType, StructField, StringType

    # Your data at the moment
    data = sc.parallelize([
        [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
        [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
        [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
    ])

    # Convert each record to a plain tuple of values
    data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))

    # Define the schema
    schema = StructType([
        StructField("Id", StringType(), True),
        StructField("Packsize", StringType(), True),
        StructField("Name", StringType(), True)
    ])

    # Create the DataFrame
    DF = sqlContext.createDataFrame(data_converted, schema)

    # Output
    DF.show()

    +----------------+--------+----+
    |              Id|Packsize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+
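
If you can't rely on the (key, value) pairs always arriving in the same order, a slightly more defensive variant (just a sketch along the same lines, not something from your script) is to look the values up by field name instead of by position:

    # Turn each record (a list of (field, value) tuples) into a dict,
    # then pick the values by name so the column order can't get mixed up.
    # PackSize is cast to str to match the StringType field in the schema.
    data_by_name = data.map(lambda rec: dict(rec)) \
                       .map(lambda d: (d['Id'], str(d['PackSize']), d['Name']))

    DF = sqlContext.createDataFrame(data_by_name, schema)
    DF.show()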

Hope this helps

