Type-related errors can be avoided by specifying the schema explicitly, as follows:
Note: a text file (test.csv) was created with the source data (as indicated above), and hypothetical column names ("col1", "col2", ..., "col25") were inserted.
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
pdDF = pd.read_csv("test.csv")
The contents of the pandas data frame:
pdDF
       col1  col2  col3  col4   col5 col6   col7  col8  col9  col10  ...  col16  col17  col18  col19  col20  col21  col22  col23  col24  col25
0  10000001     1     0     1  12:35   OK  10002     1     0      9  ...      3      9      0      0      1      1      0      0      4    543
1  10000001     2     0     1  12:36   OK  10002     1     0      9  ...      3      9      2      1      1      3      1      3      2    611
2  10000002     1     0     4  12:19   PA  10003     1     1      7  ...      2     15      2      0      2      3      1      2      2    691
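A common source of the type errors mentioned above is pandas' own inference: an integer column containing a missing value is read as float64, which then clashes with an IntegerType or LongType field on the Spark side. A small pandas-only sketch (with hypothetical data, not the file above) illustrates this:

```python
import io
import pandas as pd

# Hypothetical CSV: col2 has a missing value in the second row.
csv_data = io.StringIO("col1,col2\n10000001,1\n10000002,\n")
df = pd.read_csv(csv_data)

# pandas promotes col2 to float64 because of the NaN,
# even though every present value is an integer.
print(df.dtypes)
```

Supplying an explicit schema (or cleaning the types beforehand) avoids handing Spark a float column where an integer column was expected.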
Next, define the schema:
from pyspark.sql.types import *

mySchema = StructType([
    StructField("Col1", LongType(), True),
    StructField("Col2", IntegerType(), True),
    StructField("Col3", IntegerType(), True),
    StructField("Col4", IntegerType(), True),
    StructField("Col5", StringType(), True),
    StructField("Col6", StringType(), True),
    StructField("Col7", IntegerType(), True),
    StructField("Col8", IntegerType(), True),
    StructField("Col9", IntegerType(), True),
    StructField("Col10", IntegerType(), True),
    StructField("Col11", StringType(), True),
    StructField("Col12", StringType(), True),
    StructField("Col13", IntegerType(), True),
    StructField("Col14", IntegerType(), True),
    StructField("Col15", IntegerType(), True),
    StructField("Col16", IntegerType(), True),
    StructField("Col17", IntegerType(), True),
    StructField("Col18", IntegerType(), True),
    StructField("Col19", IntegerType(), True),
    StructField("Col20", IntegerType(), True),
    StructField("Col21", IntegerType(), True),
    StructField("Col22", IntegerType(), True),
    StructField("Col23", IntegerType(), True),
    StructField("Col24", IntegerType(), True),
    StructField("Col25", IntegerType(), True)
])
Note: True means the field is nullable.
Create the PySpark data frame:
df = spark.createDataFrame(pdDF, schema=mySchema)
Confirm that the pandas data frame has been converted to a PySpark data frame:
type(df)
output:
pyspark.sql.dataframe.DataFrame
Aside:

To answer Kate's comment below: to apply a generic all-string schema instead, you can do the following:
df = spark.createDataFrame(pdDF.astype(str))
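Casting every column to str first means Spark only has to infer StringType fields, which sidesteps type mismatches at the cost of losing the numeric types. A pandas-only sketch (with hypothetical data) of what astype(str) does before the data reaches Spark:

```python
import pandas as pd

# Hypothetical frame: one numeric column, one string column.
pdDF = pd.DataFrame({"col1": [10000001, 10000002], "col5": ["12:35", "12:19"]})

# Cast everything to Python str; every column becomes dtype "object",
# so Spark would infer StringType for all of them.
strDF = pdDF.astype(str)

print(strDF.dtypes)
print(strDF["col1"].iloc[0])  # the integer is now the string "10000001"
```

Any numeric processing on the Spark side then requires casting the relevant columns back, so the explicit schema above is preferable when the column types are known.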