I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
    name,state,number_of_people,coolness_index
    trenton,nj,"10","4.5"
    bedford,ny,"20","3.3"
    patterson,nj,"30","2.2"
    camden,nj,"40","8.8"
Here is the code to create the dataset:
    var location = "s3a://path_to_csv"

    case class City(name: String, state: String, number_of_people: Long)

    val cities = spark.read
      .option("header", "true")
      .option("charset", "UTF8")
      .option("delimiter", ",")
      .csv(location)
      .as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate"
Databricks talks about creating Datasets, and about this particular error message, in this blog post.
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255), the Analyzer will emit an AnalysisException.
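If I understand that paragraph correctly, the check happens as soon as the encoder is applied, before any rows are processed. Here is a minimal sketch of what I think the same check looks like outside of CSV (the Tiny case class and the toy data are mine, not from the blog post):

    // A string column matched against a Long field trips the same eager check.
    import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

    case class Tiny(number_of_people: Long)

    val df = Seq("10", "20").toDF("number_of_people") // column is typed as string
    val ds = df.as[Tiny] // AnalysisException: cannot up cast from string to bigint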
I am using the Long type, so I did not expect to see this error message.
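My current guess (an assumption on my part, not something from the blog post) is that, without inferSchema, the CSV reader loads every column as a string, so the encoder sees a string-to-bigint cast regardless of the Long field in the case class. A sketch of the workaround I am considering, casting the column before applying the encoder:

    import org.apache.spark.sql.functions.col

    val cities = spark.read
      .option("header", "true")
      .option("charset", "UTF8")
      .option("delimiter", ",")
      .csv(location)
      .withColumn("number_of_people", col("number_of_people").cast("long")) // make the column numeric first
      .as[City]

I believe .option("inferSchema", "true") on the reader would have a similar effect, but I have not verified either approach. Is that the right way to think about this error?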