I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
    name,state,number_of_people,coolness_index
    trenton,nj,"10","4.5"
    bedford,ny,"20","3.3"
    patterson,nj,"30","2.2"
    camden,nj,"40","8.8"
Here is the code to create the dataset:
    var location = "s3a://path_to_csv"

    case class City(name: String, state: String, number_of_people: Long)

    val cities = spark.read
      .option("header", "true")
      .option("charset", "UTF8")
      .option("delimiter", ",")
      .csv(location)
      .as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate"
Databricks talks about creating Datasets, and about this particular error message, in this blog post.
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255), the Analyzer will emit an AnalysisException.
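If I understand that paragraph correctly, the check happens as soon as the encoder is applied, before any rows are processed. Here is a minimal sketch of what I think the same check looks like outside of CSV (the Tiny case class and the toy data are mine, not from the blog post):

    // A string column matched against a Long field trips the same eager check.
    import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

    case class Tiny(number_of_people: Long)

    val df = Seq("10", "20").toDF("number_of_people") // column is typed as string
    val ds = df.as[Tiny] // AnalysisException: cannot up cast from string to bigint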
I am using the Long type, so I did not expect to see this error message.
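My current guess (an assumption on my part, not something from the blog post) is that, without inferSchema, the CSV reader loads every column as a string, so the encoder sees a string-to-bigint cast regardless of the Long field in the case class. A sketch of the workaround I am considering, casting the column before applying the encoder:

    import org.apache.spark.sql.functions.col

    val cities = spark.read
      .option("header", "true")
      .option("charset", "UTF8")
      .option("delimiter", ",")
      .csv(location)
      .withColumn("number_of_people", col("number_of_people").cast("long")) // make the column numeric first
      .as[City]

I believe .option("inferSchema", "true") on the reader would have a similar effect, but I have not verified either approach. Is that the right way to think about this error?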