How to make inferSchema for CSV treat integers as dates (with the option "dateFormat")?

I am using Spark 2.2.0

I read the CSV file as follows:

val dataFrame = spark.read
  .option("inferSchema", "true")
  .option("header", true)
  .option("dateFormat", "yyyyMMdd")
  .csv(pathToCSVFile)

There is one date column in this file, and all entries in this column are equal to 20171001.

The problem is that Spark infers the type of this column as integer, not date. When I turn off the "inferSchema" parameter, the type of this column is string.

There are no null or malformed strings in this file.
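For reference, a minimal sketch reproducing both observations (the path "data.csv" and the column name "date" are assumptions, not taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// With inferSchema on, the column comes back as integer:
val inferred = spark.read
  .option("inferSchema", "true")
  .option("header", true)
  .option("dateFormat", "yyyyMMdd")
  .csv("data.csv")
inferred.printSchema()
// root
//  |-- date: integer (nullable = true)

// With inferSchema off, every column defaults to string:
val notInferred = spark.read
  .option("header", true)
  .csv("data.csv")
notInferred.printSchema()
// root
//  |-- date: string (nullable = true)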

What is the reason / solution to this problem?

2 answers

If my understanding is correct, the code implies the following order of type inference (the first type that matches wins):

  • NullType
  • IntegerType
  • LongType
  • DecimalType
  • DoubleType
  • TimestampType
  • BooleanType
  • StringType

With this in mind, I think the problem is that 20171001 matches IntegerType before TimestampType is even considered (and note that TimestampType uses the timestampFormat option, not dateFormat).
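A quick way to check this claim: even when timestampFormat is set to the matching pattern, inference still returns integer, because IntegerType is tried (and succeeds) first. A sketch, using the same hypothetical file as above:

val stillInteger = spark.read
  .option("inferSchema", "true")
  .option("header", true)
  .option("timestampFormat", "yyyyMMdd")  // the option the TimestampType branch actually reads
  .csv("data.csv")
stillInteger.printSchema()
// root
//  |-- date: integer (nullable = true)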

One solution would be to define the schema explicitly and use it with the schema operator (of DataFrameReader), or to let Spark SQL infer the schema and then use the cast operator.

I would choose the former if the number of fields is not large.
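Both options sketched below (the column name "date" and the path are assumptions):

import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.{DateType, StructField, StructType}

// Option 1: define the schema explicitly; with a DateType column,
// the dateFormat option is honoured while parsing the values.
val schema = StructType(Seq(StructField("date", DateType)))
val explicit = spark.read
  .schema(schema)
  .option("header", true)
  .option("dateFormat", "yyyyMMdd")
  .csv("data.csv")
explicit.printSchema()
// root
//  |-- date: date (nullable = true)

// Option 2: accept the inferred integer, then cast through string to date.
// to_date(column, format) is available since Spark 2.2.0.
val casted = dataFrame.withColumn("date", to_date(col("date").cast("string"), "yyyyMMdd"))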


In this case you simply cannot depend on schema inference, due to the ambiguity of the format.

Since the input can be parsed either as IntegerType (or any higher-precision numeric type) or as TimestampType, and the former has higher priority (internally Spark tries IntegerType → LongType → DecimalType → DoubleType → TimestampType), the inference mechanism will never reach the TimestampType case.

To be specific, with schema inference enabled, Spark will call tryParseInteger, which parses the input correctly and stops there. Subsequent calls (for the remaining rows) will match the IntegerType case and end in the same tryParseInteger call.
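A heavily simplified sketch of that fallthrough (this is not Spark's actual source, just an illustration of the priority; the NullType and BooleanType steps are omitted):

import scala.util.Try

def inferField(field: String): String =
  if (Try(field.toInt).isSuccess) "IntegerType"   // "20171001" already succeeds here
  else if (Try(field.toLong).isSuccess) "LongType"
  else if (Try(BigDecimal(field)).isSuccess) "DecimalType"
  else if (Try(field.toDouble).isSuccess) "DoubleType"
  else if (Try(java.sql.Timestamp.valueOf(field)).isSuccess) "TimestampType"
  else "StringType"

inferField("20171001")   // "IntegerType" — the TimestampType branch is never reached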
