What is the correct Date / Datetime format in JSON for Spark SQL to automatically infer the schema for it?

Spark SQL supports automatic schema inference from a JSON input source (where each line is a standalone JSON object). It does this by scanning the entire data set to build the schema, which is expensive, but it is still useful. (I'm talking about 1.2.1, not the new 1.3, so there may be some changes.)

I have seen conflicting reports about whether this is supported, but I believe it was added recently (in 1.2).

My question is: what is the right way to format a Date / Datetime / Timestamp value in JSON so that Spark SQL's automatic schema inference identifies it as such?

+7
apache-spark apache-spark-sql
2 answers

It is possible to infer dates in a format of your choice (I used the Date.toJSON format) with a small modification to Spark, and performance stays reasonable.

Get the latest maintenance branch:

 git clone https://github.com/apache/spark.git
 cd spark
 git checkout branch-1.4

Replace the following block in InferSchema:

 case VALUE_STRING if parser.getTextLength < 1 =>
   // Zero length strings and nulls have special handling to deal
   // with JSON generators that do not distinguish between the two.
   // To accurately infer types for empty strings that are really
   // meant to represent nulls we assume that the two are isomorphic
   // but will defer treating null fields as strings until all the
   // record fields' types have been combined.
   NullType

 case VALUE_STRING => StringType

with the following code:

 case VALUE_STRING =>
   val len = parser.getTextLength
   if (len < 1) {
     NullType
   } else if (len == 24) {
     // try to match dates of the form "1968-01-01T12:34:56.789Z"
     // for performance, only try parsing if text is 24 chars long and ends with a Z
     val chars = parser.getTextCharacters
     val offset = parser.getTextOffset
     if (chars(offset + len - 1) == 'Z') {
       try {
         org.apache.spark.sql.catalyst.util.DateUtils.stringToTime(new String(chars, offset, len))
         TimestampType
       } catch {
         case e: Exception => StringType
       }
     } else {
       StringType
     }
   } else {
     StringType
   }

Build Spark according to your settings. I used:

 mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install 

To test, create a file called datedPeople.json at the top level that contains the following data:

 {"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"} {"name":"Bob"} {"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"} {"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"} 

Read in the file. Make sure you set the conf parameter before using sqlContext at all, or it will not work. Dates!

 .\bin\spark-shell.cmd

 scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")

 scala> val datedPeople = sqlContext.read.json("datedPeople.json")
 datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]

 scala> datedPeople.foreach(println)
 [2012-04-23 13:25:43.511,Andy]
 [1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
 [null,Bob]
 [1988-11-24 05:21:13.121,This has 24 characters!!]
+8

JSON type inference will never infer date types. Non-empty strings are always inferred as strings. Source code:

 private[sql] object InferSchema {
   // ...
   private def inferField(parser: JsonParser): DataType = {
     import com.fasterxml.jackson.core.JsonToken._
     parser.getCurrentToken match {
       // ...
       case VALUE_STRING => StringType
       // ...
     }
   }
   // ...
 }

For automatic detection, this would need to be changed to look at the actual string (parser.getValueAsString) and return a date type when the format matches.
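As a rough illustration of that idea, here is a standalone helper rather than a patch against Spark; the regex, the helper name, and the choice of TimestampType are assumptions for the sake of the example:

 import org.apache.spark.sql.types.{DataType, StringType, TimestampType}

 // Hypothetical helper: return TimestampType for strings that look like
 // ISO-8601 timestamps (e.g. "2012-04-23T18:25:43.511Z"), StringType otherwise.
 object StringFieldTypeGuess {
   private val isoTimestamp = """\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z""".r

   def infer(value: String): DataType = value match {
     case isoTimestamp() => TimestampType
     case _              => StringType
   }
 }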

It is probably easier to just take the regular, automatically inferred schema and convert the date columns in a second step.
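For example, a minimal sketch of that two-step approach, assuming the sqlContext.read.json API from the answer above and a string field named birthdate holding values like "2012-04-23T18:25:43.511Z"; the parseIso UDF and its format string are illustrative, not part of Spark:

 import java.sql.Timestamp
 import java.text.SimpleDateFormat
 import java.util.TimeZone
 import org.apache.spark.sql.functions.udf

 // Parse ISO-8601 strings into timestamps; the SimpleDateFormat is created
 // per call because it is not thread-safe.
 val parseIso = udf { (s: String) =>
   if (s == null) null
   else {
     val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
     fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
     new Timestamp(fmt.parse(s).getTime)
   }
 }

 val raw = sqlContext.read.json("datedPeople.json")   // birthdate inferred as string
 val withDates = raw.select(raw("name"), parseIso(raw("birthdate")).as("birthdate"))
 withDates.printSchema()   // birthdate is now a timestamp column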

Another option is to read a small sample of the data (without Spark), infer the schema yourself, and then use that schema when creating the DataFrame. This avoids the extra inference pass over the data.
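A sketch of what that looks like, reusing the field names from the sample file above and the explicit-schema reader available in Spark 1.4; whether the timestamp strings actually parse depends on the Spark version and JSON parser in use, so treat it as illustrative:

 import org.apache.spark.sql.types.{StructType, StructField, StringType, TimestampType}

 // Schema built by hand (e.g. after inspecting a small sample outside Spark).
 val schema = StructType(Seq(
   StructField("name", StringType, nullable = true),
   StructField("birthdate", TimestampType, nullable = true)
 ))

 // Supplying the schema up front skips the inference pass over the whole data set.
 val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")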

+5
