It is possible to have dates in the format of your choice (I used the Date.toJSON format) parsed and inferred as timestamps with a slight modification, and performance stays reasonable.
Get the latest maintenance branch:
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4
Replace the following block in InferSchema.scala:
case VALUE_STRING if parser.getTextLength < 1 =>
with the following code:
case VALUE_STRING =>
  val len = parser.getTextLength
  if (len < 1) {
    NullType
  } else if (len == 24) {
    // try to match dates of the form "1968-01-01T12:34:56.789Z"
    // for performance, only try parsing if text is 24 chars long and ends with a Z
    val chars = parser.getTextCharacters
    val offset = parser.getTextOffset
    if (chars(offset + len - 1) == 'Z') {
      try {
        org.apache.spark.sql.catalyst.util.DateUtils.stringToTime(new String(chars, offset, len))
        TimestampType
      } catch {
        case e: Exception => StringType
      }
    } else {
      StringType
    }
  } else {
    StringType
  }
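The two guards in front of the parse are where the performance comes from: by reading the parser's character buffer directly and checking only the length and the trailing 'Z', ordinary strings never pay for a String allocation or a thrown exception. The same heuristic can be sketched outside Spark. This is illustrative only: DateSniff and the Inferred stand-in types are made up here, and java.time.Instant.parse stands in for Spark's internal DateUtils.stringToTime.

import java.time.Instant
import scala.util.Try

object DateSniff {
  sealed trait Inferred
  case object TimestampLike extends Inferred
  case object StringLike    extends Inferred
  case object NullLike      extends Inferred

  // Mirror of the patched case arm: empty strings map to null,
  // 24-char strings ending in 'Z' are tried as ISO-8601 instants,
  // everything else stays a string.
  def infer(s: String): Inferred =
    if (s.length < 1) NullLike
    else if (s.length == 24 && s.charAt(23) == 'Z' && Try(Instant.parse(s)).isSuccess) TimestampLike
    else StringLike

  def main(args: Array[String]): Unit =
    Seq(
      "2012-04-23T18:25:43.511Z",  // TimestampLike
      "Dolla Dolla BillZZZZZZZZ",  // 24 chars ending in 'Z', but the parse fails => StringLike
      "Andy",                      // StringLike
      ""                           // NullLike
    ).foreach(s => println(s"'$s' -> ${infer(s)}"))
}

Note the second test value: a 24-character string ending in 'Z' takes the expensive path and still correctly falls back to a string when parsing fails, which is exactly what the test data below exercises.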
Build Spark according to your settings. I used:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install
To test, create a file called datedPeople.json at the top level that contains the following data:
{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"} {"name":"Bob"} {"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"} {"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}
Read in the file. Make sure you set the conf parameter before using sqlContext at all, or the timestamps will not be inferred. Dates!
.\bin\spark-shell.cmd

scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")

scala> val datedPeople = sqlContext.read.json("datedPeople.json")
datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]

scala> datedPeople.foreach(println)
[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
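Because birthdate is now a real timestamp column rather than a string, it participates in chronological comparisons. A quick follow-on sketch for the same shell session (untested against this exact build; Catalyst should cast the string literal to a timestamp for the comparison):

import sqlContext.implicits._

// Chronological, not lexicographic, comparison; rows with a null
// birthdate (Bob) drop out of the result automatically.
datedPeople.filter($"birthdate" > "1980-01-01").select("name").show()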