I have a very strange problem with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I am running Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried Anaconda 4.2.0 (which ships with Python 2.7.12), but got errors there as well.
The same code on Ubuntu Server 15.04 with the same version of Java and Python 2.7.9 works without errors.
The official documentation for spark.read.load() reads:
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
The official Java documentation describes MMM as the correct pattern for parsing month names such as Jan, Dec, etc., but when I use it I get a series of errors starting with java.lang.IllegalArgumentException. The documentation also mentions LLL, but pyspark does not recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.
I know of workarounds other than dateFormat, but this is the fastest way to parse the data and the one with the simplest code. What am I missing here?
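For reference, the kind of fallback I mean is reading the column as a plain string and converting it with a Python UDF. A rough, untested sketch (parse_ddMMMyyyy is just a name I made up; note that %b in strptime is itself locale-dependent):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType, StringType, StructType, StructField

spark = SparkSession.builder.appName("My app").getOrCreate()

# Untested sketch of the fallback: load the date column as a string,
# then convert it with a Python UDF instead of relying on dateFormat.
parse_ddMMMyyyy = udf(
    lambda s: datetime.strptime(s, "%d%b%Y").date() if s else None,
    DateType())

raw = spark.read.load(
    "test.csv",
    format="csv",
    sep=",",
    header="true",
    schema=StructType([StructField("column", StringType())]))

df = raw.withColumn("column", parse_ddMMMyyyy(raw["column"]))
df.show()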
To run the following examples, you just need to put test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.
My test case using ddMMMyyyy format
I have a text file called test.csv containing the following two lines:
col1
31Dec1989
and the code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])

df = spark.read.load(
    "test.csv",
    schema=struct,
    format="csv",
    sep=",",
    header="true",
    dateFormat="ddMMMyyyy",
    mode="FAILFAST")

df.show()
I get errors. I also tried moving the month name before or after the day and year (for example 1989Dec31 with the pattern yyyyMMMdd), without success.
Working example using ddMMyyyy format
This example is identical to the previous one, except for the date format. test.csv now contains:
col1
31121989
The following code prints the contents of test.csv :
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])

df = spark.read.load(
    "test.csv",
    schema=struct,
    format="csv",
    sep=",",
    header="true",
    dateFormat="ddMMyyyy",
    mode="FAILFAST")

df.show()
The output is as follows (I omit the various verbose lines):
+----------+
|    column|
+----------+
|1989-12-31|
+----------+
Update 1
I created a simple Java class that uses java.text.SimpleDateFormat :
import java.text.*;
import java.util.Date;

class testSimpleDateFormat {
    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        } catch (ParseException pe) {
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
This code does not work in my environment and throws this error:
java.text.ParseException: Unparseable date: "1989Dec31"
but works fine on another system (Ubuntu 15.04). This seems like a Java problem, but I don't know how to solve it. I installed the latest available version of Java and all my software has been updated.
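To compare the two machines, it might also help to look at the defaults of the JVM that Spark actually launches. A rough, untested way to do that from pyspark, via the py4j gateway (which I believe is exposed as spark.sparkContext._jvm):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("My app").getOrCreate()

# Untested sketch: peek at the driver JVM's defaults through py4j,
# to compare the macOS and Ubuntu environments.
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("java.version"))
print(jvm.java.util.Locale.getDefault().toString())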
Any ideas?
Update 2
I figured out how to make it work in pure Java by specifying Locale.US:
import java.text.*;
import java.util.Date;
import java.util.*;

class HelloWorldApp {
    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        } catch (ParseException pe) {
            System.out.println(pe);
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
Now the question is: how do I specify the Java Locale in pyspark?
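One idea I have not tried yet is forcing the JVM's default locale through extra Java options (user.language and user.country are standard JVM system properties). A rough, unverified sketch; I am not sure the driver option can even take effect when set from inside the script, so it may have to go on the spark-submit command line instead:

from pyspark.sql import SparkSession

# Unverified sketch: ask Spark to start its JVMs with a US English default locale.
# In client mode the driver JVM is already running when this code executes, so the
# driver option may need to be passed on the command line instead, e.g.
#   spark-submit --driver-java-options "-Duser.language=en -Duser.country=US" test.py
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.driver.extraJavaOptions", "-Duser.language=en -Duser.country=US") \
    .config("spark.executor.extraJavaOptions", "-Duser.language=en -Duser.country=US") \
    .getOrCreate()

Whether something along these lines is the right way to do it is exactly what I would like to know.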