I have a very strange problem with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I am running Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried Anaconda 4.2.0 (which ships with Python 2.7.12), but got errors there as well.
The same code on Ubuntu Server 15.04 with the same version of Java and Python 2.7.9 works without errors.
The official documentation for spark.read.load() reads:
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
The official Java documentation describes MMM as the correct pattern for parsing month names such as Jan, Dec, etc., but when I use it I get a series of errors starting with java.lang.IllegalArgumentException. The documentation also mentions LLL, but pyspark does not recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.
I know of workarounds other than dateFormat, but this is the fastest way to parse the data and the one with the simplest code. What am I missing here?
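For reference, the kind of fallback I mean is reading the column as a plain string and converting it with a Python UDF. A rough, untested sketch (parse_ddMMMyyyy is just a name I made up; note that %b in strptime is itself locale-dependent):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType, StringType, StructType, StructField

spark = SparkSession.builder.appName("My app").getOrCreate()

# Untested sketch of the fallback: load the date column as a string,
# then convert it with a Python UDF instead of relying on dateFormat.
parse_ddMMMyyyy = udf(
    lambda s: datetime.strptime(s, "%d%b%Y").date() if s else None,
    DateType())

raw = spark.read.load(
    "test.csv",
    format="csv",
    sep=",",
    header="true",
    schema=StructType([StructField("column", StringType())]))

df = raw.withColumn("column", parse_ddMMMyyyy(raw["column"]))
df.show()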
To run the following examples, you just need to put test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.
My test case using ddMMMyyyy format
I have a text file called test.csv containing the following two lines:
col1
31Dec1989
and the code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])

df = spark.read.load(
    "test.csv",
    schema=struct,
    format="csv",
    sep=",",
    header="true",
    dateFormat="ddMMMyyyy",
    mode="FAILFAST")

df.show()
I get errors. I also tried moving the month name before or after the day and year (for example 1989Dec31 with the pattern yyyyMMMdd), without success.
Working example using ddMMyyyy format
This example is identical to the previous one, except for the date format. test.csv now contains:
col1
31121989
The following code prints the contents of test.csv :
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("column", DateType())])

df = spark.read.load(
    "test.csv",
    schema=struct,
    format="csv",
    sep=",",
    header="true",
    dateFormat="ddMMyyyy",
    mode="FAILFAST")

df.show()
The output is as follows (I omit the various verbose lines):
+----------+
|    column|
+----------+
|1989-12-31|
+----------+
Update 1
I created a simple Java class that uses java.text.SimpleDateFormat :
import java.text.*;
import java.util.Date;

class testSimpleDateFormat {
    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        } catch (ParseException pe) {
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
This code does not work in my environment and throws this error:
java.text.ParseException: Unparseable date: "1989Dec31"
but works fine on another system (Ubuntu 15.04). This seems like a Java problem, but I don't know how to solve it. I installed the latest available version of Java and all my software has been updated.
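To compare the two machines, it might also help to look at the defaults of the JVM that Spark actually launches. A rough, untested way to do that from pyspark, via the py4j gateway (which I believe is exposed as spark.sparkContext._jvm):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("My app").getOrCreate()

# Untested sketch: peek at the driver JVM's defaults through py4j,
# to compare the macOS and Ubuntu environments.
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("java.version"))
print(jvm.java.util.Locale.getDefault().toString())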
Any ideas?
Update 2
I figured out how to make it work in pure Java by specifying Locale.US:
import java.text.*;
import java.util.Date;
import java.util.*;

class HelloWorldApp {
    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        } catch (ParseException pe) {
            System.out.println(pe);
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
Now the question is: how do I specify the Java Locale in pyspark?
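One idea I have not tried yet is forcing the JVM's default locale through extra Java options (user.language and user.country are standard JVM system properties). A rough, unverified sketch; I am not sure the driver option can even take effect when set from inside the script, so it may have to go on the spark-submit command line instead:

from pyspark.sql import SparkSession

# Unverified sketch: ask Spark to start its JVMs with a US English default locale.
# In client mode the driver JVM is already running when this code executes, so the
# driver option may need to be passed on the command line instead, e.g.
#   spark-submit --driver-java-options "-Duser.language=en -Duser.country=US" test.py
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.driver.extraJavaOptions", "-Duser.language=en -Duser.country=US") \
    .config("spark.executor.extraJavaOptions", "-Duser.language=en -Duser.country=US") \
    .getOrCreate()

Whether something along these lines is the right way to do it is exactly what I would like to know.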