Pd.read_csv does not correctly handle the date / month field when setting parse_date = ['column name']

I ran into this error while trying to parse multiple dates via parse_dates from pandas.read_csv() . In the following code snippet, I am trying to parse dates with the format dd/mm/yy , which leads to an incorrect conversion. In some cases, the date field is considered a month and vice versa.

To keep it simple, in some cases dd/mm/yy converted to yyyy-dd-mm instead of yyyy-mm-dd .

Case 1:

  04/10/96 is parsed as 1996-04-10, which is wrong. 

Case 2:

  15/07/97 is parsed as 1997-07-15, which is correct. 

Case 3:

  10/12/97 is parsed as 1997-10-12, which is wrong. 

Code example

 import pandas as pd df = pd.read_csv('date_time.csv') print 'Data in csv:' print df print df['start_date'].dtypes print '----------------------------------------------' df = pd.read_csv('date_time.csv', parse_dates = ['start_date']) print 'Data after parsing:' print df print df['start_date'].dtypes 

Current output

 ---------------------- Data in csv: ---------------------- start_date 0 04/10/96 1 15/07/97 2 10/12/97 3 06/03/99 4 //1994 5 /02/1967 object ---------------------- Data after parsing: ---------------------- start_date 0 1996-04-10 1 1997-07-15 2 1997-10-12 3 1999-06-03 4 1994-01-01 5 1967-02-01 datetime64[ns] 

Expected Result

 ---------------------- Data in csv: ---------------------- start_date 0 04/10/96 1 15/07/97 2 10/12/97 3 06/03/99 4 //1994 5 /02/1967 object ---------------------- Data after parsing: ---------------------- start_date 0 1996-10-04 1 1997-07-15 2 1997-12-10 3 1999-03-06 4 1994-01-01 5 1967-02-01 datetime64[ns] 

Other comments:

I could use date_parser or pandas.to_datetime() to indicate the correct format for the date. But in my case, I have several date fields, for example ['//1997', '/02/1967'] , for which I need to convert ['01/01/1997','01/02/1967'] . parse_dates helps me convert these types of date fields to the expected format, without forcing me to write an extra line of code.

Is there any solution for this?

Link to @GitHub bug: https://github.com/pydata/pandas/issues/13063

+5
source share
1 answer

In pandas version 0.18.0 you can add the dayfirst=True parameter, and then it works:

 import pandas as pd import io temp=u"""start_date 04/10/96 15/07/97 10/12/97 06/03/99 //1994 /02/1967 """ #after testing replace io.StringIO(temp) to filename df = pd.read_csv(io.StringIO(temp), parse_dates = ['start_date'], dayfirst=True) start_date 0 1996-10-04 1 1997-07-15 2 1997-12-10 3 1999-03-06 4 1994-01-01 5 1967-02-01 

Another solution:

You can parse to_datetime with various format and errors='coerce' , and then combine_first :

 date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce') print date1 0 1996-10-04 1 1997-07-15 2 1997-12-10 3 1999-03-06 4 NaT 5 NaT Name: start_date, dtype: datetime64[ns] date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce') print date2 0 NaT 1 NaT 2 NaT 3 NaT 4 NaT 5 1967-02-01 Name: start_date, dtype: datetime64[ns] date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce') print date3 0 NaT 1 NaT 2 NaT 3 NaT 4 1994-01-01 5 NaT Name: start_date, dtype: datetime64[ns] 
 print date1.combine_first(date2).combine_first(date3) 0 1996-10-04 1 1997-07-15 2 1997-12-10 3 1999-03-06 4 1994-01-01 5 1967-02-01 Name: start_date, dtype: datetime64[ns] 
+3
source

All Articles