I have read-only access to a database that I query and read into a pandas DataFrame using pymssql. One of the variables contains dates, some of which are stored as midnight on January 01, 0001 (i.e. 0001-01-01 00:00:00.0000000). I don't know why these dates are there at all; as far as I know they are not valid SQL Server dates, and they are probably the result of some default value being inserted on data entry. Either way, I have to work with them. The situation can be recreated as a DataFrame as follows:
import numpy as np
import pandas as pd

tempDF = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                       'date': ['0001-01-01 00:00:00.0000000',
                                '2015-05-22 00:00:00.0000000',
                                '0001-01-01 00:00:00.0000000',
                                '2015-05-06 00:00:00.0000000',
                                '2015-05-03 00:00:00.0000000']})
The DataFrame looks like this:
print(tempDF)

                          date  id
0  0001-01-01 00:00:00.0000000   0
1  2015-05-22 00:00:00.0000000   1
2  0001-01-01 00:00:00.0000000   2
3  2015-05-06 00:00:00.0000000   3
4  2015-05-03 00:00:00.0000000   4
... with the following types:
print(tempDF.dtypes)

date    object
id       int64
dtype: object
I routinely convert DataFrame date fields to datetime format using:
tempDF['date'] = pd.to_datetime(tempDF['date'])
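For context, here is a quick check of my own (not part of the database output) showing why a year-1 date cannot survive the conversion unchanged: the datetime64[ns] dtype that pd.to_datetime produces only covers dates from roughly 1677 to 2262.

# Quick illustration: the datetime64[ns] range used by pandas cannot hold
# year 1 at all, so 0001-01-01 must be altered, rejected, or coerced.
print(pd.Timestamp.min)   # lower bound, a date in 1677
print(pd.Timestamp.max)   # upper bound, a date in 2262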
However, by chance, I noticed that the date 0001-01-01 is being converted to 2001-01-01.
print(tempDF)

        date  id
0 2001-01-01   0
1 2015-05-22   1
2 2001-01-01   2
3 2015-05-06   3
4 2015-05-03   4
I realise the dates in the source database are incorrect, given that SQL Server does not treat 0001-01-01 as a valid date. But at least while they remain in the form 0001-01-01, such missing data is easy to spot in my pandas DataFrame. When pandas.to_datetime() silently shifts these dates into the acceptable range, it becomes very easy to miss such outliers.
How can I make sure that pd.to_datetime does not incorrectly interpret these out-of-range dates?
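One workaround I can think of (a minimal sketch, assuming the placeholder is always exactly the string '0001-01-01 00:00:00.0000000') is to blank out that sentinel before converting, so it ends up as NaT rather than 2001-01-01, but I would prefer a more general way to stop the misinterpretation:

# Possible workaround sketch: treat the known placeholder string as missing
# before conversion, so it becomes NaT instead of being shifted to 2001-01-01.
placeholder = '0001-01-01 00:00:00.0000000'   # assumed to be the only sentinel value
tempDF.loc[tempDF['date'] == placeholder, 'date'] = np.nan
tempDF['date'] = pd.to_datetime(tempDF['date'])
print(tempDF)   # rows 0 and 2 now show NaT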