How to prevent pandas.to_datetime() from converting 0001-01-01 to 2001-01-01

I have read-only access to a database that I query with pymssql and read into a pandas DataFrame. One of the columns contains dates, some of which are stored as midnight on January 1, 0001 (i.e. 0001-01-01 00:00:00.0000000). I don't know why these dates are there; as far as I know, they are not recognized as valid SQL Server dates, and they are probably default placeholder entries. Still, this is what I have to work with. The data can be recreated as a DataFrame as follows:

    import numpy as np
    import pandas as pd

    tempDF = pd.DataFrame({
        'id': [0, 1, 2, 3, 4],
        'date': ['0001-01-01 00:00:00.0000000',
                 '2015-05-22 00:00:00.0000000',
                 '0001-01-01 00:00:00.0000000',
                 '2015-05-06 00:00:00.0000000',
                 '2015-05-03 00:00:00.0000000']})

The DataFrame looks like this:

    print(tempDF)
                              date  id
    0  0001-01-01 00:00:00.0000000   0
    1  2015-05-22 00:00:00.0000000   1
    2  0001-01-01 00:00:00.0000000   2
    3  2015-05-06 00:00:00.0000000   3
    4  2015-05-03 00:00:00.0000000   4

... with the following types:

    print(tempDF.dtypes)
    date    object
    id       int64
    dtype: object

I routinely convert date columns in a DataFrame to datetime format using:

 tempDF['date'] = pd.to_datetime(tempDF['date']) 

By chance, I noticed that the date 0001-01-01 is converted to 2001-01-01:

    print(tempDF)
            date  id
    0 2001-01-01   0
    1 2015-05-22   1
    2 2001-01-01   2
    3 2015-05-06   3
    4 2015-05-03   4

I understand that the dates in the source database are incorrect because SQL Server does not see 0001-01-01 as a valid date. But at least in the form 0001-01-01, such missing data is easy to identify in my pandas DataFrame. When pandas.to_datetime() shifts these dates into the acceptable range, however, it becomes very easy to overlook such outliers.

How can I make sure pd.to_datetime does not interpret these outlier dates incorrectly?

1 answer

If you provide a format, these dates will not be recognized:

    In [92]: pd.to_datetime(tempDF['date'], format="%Y-%m-%d %H:%M:%S.%f", errors='coerce')
    Out[92]:
    0          NaT
    1   2015-05-22
    2          NaT
    3   2015-05-06
    4   2015-05-03
    Name: date, dtype: datetime64[ns]

By default this raises an error, but with errors='coerce' these dates are converted to NaT values (coerce=True in older versions of pandas).
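If the placeholder rows also need to stay identifiable after the conversion, one option is to flag them while the column is still text. The following is a minimal sketch under that assumption, not part of the original answer; the DataFrame name `df` and the column name `date_is_placeholder` are made up for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "date": ["0001-01-01 00:00:00.0000000",
             "2015-05-22 00:00:00.0000000",
             "0001-01-01 00:00:00.0000000",
             "2015-05-06 00:00:00.0000000",
             "2015-05-03 00:00:00.0000000"],
})

# Flag the placeholder rows while the column still holds strings.
# "date_is_placeholder" is a hypothetical column name.
df["date_is_placeholder"] = df["date"].str.startswith("0001-01-01")

# Blank out the placeholders, then parse the remaining values with an
# explicit format so that no heuristic parsing is involved.
df["date"] = pd.to_datetime(
    df["date"].mask(df["date_is_placeholder"]),
    format="%Y-%m-%d %H:%M:%S.%f",
    errors="coerce",
)

print(df)
```

The placeholder rows end up as NaT in `date`, and the boolean column records which rows they were, so they can be told apart from any other missing values.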

The reason pandas converts "0001-01-01" to "2001-01-01" when no format is provided is that this is how dateutil behaves:

    In [32]: import dateutil.parser
    In [33]: dateutil.parser.parse("0001-01-01")
    Out[33]: datetime.datetime(2001, 1, 1, 0, 0)
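For contrast, parsing the same value with an explicit format keeps the year as written. This is a small sketch using only the standard library's `datetime.strptime`, which is not mentioned in the original answer; only the date part of the string is parsed, because `%f` in the standard library accepts at most six fractional digits.

```python
from datetime import datetime

raw = "0001-01-01 00:00:00.0000000"

# Keep only the "YYYY-MM-DD" part; the 7-digit fraction would not
# match the standard library's %f directive.
date_part = raw[:10]

# With an explicit %Y the four digits are read literally as year 1.
parsed = datetime.strptime(date_part, "%Y-%m-%d")
print(parsed.year)  # prints 1
```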