I have read-only access to a database that I query and read into a pandas DataFrame using pymssql. One of the variables contains dates, some of which are stored as midnight on January 01, 0001 (i.e. 0001-01-01 00:00:00.0000000). I don't know why these dates are there at all; as far as I know they are not valid SQL Server dates, and they are probably the result of some default value being inserted on data entry. Either way, I have to work with them. The situation can be recreated as a DataFrame as follows:
import numpy as np
import pandas as pd

tempDF = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                       'date': ['0001-01-01 00:00:00.0000000',
                                '2015-05-22 00:00:00.0000000',
                                '0001-01-01 00:00:00.0000000',
                                '2015-05-06 00:00:00.0000000',
                                '2015-05-03 00:00:00.0000000']})
The DataFrame looks like this:
print(tempDF)

                          date  id
0  0001-01-01 00:00:00.0000000   0
1  2015-05-22 00:00:00.0000000   1
2  0001-01-01 00:00:00.0000000   2
3  2015-05-06 00:00:00.0000000   3
4  2015-05-03 00:00:00.0000000   4
... with the following types:
print(tempDF.dtypes)

date    object
id       int64
dtype: object
I routinely convert DataFrame date fields to datetime format using:
tempDF['date'] = pd.to_datetime(tempDF['date'])
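For context, here is a quick check of my own (not part of the database output) showing why a year-1 date cannot survive the conversion unchanged: the datetime64[ns] dtype that pd.to_datetime produces only covers dates from roughly 1677 to 2262.

# Quick illustration: the datetime64[ns] range used by pandas cannot hold
# year 1 at all, so 0001-01-01 must be altered, rejected, or coerced.
print(pd.Timestamp.min)   # lower bound, a date in 1677
print(pd.Timestamp.max)   # upper bound, a date in 2262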
However, by chance, I noticed that the date 0001-01-01 is being converted to 2001-01-01.
print(tempDF)

        date  id
0 2001-01-01   0
1 2015-05-22   1
2 2001-01-01   2
3 2015-05-06   3
4 2015-05-03   4
I realise the dates in the source database are incorrect, given that SQL Server does not treat 0001-01-01 as a valid date. But at least while they remain in the form 0001-01-01, such missing data is easy to spot in my pandas DataFrame. When pandas.to_datetime() silently shifts these dates into the acceptable range, it becomes very easy to miss such outliers.
How can I make sure that pd.to_datetime does not incorrectly interpret these out-of-range dates?
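One workaround I can think of (a minimal sketch, assuming the placeholder is always exactly the string '0001-01-01 00:00:00.0000000') is to blank out that sentinel before converting, so it ends up as NaT rather than 2001-01-01, but I would prefer a more general way to stop the misinterpretation:

# Possible workaround sketch: treat the known placeholder string as missing
# before conversion, so it becomes NaT instead of being shifted to 2001-01-01.
placeholder = '0001-01-01 00:00:00.0000000'   # assumed to be the only sentinel value
tempDF.loc[tempDF['date'] == placeholder, 'date'] = np.nan
tempDF['date'] = pd.to_datetime(tempDF['date'])
print(tempDF)   # rows 0 and 2 now show NaT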