Pandas read_csv populates empty values with the string 'nan' instead of parsing the dates

I assign np.nan to the missing values in a DataFrame column, then write the DataFrame to a csv file with to_csv. The csv file correctly shows the missing values as nothing between commas when I open it in a text editor. But when I read the csv file back into a DataFrame with read_csv, the missing values become the string 'nan' instead of NaN, so isnull() does not work. For instance:

    In [13]: df
    Out[13]:
       index  value date
    0    975  25.35  nan
    1    976  26.28  nan
    2    977  26.24  nan
    3    978  25.76  nan
    4    979  26.08  nan

    In [14]: df.date.isnull()
    Out[14]:
    0    False
    1    False
    2    False
    3    False
    4    False

Am I doing something wrong? Should I assign some value other than np.nan to the missing entries so that isnull() can pick them up?

EDIT: Sorry, I forgot to mention that I also set parse_dates=[2] to parse this column. The column contains dates, with some rows missing. I would like the missing rows to be NaN.

EDIT: I just found out that the problem is indeed related to parse_dates. If the date column contains missing values, read_csv will not parse that column; instead, it reads the dates as strings and assigns the string 'nan' to the missing values.

    In [21]: data = pd.read_csv('test.csv', parse_dates=[1])

    In [22]: data
    Out[22]:
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d

    In [23]: data.date[3]
    Out[23]: 'nan'

pd.to_datetime does not work:

    In [12]: data
    Out[12]:
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d

    In [13]: data.dtypes
    Out[13]:
    value     int64
    date     object
    id       object

    In [14]: pd.to_datetime(data['date'])
    Out[14]:
    0    2013-3-1
    1    2013-3-1
    2    2013-3-1
    3         nan
    4    2013-3-1
    Name: date

Is there a way to make read_csv's parse_dates work with columns that contain missing values? That is, assign NaN to the missing values and still parse the valid dates?
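To illustrate why isnull() fails here: the literal string 'nan' is an ordinary string, not a missing value, so pandas does not treat it as null. A minimal sketch:

```python
import numpy as np
import pandas as pd

# A column holding the *string* 'nan' -- pandas sees a normal string
s = pd.Series(['2013-3-1', 'nan'])
print(s.isnull().tolist())   # [False, False]

# A column holding a real missing value
s2 = pd.Series(['2013-3-1', np.nan])
print(s2.isnull().tolist())  # [False, True]
```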

3 answers

This is currently a bug in the parser, see https://github.com/pydata/pandas/issues/3062 . An easy workaround is to force the column to be converted after you read it (the nans will be filled in with NaT, the Not-A-Time marker, equivalent to nan for datetimes). This should work on 0.10.1:

    In [22]: df
    Out[22]:
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       NaN  d
    4      6  2013-3-1  d

    In [23]: df.dtypes
    Out[23]:
    value     int64
    date     object
    id       object
    dtype: object

    In [24]: pd.to_datetime(df['date'])
    Out[24]:
    0   2013-03-01 00:00:00
    1   2013-03-01 00:00:00
    2   2013-03-01 00:00:00
    3                   NaT
    4   2013-03-01 00:00:00
    Name: date, dtype: datetime64[ns]

If the string "nan" appears in your data, you can do this:

    In [31]: s = Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

    In [32]: s
    Out[32]:
    0    2013-1-1
    1    2013-1-1
    2         nan
    3    2013-1-1
    dtype: object

    In [39]: s[s=='nan'] = np.nan

    In [40]: s
    Out[40]:
    0    2013-1-1
    1    2013-1-1
    2         NaN
    3    2013-1-1
    dtype: object

    In [41]: pandas.to_datetime(s)
    Out[41]:
    0   2013-01-01 00:00:00
    1   2013-01-01 00:00:00
    2                   NaT
    3   2013-01-01 00:00:00
    dtype: datetime64[ns]
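Putting the workaround together end-to-end, this sketch reads a csv with a missing date and converts the column afterwards (the inline csv text here is a stand-in for a real file):

```python
import io
import pandas as pd

# Stand-in for a file on disk; the third row has an empty date field
csv_text = "value,date,id\n2,2013-3-1,a\n3,2013-3-1,b\n5,,d\n"
df = pd.read_csv(io.StringIO(csv_text))  # read without parse_dates

# Force the conversion after the fact; the missing entry becomes NaT
df['date'] = pd.to_datetime(df['date'])

print(df['date'].isnull().sum())  # 1
```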

You can pass the parameter na_values=["nan"] to your read_csv call. This reads the string "nan" values and converts them to proper np.nan.

See here for more details.
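A sketch of this approach (again using inline csv text in place of a file):

```python
import io
import pandas as pd

csv_text = "value,date,id\n2,2013-3-1,a\n5,nan,d\n"

# na_values=['nan'] tells the parser to treat the literal string 'nan'
# as a missing value before date parsing runs
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'],
                 na_values=['nan'])

print(df['date'].isnull().tolist())  # [False, True]
```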


I had the same problem. I imported a csv file using

    dataframe1 = pd.read_csv(input_file, parse_dates=['date1', 'date2'])

where date1 contains valid dates and date2 is an empty column. Apparently, dataframe1['date2'] ends up populated entirely with nan.

The point is that after specifying the date columns and importing the data with read_csv, an empty date column is filled with the string "nan" instead of NaN.

The latter is recognized by numpy and pandas as NULL, while the former is not.

A simple solution:

    from numpy import nan
    dataframe.replace('nan', nan, inplace=True)

And then you should be good to go!
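For example (the column name date2 follows the snippet above and is purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [2, 5], 'date2': ['2013-3-1', 'nan']})

# Swap the literal string 'nan' for a real missing value
df.replace('nan', np.nan, inplace=True)

print(df['date2'].isnull().tolist())  # [False, True]
```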

