Pandas DatetimeIndex by MongoDB ISODate

I find it difficult to work with time / time zones. I have form JSON source data

{ "Date": "28 Sep 2009 00:00:00", .... } 

This data is then loaded into MongoDB, and this string representation of the date is converted to a JavaScript Date object. This conversion to UTC results in the following date

 { "_id": ObjectId("577a788f4439e17afd4e21f7"), "Date": ISODate("2009-09-27T23:00:00Z") } 

It “looks” as if the date had really been moved forward to the day, I suppose (possibly wrong) that this is due to the fact that my car is set to Irish standard time .

Then I read this data from MongoDB and used it to create pandas DatetimeIndex

 idx = pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D') 

which gives me

enter image description here

which is incorrect, since the time was not correctly converted from UTC to local time. So I implemented the solution given in this answer

 idx = pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D') idx = idx.tz_localize(tz=tz.tzutc()) idx = idx.tz_convert(tz=tz.tzlocal()) frame = DataFrame(test_docs, index=idx) frame = frame.drop('Date', 1) 

which gives me the right day ago

enter image description here

Then I normalize DatetimeIndex, so the clock is deleted, which allows me to group all the records per day.

 frame.groupby(idx).sum() 

At this point, however, something strange is happening. Dates are ultimately grouped as follows

enter image description here

but this does not reflect the dates in the frame

enter image description here

Can anyone shed some light on where I can go wrong?


Reply to @ptrj

Explicitly use my timezone as a string

 idx = pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D') idx = idx.tz_localize(tz=tz.tzutc()) idx = idx.tz_convert(tz='Europe/Dublin') idx = idx.normalize() frame = DataFrame(test_docs, index=idx) ... ... aggregate = frame.groupby(idx).sum() aggregate.plot() 

this does not work for me, it leads to the following schedule

enter image description here

For some reason, groupby is not grouping correctly in 2014, as shown below.

enter image description here

If instead I use

 idx = idx.tz_convert(tz.gettz('Europe/Dublin')) 

I get the same problem

Convert to Object

 idx = pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D') idx = idx.tz_localize(tz=tz.tzutc()) idx = idx.tz_convert(tz=tz.tzlocal()) idx = idx.normalize() frame = DataFrame(test_docs, index=idx) aggregate = frame.groupby(idx.astype(object)).sum() 

This approach works right for me

enter image description here

+6
source share
2 answers

I was able to reproduce the error with the following data:

 idx0 = pd.date_range('2011-11-11', periods=4) idx1 = idx0.tz_localize(tz.tzutc()) idx2 = idx1.tz_convert(tz.tzlocal()) df = pd.DataFrame([1, 2, 3, 4]) df.groupby(idx2).sum() Out[20]: 0 1970-01-01 00:00:00-05:00 9 2011-11-10 19:00:00-05:00 1 

This is a bug in pandas code, tz.tzlocal() to tz.tzlocal() . This also appears in:

 idx2.tz_localize(None) Out[27]: DatetimeIndex(['2011-11-10 19:00:00', '1970-01-01 00:00:00', '1970-01-01 00:00:00', '1970-01-01 00:00:00'], dtype='datetime64[ns]', freq='D') 

You can use any of the following solutions:

  • use explicit timezone as a string:

     idx2 = idx1.tz_convert(tz='Europe/Dublin') df.groupby(idx2).sum() Out[29]: 0 2011-11-11 00:00:00+00:00 1 2011-11-12 00:00:00+00:00 2 2011-11-13 00:00:00+00:00 3 2011-11-14 00:00:00+00:00 4 

    or if it does not work:

     idx2 = idx1.tz_convert(tz.gettz('Europe/Dublin')) 
  • convert it to an object:

     df.groupby(idx2.astype(object)).sum() Out[32]: 0 2011-11-10 19:00:00-05:00 1 2011-11-11 19:00:00-05:00 2 2011-11-12 19:00:00-05:00 3 2011-11-13 19:00:00-05:00 4 

Basically, converting to something other than DatetimeIndex with tz=tz.local() should work.


EDIT: This bug was only fixed in pandas github. The fix will be available in pandas version 0.19.

+2
source

I managed to get around this now by changing my groupby to the next

 frame.groupby([pd.DatetimeIndex([x.date() for x in frame.index])]).sum() 

so when i initially tried groupby

 idx = pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D') idx = idx.tz_localize(tz=tz.tzutc()) idx = idx.tz_convert(tz=tz.tzlocal()) frame.groupby(idx).sum() 

Now I call the date method for each index element before performing the groupby operation.

I send this as an answer if no one answers, but I hope someone answers and explains what happens, because my "decision" seems too hoarse for my tastes.

0
source

All Articles