Grouping a Pandas DataFrame by n days starting at the beginning of the day

Question

I just discovered the power of Pandas and I like it, but I cannot figure out this problem:

I have a DataFrame df.head() :

          lon        lat         h  filename                       time
 0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467

I would like to group my data into nine-day intervals:

 gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

First group:

 2002-05-15 12:59:31.717467
          lon        lat         h  filename                       time
 0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
 ...

Next group:

 2002-05-24 12:59:31.717467
            lon        lat    height  filename                       time
 815  18.309498  80.457024  0.187387     60309 2002-05-24 16:35:39.553563
 816  18.291458  80.458514  0.061446     60309 2002-05-24 16:35:39.610563
 817  18.273408  80.460014  0.129255     60309 2002-05-24 16:35:39.667563
 818  18.255358  80.461504  0.046761     60309 2002-05-24 16:35:39.724563
 ...

So the data is grouped into nine-day intervals counted from the first timestamp (12:59:31.717467), not from the beginning of that day, which is what I want.

When grouping for one day:

 gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

 2002-05-15 00:00:00
          lon        lat         h  filename                       time
 0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
 ...

I could just loop over the days until I have a nine-day interval, but I think it can be done in a smarter way. I am looking for a Grouper freq option that is the equivalent of YS (start of year) but for days, a way to set the start time (perhaps using the Grouper option convention : {'start', 'end', 'e', 's'}), or some other approach.

I am running Python 3.5.2 with Pandas 0.19.0.

+7

python pandas

user1643523 Nov 11 '16 at 14:21

3 answers

If you truncate the datetimes to midnight of their day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to a date, e.g.

 df['date'] = df['time'].apply(lambda x:x.date())

However, you cannot use TimeGrouper unless the grouping key is a datetime, and date() returns plain date objects. Instead, you have two options. Either truncate the datetimes to midnight directly, as follows:

 df['date'] = df['time'].apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))

Alternatively, you can first generate the date values and then convert them back to datetimes with the pd.to_datetime() function:

 df['date'] = df['time'].apply(lambda x: x.date())
 df['date'] = pd.to_datetime(df['date'])
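The truncation options can be compared on a small synthetic frame. This is only a minimal sketch: the sample timestamps are made up, and Series.dt.normalize() is offered as a vectorised alternative that is not part of the answer, assuming a pandas version that provides the .dt accessor.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'time' column of the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2002-05-15 12:59:31.717467",
        "2002-05-24 16:35:39.553563",
    ]),
    "h": [-0.077165, 0.187387],
})

# Option 1: zero out the time-of-day fields on each Timestamp.
by_replace = df["time"].apply(
    lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0)
)

# Option 2: go through plain date objects, then convert back to datetimes.
by_date = pd.to_datetime(df["time"].apply(lambda x: x.date()))

# Vectorised alternative (an assumption, not in the answer): normalize()
# sets every timestamp to midnight without a Python-level loop.
by_normalize = df["time"].dt.normalize()

df["date"] = by_normalize
print(df)
```

All three produce a datetime column at midnight, so any of them can serve as the grouping key.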

+1

mfitzp Nov 11 '16 at 14:59

To complete @mfitzp's answer, you can do this:

 df['dateonly'] = df['time'].apply(lambda x: x.date())

The only problem is that df['dateonly'] will then hold plain date objects rather than datetimes.

You need to convert it first:

 df['dateonly'] = pd.to_datetime(df['dateonly'])

Now you can group by it:

 gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))

As additional information: convention is used with a PeriodIndex, not a DatetimeIndex.

+1

Steven g Nov 11 '16 at 15:04

Nickil maveli · Accepted Answer · 2016-11-11T15:06:20+0000

Dropping the time of the first row:

Your best bet is to normalize the first row of the datetime column, so that its time is reset to 00:00:00 (midnight), and then group by the 9D interval:

 df.loc[0, 'time'] = df['time'].iloc[0].normalize()
 for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
     print(grp)

 #          lon        lat         h  filename                       time
 # 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
 # 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
 # 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
 # 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
 # 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
 # ......................................................................

This leaves the times in the other rows intact, so you do not lose that information.

Keeping the time of the first row:

If you want to keep the first row's time as it is, and only want the grouping to start from midnight, you can do:

 df_t_shift = df.shift()  # Shift every row one position down
 df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
 # Append the last row of df to the shifted frame to make up for the row lost by shifting
 df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)
 for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
     print(grp)

 #          lon        lat         h  filename                       time
 # 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
 # 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
 # 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
 # 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
 # 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
 # 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467
