The input data should be sorted by DATE within each group, which is already the case here.
The original input does not illustrate every situation well, so four extra rows have been added to it.
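The exact input frame is not reproduced in this answer, so as a convenience the sketch below rebuilds it from the printouts further down (the dtypes and row order are therefore assumptions):

import pandas as pd

# sample data reconstructed from the printed outputs below
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-05', '2015-06-07', '2015-06-07',
                            '2015-06-07', '2015-06-08', '2015-06-07', '2015-06-07',
                            '2015-08-07', '2015-05-15', '2015-05-30', '2015-07-30',
                            '2015-08-03', '2015-08-03']),
    'WIN':  ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes',
             'Yes', 'No', 'No', 'Yes', 'Yes']})
df = df[['ID', 'DATE', 'WIN']]   # keep the column order used in the printouts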
Column WIN1 is created from WIN - 1 for 'Yes' and 0 for 'No'. It is needed for both output columns.
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
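(A slightly shorter equivalent, offered only as an alternative to the map above, is to cast the boolean comparison directly:)

# alternative: boolean mask cast to 0/1
df['WIN1'] = (df['WIN'] == 'Yes').astype(int)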
NumOfDaysSinceLastWin Column
First a running-total column cumsum is created.
df['cumsum'] = df['WIN1'].cumsum()
If all WIN values were 'Yes', this would be easy - group the data and store the difference between each date and the previous one in a diffs column.
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d - d.shift()).fillna(0))
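(A tiny toy illustration of that per-group diff, on a made-up series rather than the answer's data:)

import pandas as pd

# toy example: consecutive date differences within one group
d = pd.Series(pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-08']))
diffs = (d - d.shift()).fillna(pd.Timedelta(0))   # 0 days, 2 days, 1 day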
But the situation is complicated by the 'No' values in the WIN column. For a 'Yes' row you need the difference from the previous 'Yes'; for a 'No' row you need the difference from the last 'Yes'. The difference can be calculated in many ways; here it is obtained by subtracting two columns - DATE and a helper column date1.
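(As an aside: on newer pandas versions, roughly 0.19 and later, the date of the last strictly earlier 'Yes' could probably be looked up directly with pd.merge_asof. This is only a sketch of such an alternative; the rest of the answer keeps the original approach.)

import pandas as pd

# sketch: for every row, find the last 'Yes' DATE strictly before it, per ID
wins = (df.loc[df['WIN'] == 'Yes', ['ID', 'DATE']]
          .drop_duplicates()
          .rename(columns={'DATE': 'last_win'})
          .sort_values('last_win'))
alt = pd.merge_asof(df.sort_values('DATE'), wins,
                    left_on='DATE', right_on='last_win',
                    by='ID', allow_exact_matches=False)
# note: rows come back sorted by DATE, not in the original order
alt['NumOfDaysSinceLastWin'] = (alt['DATE'] - alt['last_win']).dt.days.fillna(0).astype(int)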
date1 column
Rows must be grouped in a special way - each 'No' together with the last preceding 'Yes'. The running total in the cumsum column makes this possible. The minimum date of each such group is the date of the 'Yes' row, and that value is then repeated for the rows with 'No' values. The count column is a helper: where a cumsum value is not duplicated it is 1, and it grows with the size of each group.
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
The dates of the 'Yes' values in the previous rows are needed for the difference. Dataframe df1 contains only the 'Yes' rows filtered from df and is grouped by the ID column. The index does not change, so the result can be assigned back as a new column of the df data frame.
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1        NaT
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4        NaT
6   B 2015-06-07   No     0       5 2015-06-07      4        NaT
7   B 2015-06-07   No     0       5 2015-06-07      4        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07      1        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15      3        NaT
10  C 2015-05-30   No     0       7 2015-05-15      3        NaT
11  C 2015-07-30   No     0       7 2015-05-15      3        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03
Then the min column (the dates for the 'No' rows and their last previous 'Yes') and the date1 column (the other 'Yes' dates) can be combined with the help of the count column.
One more condition is added - date1 must be null (NaT), because only those values should be overwritten by the min column.
df.loc[(df['count'] > 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4 2015-06-07
6   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
7   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
8   B 2015-08-07  Yes     1       6 2015-08-07      1 2015-08-07
9   C 2015-05-15  Yes     1       7 2015-05-15      3 2015-05-15
10  C 2015-05-30   No     0       7 2015-05-15      3 2015-05-15
11  C 2015-07-30   No     0       7 2015-05-15      3 2015-05-15
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03
Repeated datetimes - a workaround
Sorry if this is done in a complicated way; maybe someone will find a better solution.
My solution finds the duplicated datetimes, fills them with the date of the last previous 'Yes' and puts that date into the date1 column for the difference.
These duplicates are counted in the count column (recomputed from df1, so only 'Yes' rows are counted). Where the count is 1 it is reset to NaN. Then the values from date1 are copied to date2 on the rows where count is not null.
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1, 'count'] = np.nan
df.loc[df['count'].notnull(), 'date2'] = df['date1']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-07 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-07 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-08-03 2015-08-03
Then, within each (ID, DATE) group, date2 is replaced by the group minimum, and the result is written back to the date1 column.
def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull(), 'date1'] = df['date2']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-05 2015-06-05
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-05 2015-06-05
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-05-15 2015-05-15
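(The same repetition of the group minimum could probably be written without the repeat_value helper by using transform - just an alternative sketch:)

# alternative to apply(repeat_value): broadcast the per-(ID, DATE) minimum directly
df['date2'] = df.groupby(['ID', 'DATE'])['date2'].transform('min')
df.loc[df['date2'].notnull(), 'date1'] = df['date2']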
The NumOfDaysSinceLastWin column is populated with the difference between date1 and DATE. The data type is Timedelta, so it is converted to an integer. Finally, the unnecessary columns are dropped. (The columns WIN1 and count are still needed for the second output column, so they are kept.)
df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum', 'min', 'date1'], axis=1)
print df
   ID       DATE  WIN  WIN1  count  NumOfDaysSinceLastWin
0   A 2015-06-05  Yes     1      2                      0
1   A 2015-06-05  Yes     1      2                      0
2   A 2015-06-07  Yes     1      3                      2
3   A 2015-06-07  Yes     1      3                      2
4   A 2015-06-07  Yes     1      3                      2
5   A 2015-06-08   No     0    NaN                      1
6   B 2015-06-07   No     0    NaN                      0
7   B 2015-06-07   No     0    NaN                      0
8   B 2015-08-07  Yes     1    NaN                      0
9   C 2015-05-15  Yes     1    NaN                      0
10  C 2015-05-30   No     0    NaN                     15
11  C 2015-07-30   No     0    NaN                     76
12  C 2015-08-03  Yes     1      2                     80
13  C 2015-08-03  Yes     1      2                     80
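(On newer pandas the astype('timedelta64[D]') trick behaves differently or is not supported, so the .dt.days accessor is usually preferred; a hedged alternative, applied before date1 is dropped:)

# alternative conversion of the Timedelta difference to whole days (before dropping date1)
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).dt.days.fillna(0).astype(int)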
NumOfWinsInThePast30days Column
A rolling sum is your friend here. The yes column (needed for the resampling) holds 1 for 'Yes' and NaN for 'No'.
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
Dataframe df2 is a copy of df with the DATE column set as the index (needed for resampling). The columns that are not needed are dropped.
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin', 'WIN', 'WIN1'], axis=1)
Then df2 is resampled by day: each day gets the number of 'Yes' rows on that date (0 for days with only 'No' rows or no rows at all). (The printed output below makes this clearer.)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
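(The how= argument of resample was later removed from pandas; on newer versions an equivalent of this step would presumably look like the following:)

# newer-pandas equivalent of the resample step (DATE is the index of df2 here)
df2 = df2.groupby('ID').resample('D')['yes'].count().reset_index()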
The dataframe df2 is then grouped by ID, and the rolling_sum function is applied to these groups.
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
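(pd.rolling_sum was also removed in later pandas releases; the same rolling sum would presumably be spelled with the .rolling accessor:)

# newer-pandas equivalent of pd.rolling_sum over each ID group
df2['rollsum'] = (df2.groupby('ID')['yes']
                     .transform(lambda s: s.rolling(window=30, min_periods=1).sum()))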
For a better understanding, all rows of df2 are printed.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print df2

     ID       DATE  yes  rollsum
0     A 2015-06-05    2        2
1     A 2015-06-06    0        2
2     A 2015-06-07    3        5
3     A 2015-06-08    0        5
4     B 2015-06-07    0        0
5     B 2015-06-08    0        0
6     B 2015-06-09    0        0
7     B 2015-06-10    0        0
8     B 2015-06-11    0        0
9     B 2015-06-12    0        0
10    B 2015-06-13    0        0
11    B 2015-06-14    0        0
12    B 2015-06-15    0        0
13    B 2015-06-16    0        0
14    B 2015-06-17    0        0
15    B 2015-06-18    0        0
16    B 2015-06-19    0        0
17    B 2015-06-20    0        0
18    B 2015-06-21    0        0
19    B 2015-06-22    0        0
20    B 2015-06-23    0        0
21    B 2015-06-24    0        0
22    B 2015-06-25    0        0
23    B 2015-06-26    0        0
24    B 2015-06-27    0        0
25    B 2015-06-28    0        0
26    B 2015-06-29    0        0
27    B 2015-06-30    0        0
28    B 2015-07-01    0        0
29    B 2015-07-02    0        0
30    B 2015-07-03    0        0
31    B 2015-07-04    0        0
32    B 2015-07-05    0        0
33    B 2015-07-06    0        0
34    B 2015-07-07    0        0
35    B 2015-07-08    0        0
36    B 2015-07-09    0        0
37    B 2015-07-10    0        0
38    B 2015-07-11    0        0
39    B 2015-07-12    0        0
40    B 2015-07-13    0        0
41    B 2015-07-14    0        0
42    B 2015-07-15    0        0
43    B 2015-07-16    0        0
44    B 2015-07-17    0        0
45    B 2015-07-18    0        0
46    B 2015-07-19    0        0
47    B 2015-07-20    0        0
48    B 2015-07-21    0        0
49    B 2015-07-22    0        0
50    B 2015-07-23    0        0
51    B 2015-07-24    0        0
52    B 2015-07-25    0        0
53    B 2015-07-26    0        0
54    B 2015-07-27    0        0
55    B 2015-07-28    0        0
56    B 2015-07-29    0        0
57    B 2015-07-30    0        0
58    B 2015-07-31    0        0
59    B 2015-08-01    0        0
60    B 2015-08-02    0        0
61    B 2015-08-03    0        0
62    B 2015-08-04    0        0
63    B 2015-08-05    0        0
64    B 2015-08-06    0        0
65    B 2015-08-07    1        1
66    C 2015-05-15    1        1
67    C 2015-05-16    0        1
68    C 2015-05-17    0        1
69    C 2015-05-18    0        1
70    C 2015-05-19    0        1
71    C 2015-05-20    0        1
72    C 2015-05-21    0        1
73    C 2015-05-22    0        1
74    C 2015-05-23    0        1
75    C 2015-05-24    0        1
76    C 2015-05-25    0        1
77    C 2015-05-26    0        1
78    C 2015-05-27    0        1
79    C 2015-05-28    0        1
80    C 2015-05-29    0        1
81    C 2015-05-30    0        1
82    C 2015-05-31    0        1
83    C 2015-06-01    0        1
84    C 2015-06-02    0        1
85    C 2015-06-03    0        1
86    C 2015-06-04    0        1
87    C 2015-06-05    0        1
88    C 2015-06-06    0        1
89    C 2015-06-07    0        1
90    C 2015-06-08    0        1
91    C 2015-06-09    0        1
92    C 2015-06-10    0        1
93    C 2015-06-11    0        1
94    C 2015-06-12    0        1
95    C 2015-06-13    0        1
96    C 2015-06-14    0        0
97    C 2015-06-15    0        0
98    C 2015-06-16    0        0
99    C 2015-06-17    0        0
100   C 2015-06-18    0        0
101   C 2015-06-19    0        0
102   C 2015-06-20    0        0
103   C 2015-06-21    0        0
104   C 2015-06-22    0        0
105   C 2015-06-23    0        0
106   C 2015-06-24    0        0
107   C 2015-06-25    0        0
108   C 2015-06-26    0        0
109   C 2015-06-27    0        0
110   C 2015-06-28    0        0
111   C 2015-06-29    0        0
112   C 2015-06-30    0        0
113   C 2015-07-01    0        0
114   C 2015-07-02    0        0
115   C 2015-07-03    0        0
116   C 2015-07-04    0        0
117   C 2015-07-05    0        0
118   C 2015-07-06    0        0
119   C 2015-07-07    0        0
120   C 2015-07-08    0        0
121   C 2015-07-09    0        0
122   C 2015-07-10    0        0
123   C 2015-07-11    0        0
124   C 2015-07-12    0        0
125   C 2015-07-13    0        0
126   C 2015-07-14    0        0
127   C 2015-07-15    0        0
128   C 2015-07-16    0        0
129   C 2015-07-17    0        0
130   C 2015-07-18    0        0
131   C 2015-07-19    0        0
132   C 2015-07-20    0        0
133   C 2015-07-21    0        0
134   C 2015-07-22    0        0
135   C 2015-07-23    0        0
136   C 2015-07-24    0        0
137   C 2015-07-25    0        0
138   C 2015-07-26    0        0
139   C 2015-07-27    0        0
140   C 2015-07-28    0        0
141   C 2015-07-29    0        0
142   C 2015-07-30    0        0
143   C 2015-07-31    0        0
144   C 2015-08-01    0        0
145   C 2015-08-02    0        0
146   C 2015-08-03    2        2
The helper yes column is no longer needed and is dropped.
df2 = df2.drop(['yes'], axis=1 )
Merge the result back with the original df data frame.
df2 = pd.merge(df, df2, on=['DATE', 'ID'], how='inner')
print df2
  ID       DATE  WIN  WIN1  NumOfDaysSinceLastWin  yes  rollsum
0  A 2015-06-07  Yes     1                      0    1        2
1  A 2015-06-07  Yes     1                      0    1        2
2  B 2015-08-07   No     0                      0  NaN        1
3  B 2015-08-07  Yes     1                      0    1        1
4  C 2015-05-15  Yes     1                      0    1        1
5  C 2015-05-30   No     0                     15  NaN        1
6  C 2015-07-30   No     0                     76  NaN        0
7  C 2015-08-03  Yes     1                     80    1        1
Where the count column is not null, its values are copied into the WIN1 column. The rolling sum also counts the 'Yes' rows of the original df on the current date, so this amount has to be subtracted - and it is exactly what the WIN1 column now holds (1, or the duplicate count).
df2.loc[df['count'].notnull(), 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
Delete unnecessary columns.
df2 = df2.drop(['yes', 'WIN1', 'rollsum', 'count'], axis=1)
print df2
   ID       DATE  WIN  NumOfDaysSinceLastWin  NumOfWinsInThePast30days
0   A 2015-06-05  Yes                      0                         0
1   A 2015-06-05  Yes                      0                         0
2   A 2015-06-07  Yes                      2                         2
3   A 2015-06-07  Yes                      2                         2
4   A 2015-06-07  Yes                      2                         2
5   A 2015-06-08   No                      1                         5
6   B 2015-06-07   No                      0                         0
7   B 2015-06-07   No                      0                         0
8   B 2015-08-07  Yes                      0                         0
9   C 2015-05-15  Yes                      0                         0
10  C 2015-05-30   No                     15                         1
11  C 2015-07-30   No                     76                         0
12  C 2015-08-03  Yes                     80                         0
13  C 2015-08-03  Yes                     80                         0
And all together:
import pandas as pd
import numpy as np
import io