The input data should be sorted by DATE within each group, which is already the case here.
The original input does not illustrate every situation well, so four extra rows have been added to it.
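The exact input frame is not reproduced in this answer, so as a convenience the sketch below rebuilds it from the printouts further down (the dtypes and row order are therefore assumptions):

import pandas as pd

# sample data reconstructed from the printed outputs below
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-05', '2015-06-07', '2015-06-07',
                            '2015-06-07', '2015-06-08', '2015-06-07', '2015-06-07',
                            '2015-08-07', '2015-05-15', '2015-05-30', '2015-07-30',
                            '2015-08-03', '2015-08-03']),
    'WIN':  ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes',
             'Yes', 'No', 'No', 'Yes', 'Yes']})
df = df[['ID', 'DATE', 'WIN']]   # keep the column order used in the printouts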
Column WIN1 is created from WIN - 1 for 'Yes' and 0 for 'No'. It is needed for both output columns.
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
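(A slightly shorter equivalent, offered only as an alternative to the map above, is to cast the boolean comparison directly:)

# alternative: boolean mask cast to 0/1
df['WIN1'] = (df['WIN'] == 'Yes').astype(int)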
NumOfDaysSinceLastWin Column
First a running-total column cumsum is created.
df['cumsum'] = df['WIN1'].cumsum()
If all WIN values were 'Yes', this would be easy - group the data and store the difference between each date and the previous one in a diffs column.
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d - d.shift()).fillna(0))
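(A tiny toy illustration of that per-group diff, on a made-up series rather than the answer's data:)

import pandas as pd

# toy example: consecutive date differences within one group
d = pd.Series(pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-08']))
diffs = (d - d.shift()).fillna(pd.Timedelta(0))   # 0 days, 2 days, 1 day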
But the situation is complicated by the 'No' values in the WIN column. For a 'Yes' row you need the difference from the previous 'Yes'; for a 'No' row you need the difference from the last 'Yes'. The difference can be calculated in many ways; here it is obtained by subtracting two columns - DATE and a helper column date1.
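(As an aside: on newer pandas versions, roughly 0.19 and later, the date of the last strictly earlier 'Yes' could probably be looked up directly with pd.merge_asof. This is only a sketch of such an alternative; the rest of the answer keeps the original approach.)

import pandas as pd

# sketch: for every row, find the last 'Yes' DATE strictly before it, per ID
wins = (df.loc[df['WIN'] == 'Yes', ['ID', 'DATE']]
          .drop_duplicates()
          .rename(columns={'DATE': 'last_win'})
          .sort_values('last_win'))
alt = pd.merge_asof(df.sort_values('DATE'), wins,
                    left_on='DATE', right_on='last_win',
                    by='ID', allow_exact_matches=False)
# note: rows come back sorted by DATE, not in the original order
alt['NumOfDaysSinceLastWin'] = (alt['DATE'] - alt['last_win']).dt.days.fillna(0).astype(int)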
date1 column
Rows must be grouped in a special way - each 'No' together with the last preceding 'Yes'. The running total in the cumsum column makes this possible. The minimum date of each such group is the date of the 'Yes' row, and that value is then repeated for the rows with 'No' values. The count column is a helper: where a cumsum value is not duplicated it is 1, and it grows with the size of each group.
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
The dates of the 'Yes' values in the previous rows are needed for the difference. Dataframe df1 contains only the 'Yes' rows filtered from df and is grouped by the ID column. The index does not change, so the result can be assigned back as a new column of the df data frame.
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1        NaT
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4        NaT
6   B 2015-06-07   No     0       5 2015-06-07      4        NaT
7   B 2015-06-07   No     0       5 2015-06-07      4        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07      1        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15      3        NaT
10  C 2015-05-30   No     0       7 2015-05-15      3        NaT
11  C 2015-07-30   No     0       7 2015-05-15      3        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03
Then the min column (the dates for the 'No' rows and their last previous 'Yes') and the date1 column (the other 'Yes' dates) can be combined with the help of the count column.
One more condition is added - date1 must be null (NaT), because only those values should be overwritten by the min column.
df.loc[(df['count'] > 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4 2015-06-07
6   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
7   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
8   B 2015-08-07  Yes     1       6 2015-08-07      1 2015-08-07
9   C 2015-05-15  Yes     1       7 2015-05-15      3 2015-05-15
10  C 2015-05-30   No     0       7 2015-05-15      3 2015-05-15
11  C 2015-07-30   No     0       7 2015-05-15      3 2015-05-15
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03
Repeated datetimes - a workaround
Sorry if this is done in a complicated way; maybe someone will find a better solution.
My solution finds the duplicated datetimes, fills them with the date of the last previous 'Yes' and puts that date into the date1 column for the difference.
These duplicates are counted in the count column (recomputed from df1, so only 'Yes' rows are counted). Where the count is 1 it is reset to NaN. Then the values from date1 are copied to date2 on the rows where count is not null.
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1, 'count'] = np.nan
df.loc[df['count'].notnull(), 'date2'] = df['date1']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-07 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-07 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-08-03 2015-08-03
Then, within each (ID, DATE) group, date2 is replaced by the group minimum, and the result is written back to the date1 column.
def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull(), 'date1'] = df['date2']
print df
   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-05 2015-06-05
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-05 2015-06-05
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-05-15 2015-05-15
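(The same repetition of the group minimum could probably be written without the repeat_value helper by using transform - just an alternative sketch:)

# alternative to apply(repeat_value): broadcast the per-(ID, DATE) minimum directly
df['date2'] = df.groupby(['ID', 'DATE'])['date2'].transform('min')
df.loc[df['date2'].notnull(), 'date1'] = df['date2']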
The NumOfDaysSinceLastWin column is populated with the difference between date1 and DATE. The data type is Timedelta, so it is converted to an integer. Finally, the unnecessary columns are dropped. (The columns WIN1 and count are still needed for the second output column, so they are kept.)
df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum', 'min', 'date1'], axis=1)
print df
   ID       DATE  WIN  WIN1  count  NumOfDaysSinceLastWin
0   A 2015-06-05  Yes     1      2                      0
1   A 2015-06-05  Yes     1      2                      0
2   A 2015-06-07  Yes     1      3                      2
3   A 2015-06-07  Yes     1      3                      2
4   A 2015-06-07  Yes     1      3                      2
5   A 2015-06-08   No     0    NaN                      1
6   B 2015-06-07   No     0    NaN                      0
7   B 2015-06-07   No     0    NaN                      0
8   B 2015-08-07  Yes     1    NaN                      0
9   C 2015-05-15  Yes     1    NaN                      0
10  C 2015-05-30   No     0    NaN                     15
11  C 2015-07-30   No     0    NaN                     76
12  C 2015-08-03  Yes     1      2                     80
13  C 2015-08-03  Yes     1      2                     80
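(On newer pandas the astype('timedelta64[D]') trick behaves differently or is not supported, so the .dt.days accessor is usually preferred; a hedged alternative, applied before date1 is dropped:)

# alternative conversion of the Timedelta difference to whole days (before dropping date1)
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).dt.days.fillna(0).astype(int)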
NumOfWinsInThePast30days Column
A rolling sum is your friend here. The yes column (needed for the resampling) holds 1 for 'Yes' and NaN for 'No'.
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
Dataframe df2 is a copy of df with the DATE column set as the index (needed for resampling). The columns that are not needed are dropped.
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin', 'WIN', 'WIN1'], axis=1)
Then df2 is resampled by day: each day gets the number of 'Yes' rows on that date (0 for days with only 'No' rows or no rows at all). (The printed output below makes this clearer.)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
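(The how= argument of resample was later removed from pandas; on newer versions an equivalent of this step would presumably look like the following:)

# newer-pandas equivalent of the resample step (DATE is the index of df2 here)
df2 = df2.groupby('ID').resample('D')['yes'].count().reset_index()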
The dataframe df2 is then grouped by ID, and the rolling_sum function is applied to these groups.
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
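(pd.rolling_sum was also removed in later pandas releases; the same rolling sum would presumably be spelled with the .rolling accessor:)

# newer-pandas equivalent of pd.rolling_sum over each ID group
df2['rollsum'] = (df2.groupby('ID')['yes']
                     .transform(lambda s: s.rolling(window=30, min_periods=1).sum()))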
For a better understanding, all rows of df2 are printed.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print df2

     ID       DATE  yes  rollsum
0     A 2015-06-05    2        2
1     A 2015-06-06    0        2
2     A 2015-06-07    3        5
3     A 2015-06-08    0        5
4     B 2015-06-07    0        0
5     B 2015-06-08    0        0
6     B 2015-06-09    0        0
7     B 2015-06-10    0        0
8     B 2015-06-11    0        0
9     B 2015-06-12    0        0
10    B 2015-06-13    0        0
11    B 2015-06-14    0        0
12    B 2015-06-15    0        0
13    B 2015-06-16    0        0
14    B 2015-06-17    0        0
15    B 2015-06-18    0        0
16    B 2015-06-19    0        0
17    B 2015-06-20    0        0
18    B 2015-06-21    0        0
19    B 2015-06-22    0        0
20    B 2015-06-23    0        0
21    B 2015-06-24    0        0
22    B 2015-06-25    0        0
23    B 2015-06-26    0        0
24    B 2015-06-27    0        0
25    B 2015-06-28    0        0
26    B 2015-06-29    0        0
27    B 2015-06-30    0        0
28    B 2015-07-01    0        0
29    B 2015-07-02    0        0
30    B 2015-07-03    0        0
31    B 2015-07-04    0        0
32    B 2015-07-05    0        0
33    B 2015-07-06    0        0
34    B 2015-07-07    0        0
35    B 2015-07-08    0        0
36    B 2015-07-09    0        0
37    B 2015-07-10    0        0
38    B 2015-07-11    0        0
39    B 2015-07-12    0        0
40    B 2015-07-13    0        0
41    B 2015-07-14    0        0
42    B 2015-07-15    0        0
43    B 2015-07-16    0        0
44    B 2015-07-17    0        0
45    B 2015-07-18    0        0
46    B 2015-07-19    0        0
47    B 2015-07-20    0        0
48    B 2015-07-21    0        0
49    B 2015-07-22    0        0
50    B 2015-07-23    0        0
51    B 2015-07-24    0        0
52    B 2015-07-25    0        0
53    B 2015-07-26    0        0
54    B 2015-07-27    0        0
55    B 2015-07-28    0        0
56    B 2015-07-29    0        0
57    B 2015-07-30    0        0
58    B 2015-07-31    0        0
59    B 2015-08-01    0        0
60    B 2015-08-02    0        0
61    B 2015-08-03    0        0
62    B 2015-08-04    0        0
63    B 2015-08-05    0        0
64    B 2015-08-06    0        0
65    B 2015-08-07    1        1
66    C 2015-05-15    1        1
67    C 2015-05-16    0        1
68    C 2015-05-17    0        1
69    C 2015-05-18    0        1
70    C 2015-05-19    0        1
71    C 2015-05-20    0        1
72    C 2015-05-21    0        1
73    C 2015-05-22    0        1
74    C 2015-05-23    0        1
75    C 2015-05-24    0        1
76    C 2015-05-25    0        1
77    C 2015-05-26    0        1
78    C 2015-05-27    0        1
79    C 2015-05-28    0        1
80    C 2015-05-29    0        1
81    C 2015-05-30    0        1
82    C 2015-05-31    0        1
83    C 2015-06-01    0        1
84    C 2015-06-02    0        1
85    C 2015-06-03    0        1
86    C 2015-06-04    0        1
87    C 2015-06-05    0        1
88    C 2015-06-06    0        1
89    C 2015-06-07    0        1
90    C 2015-06-08    0        1
91    C 2015-06-09    0        1
92    C 2015-06-10    0        1
93    C 2015-06-11    0        1
94    C 2015-06-12    0        1
95    C 2015-06-13    0        1
96    C 2015-06-14    0        0
97    C 2015-06-15    0        0
98    C 2015-06-16    0        0
99    C 2015-06-17    0        0
100   C 2015-06-18    0        0
101   C 2015-06-19    0        0
102   C 2015-06-20    0        0
103   C 2015-06-21    0        0
104   C 2015-06-22    0        0
105   C 2015-06-23    0        0
106   C 2015-06-24    0        0
107   C 2015-06-25    0        0
108   C 2015-06-26    0        0
109   C 2015-06-27    0        0
110   C 2015-06-28    0        0
111   C 2015-06-29    0        0
112   C 2015-06-30    0        0
113   C 2015-07-01    0        0
114   C 2015-07-02    0        0
115   C 2015-07-03    0        0
116   C 2015-07-04    0        0
117   C 2015-07-05    0        0
118   C 2015-07-06    0        0
119   C 2015-07-07    0        0
120   C 2015-07-08    0        0
121   C 2015-07-09    0        0
122   C 2015-07-10    0        0
123   C 2015-07-11    0        0
124   C 2015-07-12    0        0
125   C 2015-07-13    0        0
126   C 2015-07-14    0        0
127   C 2015-07-15    0        0
128   C 2015-07-16    0        0
129   C 2015-07-17    0        0
130   C 2015-07-18    0        0
131   C 2015-07-19    0        0
132   C 2015-07-20    0        0
133   C 2015-07-21    0        0
134   C 2015-07-22    0        0
135   C 2015-07-23    0        0
136   C 2015-07-24    0        0
137   C 2015-07-25    0        0
138   C 2015-07-26    0        0
139   C 2015-07-27    0        0
140   C 2015-07-28    0        0
141   C 2015-07-29    0        0
142   C 2015-07-30    0        0
143   C 2015-07-31    0        0
144   C 2015-08-01    0        0
145   C 2015-08-02    0        0
146   C 2015-08-03    2        2
The helper yes column is no longer needed and is dropped.
df2 = df2.drop(['yes'], axis=1 )
Merge the result back with the original df data frame.
df2 = pd.merge(df, df2, on=['DATE', 'ID'], how='inner')
print df2
  ID       DATE  WIN  WIN1  NumOfDaysSinceLastWin  yes  rollsum
0  A 2015-06-07  Yes     1                      0    1        2
1  A 2015-06-07  Yes     1                      0    1        2
2  B 2015-08-07   No     0                      0  NaN        1
3  B 2015-08-07  Yes     1                      0    1        1
4  C 2015-05-15  Yes     1                      0    1        1
5  C 2015-05-30   No     0                     15  NaN        1
6  C 2015-07-30   No     0                     76  NaN        0
7  C 2015-08-03  Yes     1                     80    1        1
Where the count column is not null, its values are copied into the WIN1 column. The rolling sum also counts the 'Yes' rows of the original df on the current date, so this amount has to be subtracted - and it is exactly what the WIN1 column now holds (1, or the duplicate count).
df2.loc[df['count'].notnull(), 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
Delete unnecessary columns.
df2 = df2.drop(['yes', 'WIN1', 'rollsum', 'count'], axis=1)
print df2
   ID       DATE  WIN  NumOfDaysSinceLastWin  NumOfWinsInThePast30days
0   A 2015-06-05  Yes                      0                         0
1   A 2015-06-05  Yes                      0                         0
2   A 2015-06-07  Yes                      2                         2
3   A 2015-06-07  Yes                      2                         2
4   A 2015-06-07  Yes                      2                         2
5   A 2015-06-08   No                      1                         5
6   B 2015-06-07   No                      0                         0
7   B 2015-06-07   No                      0                         0
8   B 2015-08-07  Yes                      0                         0
9   C 2015-05-15  Yes                      0                         0
10  C 2015-05-30   No                     15                         1
11  C 2015-07-30   No                     76                         0
12  C 2015-08-03  Yes                     80                         0
13  C 2015-08-03  Yes                     80                         0
And all together:
import pandas as pd
import numpy as np
import io