Applying a special cumulative function to pandas dataframe

I have a DataFrame sorted by date:

    import pandas as pd

    df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                       'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                                '2016-04-30', '2016-05-31', '2016-06-30'],
                       'val': [10, 0, 5, 10, 0, 0],
                       'pct_val': [None, -10, None, None, -10, -10]})
    df = df.sort_values('date')
    print(df)

             date  idx  pct_val  val
    3  2016-04-30    2      NaN   10
    0  2016-04-30    1      NaN   10
    4  2016-05-31    2    -10.0    0
    1  2016-05-31    1    -10.0    0
    5  2016-06-30    2    -10.0    0
    2  2016-06-30    1      NaN    5

I want to group by idx and then apply a cumulative function with some simple logic: if pct_val is null, add val to the running total; otherwise, multiply the total by 1 + pct_val/100. Below, 'cumsum' shows the result of df.groupby('idx').val.cumsum(), and 'cumulative_func' shows the result I want.

             date  idx  pct_val  val  cumsum  cumulative_func
    3  2016-04-30    2      NaN   10      10             10.0
    0  2016-04-30    1      NaN   10      10             10.0
    4  2016-05-31    2    -10.0    0      10              9.0
    1  2016-05-31    1    -10.0    0      10              9.0
    5  2016-06-30    2    -10.0    0      10              8.1
    2  2016-06-30    1      NaN    5      15             14.0
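For example, here is a quick trace of the logic for idx == 2:

    total = 0
    total += 10                # 2016-04-30: pct_val is NaN, so add val  -> 10
    total *= 1 + (-10) / 100.  # 2016-05-31: pct_val is -10, so multiply -> 9.0
    total *= 1 + (-10) / 100.  # 2016-06-30: pct_val is -10, so multiply -> 8.1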

Is there a way to apply a custom cumulative function like this to a DataFrame, or is there a better way to achieve this?

2 answers

I don't believe there is an easy way to achieve this with vectorization. I would get something working first, then optimize for speed if necessary.

    import numpy as np
    import pandas as pd

    def cumulative_func(df):
        results = []
        for group in df.groupby('idx').groups.values():
            total = 0
            result = []
            # Walk the group's rows in date order, updating the running total.
            for p, v in df.loc[group, ['pct_val', 'val']].values:
                if np.isnan(p):
                    total += v
                else:
                    total *= (1 + .01 * p)
                result.append(total)
            results.append(pd.Series(result, index=group))
        return pd.concat(results).reindex(df.index)

    df['cumulative_func'] = cumulative_func(df)

    >>> df
             date  idx  pct_val  val  cumulative_func
    3  2016-04-30    2      NaN   10             10.0
    0  2016-04-30    1      NaN   10             10.0
    4  2016-05-31    2    -10.0    0              9.0
    1  2016-05-31    1    -10.0    0              9.0
    5  2016-06-30    2    -10.0    0              8.1
    2  2016-06-30    1      NaN    5             14.0
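As an aside, this particular recurrence (total = total * m + a, where m is 1 + pct_val/100 or 1, and a is val or 0) can be unrolled into a cumprod/cumsum form if you ever need the speed. A sketch, with an illustrative function name; note that the division by the cumulative product breaks down if any multiplier is zero (i.e. pct_val == -100):

    def cumulative_vectorized(g):
        # Per-row multiplier: 1 where we only add val, else 1 + pct_val/100.
        m = np.where(g['pct_val'].isnull(), 1.0, 1 + g['pct_val'] / 100.)
        # Per-row addend: val where pct_val is null, else 0.
        a = np.where(g['pct_val'].isnull(), g['val'], 0.0)
        M = np.cumprod(m)
        # total_n = M_n * sum_{i <= n} a_i / M_i
        return pd.Series(M * np.cumsum(a / M), index=g.index)

    df['cumulative_func'] = df.groupby('idx', group_keys=False).apply(cumulative_vectorized)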

First, I cleaned up your setup.

Setup

    import pandas as pd

    df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                       'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                                '2016-04-30', '2016-05-31', '2016-06-30'],
                       'val': [10, 0, 5, 10, 0, 0],
                       'pct_val': [None, -10, None, None, -10, -10]})
    df = df.sort_values(['date', 'idx'])
    print(df)

It looks like:

             date  idx  pct_val  val
    0  2016-04-30    1      NaN   10
    3  2016-04-30    2      NaN   10
    1  2016-05-31    1    -10.0    0
    4  2016-05-31    2    -10.0    0
    2  2016-06-30    1      NaN    5
    5  2016-06-30    2    -10.0    0

Solution

    def cumcustom(df):
        df = df.copy()
        running_total = 0
        # Iterate rows in order, carrying the running total through the group.
        for idx, row in df.iterrows():
            if pd.isnull(row['pct_val']):
                running_total += row['val']
            else:
                running_total *= row['pct_val'] / 100. + 1
            df.loc[idx, 'cumcustom'] = running_total
        return df

Then apply it:

    df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])

It looks like:

             date  idx  pct_val  val  cumcustom
    0  2016-04-30    1      NaN   10       10.0
    3  2016-04-30    2      NaN   10       10.0
    1  2016-05-31    1    -10.0    0        9.0
    4  2016-05-31    2    -10.0    0        9.0
    2  2016-06-30    1      NaN    5       14.0
    5  2016-06-30    2    -10.0    0        8.1
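If you would rather keep df's original index and simply add the column, something like this should also work: group_keys=False makes apply keep each group frame's own index, so the assignment aligns.

    out = df.groupby('idx', group_keys=False).apply(cumcustom)
    df['cumcustom'] = out['cumcustom']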
