Pandas rolling sum over the last five minutes

Suppose I have the DataFrame below:

    Date, A
    2014-11-21 11:00:00, 1
    2014-11-21 11:03:00, 4
    2014-11-21 11:04:00, 1
    2014-11-21 11:05:00, 2
    2014-11-21 11:07:00, 4
    2014-11-21 11:08:00, 1
    2014-11-21 11:12:00, 1
    2014-11-21 11:13:00, 2

The first column is a datetime object, and the second column is an integer. For each row, I want to calculate the sum of column "A" over the last five minutes.

As an example, for the row 2014-11-21 11:12:00, 1, the sum of column "A" will be 2 (1 + 1), and for the row 2014-11-21 11:05:00, 2 it will be 7 (2 + 1 + 4). The important point is that the number of past rows falling inside the time window (5 minutes) is not the same for every row, since the time series is irregular.
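For reference, a minimal sketch that builds this frame directly (the values are simply copied from the data above):

    import pandas as pd

    df = pd.DataFrame({
        'Date': pd.to_datetime([
            '2014-11-21 11:00:00', '2014-11-21 11:03:00',
            '2014-11-21 11:04:00', '2014-11-21 11:05:00',
            '2014-11-21 11:07:00', '2014-11-21 11:08:00',
            '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
        'A': [1, 4, 1, 2, 4, 1, 1, 2],
    })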

How can I get the sum of the last five minutes of column "A" using the rolling_sum function in pandas? Thanks in advance.

1 answer

In general, if the dates are completely arbitrary, I think you have to use a Python for-loop over the rows, or use df.apply (which also runs a Python loop under the hood).

However, if your dates share a common frequency, as is the case above, then there is a trick that should be much faster than using df.apply: upsample the timeseries to the common frequency (in this case, 1 minute), fill the resulting NaNs with zeros, and then call rolling_sum:

    In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0),
                             window=5, min_periods=1).reindex(df['Date'])
    Out[279]:
                          A
    Date
    2014-11-21 11:00:00   1
    2014-11-21 11:03:00   5
    2014-11-21 11:04:00   6
    2014-11-21 11:05:00   7
    2014-11-21 11:07:00  11
    2014-11-21 11:08:00   8
    2014-11-21 11:12:00   2
    2014-11-21 11:13:00   3
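It may help to see the intermediate frame the trick builds before summing, i.e. df.set_index(['Date']).asfreq('1T').fillna(0) (a sketch of the expected values; the exact dtype and display depend on the pandas version):

                         A
    Date
    2014-11-21 11:00:00  1
    2014-11-21 11:01:00  0
    2014-11-21 11:02:00  0
    2014-11-21 11:03:00  4
    2014-11-21 11:04:00  1
    2014-11-21 11:05:00  2
    2014-11-21 11:06:00  0
    2014-11-21 11:07:00  4
    2014-11-21 11:08:00  1
    2014-11-21 11:09:00  0
    2014-11-21 11:10:00  0
    2014-11-21 11:11:00  0
    2014-11-21 11:12:00  1
    2014-11-21 11:13:00  2

Each row now covers exactly one minute, so a plain 5-row window corresponds to a 5-minute window.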

Of course, any time series has a common frequency if you are willing to accept a fine enough granularity, but the required size of df.asfreq(...) may make this trick impractical.
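As an aside, in newer pandas versions (0.19 and later) a rolling window can be given as a time offset directly, which handles irregular timestamps without the asfreq round-trip; pd.rolling_sum itself was deprecated and later removed. A sketch of that variant:

    # Time-based window: for each row, sum column A over the half-open
    # interval (t - 5min, t], however many rows fall inside it.
    # Requires the Date column to be sorted.
    df['rolling_sum'] = df.rolling('5min', on='Date')['A'].sum()

On the sample data this reproduces the result above; the window is closed on the right, so a row exactly five minutes earlier is excluded, just as in the 5-sample window used by the asfreq trick.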


Here is an example of a more general approach using df.apply. Note that the searchsorted call depends on df['Date'] being in sorted order.

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data', parse_dates=[0], sep=',\s*')

    # For each row, find the index of the first row whose Date lies
    # strictly inside the 5-minute window ending at that row's Date.
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))

    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index'] + 1].sum()

    df['rolling_sum'] = df.apply(sum_window, axis=1)

    print(df[['Date', 'A', 'rolling_sum']])

gives

                     Date  A  rolling_sum
    0 2014-11-21 11:00:00  1            1
    1 2014-11-21 11:03:00  4            5
    2 2014-11-21 11:04:00  1            6
    3 2014-11-21 11:05:00  2            7
    4 2014-11-21 11:07:00  4           11
    5 2014-11-21 11:08:00  1            8
    6 2014-11-21 11:12:00  1            2
    7 2014-11-21 11:13:00  2            3
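The searchsorted call is what defines the left edge of each window: with side='right' it returns, for each start date t - 5min, the position of the first row strictly later than it, so each window is the half-open interval (t - 5min, t]. A small standalone illustration using dates from the frame above:

    import pandas as pd

    dates = pd.to_datetime(['2014-11-21 11:03:00',
                            '2014-11-21 11:08:00',
                            '2014-11-21 11:12:00']).values

    # The window ending at 11:08 starts at 11:03; side='right' places the
    # boundary after any row at exactly 11:03, so only rows strictly
    # inside (t - 5min, t] are counted.
    start = pd.Timestamp('2014-11-21 11:08:00') - pd.Timedelta(minutes=5)
    print(dates.searchsorted(start.to_datetime64(), side='right'))  # -> 1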

Here is a benchmark comparing the df.asfreq trick and the df.apply call:

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data', parse_dates=[0], sep=',\s*')

    def big_df(df):
        df = df.copy()
        for i in range(7):
            dates = (df['Date']
                     + pd.Timedelta(df.iloc[-1]['Date'] - df.iloc[0]['Date'])
                     + pd.Timedelta('1 minute'))
            df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
            df = pd.concat([df, df2])
        df = df.reset_index(drop=True)
        return df

    def using_apply():
        start_dates = df['Date'] - pd.Timedelta(minutes=5)
        df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
        df['end_index'] = np.arange(len(df))

        def sum_window(row):
            return df['A'].iloc[row['start_index']:row['end_index'] + 1].sum()

        df['rolling_sum'] = df.apply(sum_window, axis=1)
        return df[['Date', 'rolling_sum']]

    def using_asfreq():
        result = (pd.rolling_sum(
            df.set_index(['Date']).asfreq('1T').fillna(0),
            window=5, min_periods=1).reindex(df['Date']))
        return result

    In [364]: df = big_df(df)

    In [367]: %timeit using_asfreq()
    1000 loops, best of 3: 1.21 ms per loop

    In [368]: %timeit using_apply()
    1 loops, best of 3: 208 ms per loop