Pandas - Rolling window - unevenly spaced timestamps

I'm having trouble calculating a rolling count of transactions for each individual customer in this dataset, which is structured as follows:

    userID   itemID   transaction_ts
    3229     4493320  2016-01-02 14:55:00
    3229     4492492  2016-01-02 14:57:02
    3229     4496756  2016-01-04 09:01:18
    3229     4493673  2016-01-04 09:11:10
    3229     4497531  2016-01-04 11:05:25
    3229     4495006  2016-01-05 07:25:11
    4330     4500695  2016-01-02 09:17:21
    4330     4500656  2016-01-03 09:19:28
    4330     4503087  2016-01-04 07:42:15
    4330     4501846  2016-01-04 08:55:24
    4330     4504105  2016-01-04 09:59:35

Ideally, the result would look like the table below, with a rolling count of each user's transactions within a given window, for example 24 hours:

    userID   itemID   transaction_ts       rolling_count
    3229     4493320  2016-01-02 14:55:00  1
    3229     4492492  2016-01-02 14:57:02  2
    3229     4496756  2016-01-04 09:01:18  1
    3229     4493673  2016-01-04 09:11:10  2
    3229     4497531  2016-01-04 11:05:25  3
    3229     4495006  2016-01-05 07:25:11  4
    4330     4500695  2016-01-02 09:17:21  1
    4330     4500656  2016-01-03 09:19:28  1
    4330     4503087  2016-01-04 07:42:15  2
    4330     4501846  2016-01-04 08:55:24  3
    4330     4504105  2016-01-04 09:59:35  3

There is a great answer to a similar problem: pandas rolling sum of last five minutes.

However, that answer relies solely on the timestamp field, whereas here the count should reset to 1 whenever a transaction from a different user is encountered, as in the table above. A solution that loops over slices of the frame could be written, but given the size of this dataset (potentially 1M+ rows) that is impractical.

Importantly, the window should cover the 24-hour period prior to the transaction_ts of the corresponding row, so I think some df.apply or rolling-window method is suitable; I just can't figure out how to make it conditional on the userID.
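
For what it's worth, newer pandas versions (0.19+) support offset-based rolling windows, which combined with groupby may express this directly. A minimal sketch, assuming transaction_ts is already parsed as datetime and the frame is sorted by userID and transaction_ts:

    import pandas as pd

    # Rolling 24h transaction count per user, using pandas' offset-based
    # rolling windows. Assumes df is sorted by userID, then transaction_ts,
    # so the flattened result aligns positionally with df's rows.
    counts = (
        df.set_index('transaction_ts')
          .groupby('userID')['itemID']
          .rolling('24h')
          .count()
    )
    df['rolling_count'] = counts.values.astype(int)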

2 answers

Part of the solution (a rolling cumsum) may already be here. (I just changed the lag type):

    import numpy as np
    import pandas as pd
    from datetime import timedelta

    def msum(s, lag):
        # Timestamp marking the start of each row's window.
        lag = s.index - timedelta(days=lag)
        # Position of the first observation inside each window.
        inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
        cs = s.cumsum()
        # Windowed sum = cumsum at the row, minus cumsum just before the window.
        return pd.Series(cs.values - cs[inds].values + s[inds].values, index=s.index)

The function requires the index to be of datetime type. In addition, the index within each userID group must already be sorted (as it is in your example).

    df = df.set_index('transaction_ts')
    df['rolling_count'] = 1
    # lag of 1 day = a 24-hour window
    df['rolling_count'] = df.groupby('userID', sort=False)['rolling_count'].transform(lambda x: msum(x, 1))

The groupby option sort=False may give some speedup. (It controls whether the group keys are sorted.)
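
As a quick sanity check, here is a sketch that runs the msum function above on one user's timestamps from the question; the printed values reproduce the expected rolling_count column (1, 2, 3, 4):

    import pandas as pd

    # The last four timestamps of userID 3229 from the sample data;
    # each row counts as one event.
    ts = pd.to_datetime(['2016-01-04 09:01:18', '2016-01-04 09:11:10',
                         '2016-01-04 11:05:25', '2016-01-05 07:25:11'])
    s = pd.Series(1, index=ts)
    print(msum(s, 1))  # 24h rolling counts: 1, 2, 3, 4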


I managed to get a solution that works, at least on a test set. ptrj got there first! The first solution to this question, Pandas Rolling Computations on Sliding Windows (Unevenly spaced), helped a lot.

As ptrj pointed out above, using df.groupby('userID') is the key.

    import numpy as np
    import pandas as pd
    from datetime import timedelta

    df = pd.read_excel('velocity.xlsx')  # read the dataframe in
    df = df.sort_values(['userID', 'transaction_ts'])
    df = df.reset_index(drop=True)  # ensure rows are ordered by userID | transaction_ts
    df['ones'] = 1

    def add_rolling_count(x, number_of_hours):
        # Timestamp marking the start of each row's window.
        x['lag'] = x['transaction_ts'] - timedelta(hours=number_of_hours)
        # Position of the first transaction inside each window.
        inds = np.searchsorted(np.array(x['transaction_ts'].astype(np.int64)),
                               np.array(x['lag'].astype(np.int64)))
        cs = x['ones'].reset_index(drop=True).cumsum()
        x['count'] = cs.values - cs[inds].values + 1
        return x

    df = df.groupby('userID').apply(lambda x: add_rolling_count(x, 24))
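
The helper columns can then be dropped and one user spot-checked against the expected table above; a sketch, assuming the code above has run on the question's sample data:

    # Drop the helper columns; 'count' should match the question's rolling_count.
    df = df.drop(['ones', 'lag'], axis=1)
    print(df.loc[df['userID'] == 4330, ['transaction_ts', 'count']])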
