I am having trouble calculating a rolling count of transactions for each individual customer in this dataset, which is structured as follows:
userID  itemID   transaction_ts
3229    4493320  2016-01-02 14:55:00
3229    4492492  2016-01-02 14:57:02
3229    4496756  2016-01-04 09:01:18
3229    4493673  2016-01-04 09:11:10
3229    4497531  2016-01-04 11:05:25
3229    4495006  2016-01-05 07:25:11
4330    4500695  2016-01-02 09:17:21
4330    4500656  2016-01-03 09:19:28
4330    4503087  2016-01-04 07:42:15
4330    4501846  2016-01-04 08:55:24
4330    4504105  2016-01-04 09:59:35
Ideally, the result would look like the following for a rolling transaction count over a given window, e.g. 24 hours:
userID  itemID   transaction_ts       rolling_count
3229    4493320  2016-01-02 14:55:00  1
3229    4492492  2016-01-02 14:57:02  2
3229    4496756  2016-01-04 09:01:18  1
3229    4493673  2016-01-04 09:11:10  2
3229    4497531  2016-01-04 11:05:25  3
3229    4495006  2016-01-05 07:25:11  4
4330    4500695  2016-01-02 09:17:21  1
4330    4500656  2016-01-03 09:19:28  1
4330    4503087  2016-01-04 07:42:15  2
4330    4501846  2016-01-04 08:55:24  3
4330    4504105  2016-01-04 09:59:35  3
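To pin down what rolling_count means here, the following is a minimal plain-Python sketch of the intended semantics, not a proposed solution for the full dataset. It assumes the rows are already sorted by userID and then by transaction_ts, and that the window covers the 24 hours up to and including the current row (a transaction exactly 24 hours older is excluded). The function name `rolling_counts` and the variable names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

def rolling_counts(rows, window=timedelta(hours=24)):
    """Count each user's transactions in the window ending at each row.

    rows: iterable of (user_id, timestamp), sorted by user, then timestamp.
    """
    counts = []
    recent = deque()      # timestamps of the current user still inside the window
    current_user = None
    for user_id, ts in rows:
        if user_id != current_user:
            # A new user starts with an empty window, so the count resets.
            current_user = user_id
            recent.clear()
        recent.append(ts)
        # Drop timestamps that are 24 hours old or older relative to this row.
        while recent[0] <= ts - window:
            recent.popleft()
        counts.append(len(recent))
    return counts

# The sample data from the question, as (userID, transaction_ts) pairs.
raw = [
    (3229, "2016-01-02 14:55:00"), (3229, "2016-01-02 14:57:02"),
    (3229, "2016-01-04 09:01:18"), (3229, "2016-01-04 09:11:10"),
    (3229, "2016-01-04 11:05:25"), (3229, "2016-01-05 07:25:11"),
    (4330, "2016-01-02 09:17:21"), (4330, "2016-01-03 09:19:28"),
    (4330, "2016-01-04 07:42:15"), (4330, "2016-01-04 08:55:24"),
    (4330, "2016-01-04 09:59:35"),
]
sample = [(user, datetime.fromisoformat(ts)) for user, ts in raw]
counts = rolling_counts(sample)
print(counts)
```

Each timestamp is appended and removed at most once, so the pass is linear in the number of rows, but a Python-level loop like this may still be slow at 1m+ rows.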
There is a great answer to a similar problem: pandas rolling sum of the last five minutes
However, that answer depends solely on the timestamp field, whereas here the transaction count should reset to 1 when a transaction from a different user is encountered, as shown above. A solution could be found by slicing the data per user, but given the size of this dataset (potentially 1m+ rows) that is not feasible.
Crucially, the window should reflect the 24-hour period prior to the transaction_ts of the respective row, which is why I think a custom df.apply or rolling-window method is appropriate. I just can't figure out how to make it conditional on the userID.
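For reference, one way the per-user condition might be expressed is a time-based rolling window applied within a groupby. This is only a sketch under assumptions: it relies on a pandas version that supports offset windows such as "24h" in groupby(...).rolling(...), it assumes the frame can be sorted by userID and transaction_ts first, and it uses the default window closure, which counts the current row and excludes a row exactly 24 hours older.

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [3229] * 6 + [4330] * 5,
    "itemID": [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    "transaction_ts": [
        "2016-01-02 14:55:00", "2016-01-02 14:57:02", "2016-01-04 09:01:18",
        "2016-01-04 09:11:10", "2016-01-04 11:05:25", "2016-01-05 07:25:11",
        "2016-01-02 09:17:21", "2016-01-03 09:19:28", "2016-01-04 07:42:15",
        "2016-01-04 08:55:24", "2016-01-04 09:59:35",
    ],
})
df["transaction_ts"] = pd.to_datetime(df["transaction_ts"])

# Time-based windows need timestamps in increasing order within each user.
df = df.sort_values(["userID", "transaction_ts"]).reset_index(drop=True)

# Count rows per user in the 24 hours ending at each transaction.
rolled = (
    df.set_index("transaction_ts")
      .groupby("userID")["itemID"]
      .rolling("24h")
      .count()
)

# The result is ordered by userID, then timestamp, which matches the sorted
# frame, so the values can be assigned back by position.
df["rolling_count"] = rolled.to_numpy().astype(int)
print(df)
```

Because the window is evaluated inside each userID group, transactions from other users never enter the count, which gives the reset behaviour without slicing the frame manually.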