Speeding up a lagged 60-day average in pandas

I am using data from a past task: panel data across several stores covering a period of 2.5 years. Each observation contains the number of customers for a particular store on a particular date. For each store and date, my goal is to calculate the average number of customers who visited that store over the past 60 days.

Below is the code that does exactly what I need. However, it takes forever: a whole night to process c. 800k rows. I am looking for a smarter way to achieve the same result faster.

I have included 5 observations of the original dataset with the relevant variables: store id ("Store"), "Date", and the number of customers ("Customers").

Note:

  • For each row in the iteration, I end up writing the results back with .loc rather than, for example, row["Lagged No customers"], because assigning to "row" does not write anything into the cells. I wonder why that is (a small sketch illustrating this behaviour follows this list).
  • I usually populate new columns with "apply, axis=1", so I would really appreciate a solution based on that. I have found that "apply" works fine when each row's calculation only uses values from other columns of that same row. However, I do not know how "apply" can reach across different rows, which is what this problem requires. The only exception I have seen so far is "diff", which is not useful here.
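
To illustrate the behaviour from the first note, here is a minimal sketch on toy data (not the actual dataset); it only demonstrates that assigning into "row" does not persist, while .loc does:

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3]})

    # Assigning into `row` only modifies the temporary Series that iterrows()
    # yields for each row; the underlying DataFrame is left untouched.
    for index, row in df.iterrows():
        row["y"] = row["x"] * 2
    print("y" in df.columns)        # False - nothing was written back

    # Writing through .loc with the row's index label does modify the DataFrame.
    for index, row in df.iterrows():
        df.loc[index, "y"] = row["x"] * 2
    print(df)                       # now has a "y" column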

Thanks.


Sample data:

    pd.DataFrame({
        'Store':     {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
        'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
        'Date':      {0: pd.Timestamp('2013-01-02 00:00:00'),
                      1: pd.Timestamp('2013-01-03 00:00:00'),
                      2: pd.Timestamp('2013-01-04 00:00:00'),
                      3: pd.Timestamp('2013-01-05 00:00:00'),
                      4: pd.Timestamp('2013-01-07 00:00:00')}
    })

Code that works, but incredibly slow:

    import pandas as pd
    import numpy as np

    data = pd.read_csv("Rossman - no of cust/dataset.csv")
    data.Date = pd.to_datetime(data.Date)
    data.Customers = data.Customers.astype(int)

    for index, row in data.iterrows():
        d = row["Date"]
        store = row["Store"]
        # observations of the same store strictly before d, but less than 60 days old
        time_condition = (d - data["Date"] < np.timedelta64(60, 'D')) & (d > data["Date"])
        sub_df = data.loc[time_condition & (data["Store"] == store), :]
        data.loc[(data["Date"] == d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
        data.loc[(data["Date"] == d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
        if len(sub_df["Customers"]) > 0:
            data.loc[(data["Date"] == d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum() / len(sub_df["Customers"]))
python pandas apply
1 answer

Given your small sample data, I used a two-day moving average, not 60 days.

    >>> (pd.rolling_mean(data.pivot(columns='Store', index='Date', values='Customers'),
                         window=2)
           .stack('Store'))
    Date        Store
    2013-01-03  1        623.0
    2013-01-04  1        598.5
    2013-01-05  1        627.0
    2013-01-07  1        710.0
    dtype: float64

By pivoting the DataFrame so that the dates become the index and the stores become the columns, you can simply take a moving average. You then need to stack the stores to get the data back into its original long shape.
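
(Side note beyond the original answer: pd.rolling_mean was deprecated and later removed from pandas. Assuming a reasonably recent pandas version, the same pivot / rolling / stack pipeline would look roughly like this sketch:)

    # Same idea with the newer .rolling() accessor:
    # pivot to Date x Store, take the moving mean down each store column,
    # then stack the stores back into a long Series (NaNs from the first
    # incomplete window are dropped by stack, as in the output above).
    result = (data.pivot(columns='Store', index='Date', values='Customers')
                  .rolling(window=2)
                  .mean()
                  .stack('Store'))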

Here is an example of what the pivoted source data looks like just before the final stack:

    Store            1      2      3
    Date
    2015-07-29   541.5  686.5  767.0
    2015-07-30   534.5  664.0  769.5
    2015-07-31   550.5  613.0  822.0

After .stack('Store') it will be:

    Date        Store
    2015-07-29  1        541.5
                2        686.5
                3        767.0
    2015-07-30  1        534.5
                2        664.0
                3        769.5
    2015-07-31  1        550.5
                2        613.0
                3        822.0
    dtype: float64

Assuming the stacked result above is named df, you can then merge it back into the original data as follows:

 data.merge(df.reset_index(), how='left', on=['Date', 'Store']) 

EDIT: There is a clear seasonal pattern in the data that you could adjust for. In any case, you probably want your moving-average window to be a multiple of seven so that it covers whole weeks. The example below uses a window of 63 days (9 weeks).

To avoid losing data for stores that have only just opened (and at the beginning of the time period), you can specify min_periods=1 in the rolling mean. This gives you the average of however many observations are available within the chosen window.

    df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]

    result = (pd.rolling_mean(df.pivot(columns='Store', index='Date', values='Customers'),
                              window=63, min_periods=1)
                .stack('Store'))
    result.name = 'Customers_63d_mvg_avg'

    df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')

    >>> df.sort_values(['Store', 'Date']).head(8)
                  Date  Store  Customers  Customers_63d_mvg_avg
    843212  2013-01-02      1        668             668.000000
    842103  2013-01-03      1        578             623.000000
    840995  2013-01-04      1        619             621.666667
    839888  2013-01-05      1        635             625.000000
    838763  2013-01-07      1        785             657.000000
    837658  2013-01-08      1        654             656.500000
    836553  2013-01-09      1        626             652.142857
    835448  2013-01-10      1        615             647.500000

To see more clearly what is happening, here is a toy example:

    s = pd.Series([1, 2, 3, 4, 5] + [np.NaN] * 2 + [6])

    >>> pd.concat([s, pd.rolling_mean(s, window=4, min_periods=1)], axis=1)
         0    1
    0    1  1.0
    1    2  1.5
    2    3  2.0
    3    4  2.5
    4    5  3.5
    5  NaN  4.0
    6  NaN  4.5
    7    6  5.5

The window spans four observations, but note that the final value of 5.5 is (5 + 6) / 2. The values 4.0 and 4.5 are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.

In our example, the NaN rows of the pivot table are not merged back into df, because we did a left join and every row in df has one or more Customers.
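
As an aside going beyond the original answer: a newer pandas also accepts a calendar-based window such as '63D' on a datetime index, which avoids the pivot entirely and measures the window in days rather than in observations. A rough sketch under that assumption, using the same column names:

    # Sketch only: assumes a pandas version that supports offset windows
    # and rolling on a groupby.
    df = (data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
              .sort_values(['Store', 'Date']))

    # 63 calendar days per store, averaging however many observations exist.
    rolled = (df.set_index('Date')
                .groupby('Store')['Customers']
                .rolling('63D', min_periods=1)
                .mean())
    rolled.name = 'Customers_63d_mvg_avg'

    df = df.merge(rolled.reset_index(), on=['Store', 'Date'], how='left')

Note that, like the pivot-based version, this window includes the current day, whereas the question's own loop excluded it (d > data["Date"]).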

You can plot the rolling data as follows:

 df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False) 

[Figure: line plot of the customer series and rolling averages for each store]
