As always with pandas, sticking to vectorized methods (i.e. avoiding `apply`) is essential for performance and scalability.

The operation you want to do is a little fiddly, because rolling operations on groupby objects are not currently NaN-aware (as of version 0.18.1). As such, we need a few short lines of code:
```python
g1 = df.groupby(['var1'])['value']               # group values
g2 = df.fillna(0).groupby(['var1'])['value']     # fillna, then group values

s = g2.rolling(2).sum() / g1.rolling(2).count()  # the actual computation

s.reset_index(level=0, drop=True).sort_index()   # drop/sort index
```
The idea is to sum the values in the window (using `sum`), count the non-NaN values (using `count`), and then divide to find the mean. This code gives the following output, which matches your desired output:
```
0    NaN
1    NaN
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0
Name: value, dtype: float64
```
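To see why splitting the calculation into a sum over the filled values and a count over the unfilled values reproduces a NaN-aware mean, here is a tiny illustration on a hypothetical two-value window (not the question's data):

```python
import numpy as np
import pandas as pd

window = pd.Series([3.0, np.nan])        # hypothetical window containing one NaN
window.fillna(0).sum() / window.count()  # 3.0, the same as np.nanmean(window)
```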
Testing this on a larger DataFrame (around 100,000 rows), the run time was under 100 ms, significantly faster than any `apply`-based methods I tried.
It may be worth testing the different approaches on your actual data, since timings can be influenced by other factors such as the number of groups. I'm fairly confident that vectorized computation will win out, though.
The approach shown above works well for simple calculations such as the rolling mean. It will also work for more complicated calculations (for example, a rolling standard deviation), although the implementation is more involved.
The general idea is to take each simple routine that is fast in pandas (e.g. `sum`) and fill any null values with that operation's identity element (e.g. 0). You can then use `groupby` and perform the rolling operation (e.g. `.rolling(2).sum()`), and finally combine the output with the outputs of other such operations.
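As a quick illustration of the recipe, here is a sketch of a NaN-aware rolling max per group, assuming the same `df` with `'var1'` and `'value'` columns as above. Here `-inf` plays the role of the identity element, and the count from the unfilled series masks windows that contained no real values:

```python
import numpy as np

g_cnt = df.groupby(['var1'])['value']                  # unfilled values, for counting
g_max = df.fillna(-np.inf).groupby(['var1'])['value']  # -inf is the identity element for max

m = g_max.rolling(2).max()                             # rolling max over the filled values
m = m.where(g_cnt.rolling(2).count() > 0)              # set all-NaN windows back to NaN
m.reset_index(level=0, drop=True).sort_index()
```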
To take a more involved example, implementing a NaN-aware groupby rolling variance (of which the standard deviation is the square root) requires finding "the mean of the squares minus the square of the mean." Here's a sketch of what that could look like:
```python
def rolling_nanvar(df, window):
    """
    Group df by 'var1' values and then calculate rolling variance,
    adjusting for the number of NaN values in the window.

    Note: user may wish to edit this function to control degrees of
    freedom (n), depending on their overall aim.
    """
    g1 = df.groupby(['var1'])['value']
    g2 = df.fillna(0).groupby(['var1'])['value']
    # fill missing values with 0, square values and groupby
    g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])

    n = g1.rolling(window).count()

    mean_of_squares = g3.rolling(window).sum() / n
    square_of_mean = (g2.rolling(window).sum() / n)**2
    variance = mean_of_squares - square_of_mean
    return variance.reset_index(level=0, drop=True).sort_index()
```
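As a usage sketch, taking the square root of the returned Series gives the corresponding NaN-aware rolling standard deviation:

```python
rolling_var = rolling_nanvar(df, 2)
rolling_std = rolling_var ** 0.5  # standard deviation is the square root of the variance
```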
Note that this function may not be numerically stable (squaring the values can lead to overflow). pandas uses Welford's algorithm internally to mitigate this problem.
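For reference, the core of Welford's update looks roughly like the plain-Python sketch below; this only illustrates the idea for a single sequence and is not pandas' internal code:

```python
def welford_nanvar(values):
    """Single-pass (Welford) population variance of an iterable, skipping NaN values."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        if x != x:                # NaN check without needing numpy
            continue
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates the sum of squared deviations
    return m2 / n if n else float('nan')  # ddof=0, matching rolling_nanvar above
```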
In any case, `rolling_nanvar`, although it uses several operations, is still very fast. Here is a comparison against the more concise `apply`-based method suggested by Yakym Pirozhenko:
```python
>>> df2 = pd.concat([df]*10000, ignore_index=True)
```
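A sketch of how such a comparison can be run in IPython is shown below; the `apply`-based line is an illustrative equivalent (a per-group rolling variance with `min_periods=1` and `ddof=0`) rather than the exact code from that answer, and absolute timings will depend on your data and hardware:

```python
>>> %timeit df2.groupby('var1')['value'].apply(lambda s: s.rolling(2, min_periods=1).var(ddof=0))
>>> %timeit rolling_nanvar(df2, 2)
```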
In this case, vectorization is about 100 times faster. Of course, depending on how much data you have, you may still prefer to use `apply`, since it gives you generality/brevity at the expense of performance.