Pandas groupby and rolling_apply ignoring NaNs

I have a pandas DataFrame and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.

For example, if a window contains [2, NaN, 1], the result should be 1.5, but currently I get NaN.
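For illustration, this is essentially the behaviour of np.nanmean on a single window, versus what I'm getting now:

    import numpy as np

    np.nanmean([2, np.nan, 1])   # 1.5 -- what I want (NaN ignored)
    np.mean([2, np.nan, 1])      # nan -- what I currently get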

I tried the following, but it does not work:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN'])) 

If I even try this:

 df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1) 

I get NaN in the output, so it must have something to do with how pandas works in the background.

Any ideas?

EDIT: Here is some sample code showing what I'm trying to do:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'var1'  : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                       'value' : [1, 2, 3, np.nan, 2, 3, 4, 1]})

    print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2,
        lambda x: np.mean([i for i in x if i is not np.nan and i != 'NaN']))

Result:

    0    NaN
    1    NaN
    2    2.0
    3    NaN
    4    2.5
    5    NaN
    6    3.0
    7    2.0

while I would like to have the following:

    0    NaN
    1    NaN
    2    2.0
    3    2.0
    4    2.5
    5    3.0
    6    3.0
    7    2.0
3 answers

As always with pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.

The operation you want to do is a little fiddly, because rolling operations on groupby objects are not currently NaN-aware (as of version 0.18.1). So we need a few short lines of code:

    g1 = df.groupby(['var1'])['value']              # group values
    g2 = df.fillna(0).groupby(['var1'])['value']    # fillna, then group values

    s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation

    s.reset_index(level=0, drop=True).sort_index()  # drop/sort index

The idea is to sum the values in the window (using sum), count the non-NaN values (using count), and then divide to find the mean. This code gives the following output, which matches your desired result:

    0    NaN
    1    NaN
    2    2.0
    3    2.0
    4    2.5
    5    3.0
    6    3.0
    7    2.0
    Name: value, dtype: float64
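If you then want the result alongside the original data, one option (a sketch; the column name rolling_mean is just a placeholder) is to align it back on the original index:

    result = s.reset_index(level=0, drop=True).sort_index()
    df['rolling_mean'] = result   # aligns on the original row index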

Testing this on a larger DataFrame (around 100,000 rows), the runtime was under 100 ms, significantly faster than any apply-based methods I tried.

It may be worth testing the different approaches on your actual data, as timings can be influenced by other factors such as the number of groups. It's a fairly safe bet that vectorized computation will win out, though.


The approach shown above works well for simple calculations such as the rolling mean. It will also work for more complex calculations (for example, a rolling standard deviation), although the implementation is more involved.

The general idea is to look at each simple routine that runs quickly in pandas (e.g. sum), fill all the null values with an identity element (e.g. 0), then use groupby and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the outputs of other such operations.

For example, to implement a groupby NaN-aware rolling variance (of which the standard deviation is the square root), we need to find "the mean of the squares minus the square of the mean", i.e. Var(X) = E[X^2] - (E[X])^2. Here's a sketch of what this could look like:

    def rolling_nanvar(df, window):
        """
        Group df by 'var1' values and then calculate rolling variance,
        adjusting for the number of NaN values in the window.

        Note: user may wish to edit this function to control degrees of
        freedom (n), depending on their overall aim.
        """
        g1 = df.groupby(['var1'])['value']
        g2 = df.fillna(0).groupby(['var1'])['value']
        # fill missing values with 0, square values and groupby
        g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])

        n = g1.rolling(window).count()

        mean_of_squares = g3.rolling(window).sum() / n
        square_of_mean = (g2.rolling(window).sum() / n)**2
        variance = mean_of_squares - square_of_mean
        return variance.reset_index(level=0, drop=True).sort_index()
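For example, on the sample df above it could be called like so (a NaN-aware rolling standard deviation is then just the square root):

    variances = rolling_nanvar(df, 3)
    stds = variances ** 0.5   # rolling standard deviation per group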

Note that this function may not be numerically stable (squaring can lead to overflow). pandas uses Welford's method internally to mitigate this problem.
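For reference, here is a minimal sketch of Welford's one-pass variance update (purely illustrative; it is not how pandas implements it internally):

    import numpy as np

    def welford_var(values):
        """One-pass (Welford) variance of a 1-D sequence, skipping NaNs."""
        n, mean, m2 = 0, 0.0, 0.0
        for x in values:
            if np.isnan(x):
                continue
            n += 1
            delta = x - mean
            mean += delta / n           # update running mean
            m2 += delta * (x - mean)    # accumulate squared deviations
        return m2 / n if n else np.nan  # population variance (ddof=0)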

In any case, this function, although it uses several operations, is still very fast. Here is a comparison with the shorter apply-based method proposed by Yak Pirozhenko:

    >>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows

    >>> %timeit df2.groupby('var1')['value'].apply(\
            lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
    1 loops, best of 3: 11 s per loop

    >>> %timeit rolling_nanvar(df2, 7)
    10 loops, best of 3: 110 ms per loop

In this case, vectorization is 100 times faster. Of course, depending on how much data you have, you may still wish to use apply, as it allows generality/brevity at the expense of performance.


Does this result meet your expectations? I slightly changed your solution, adding the min_periods parameter and using the right filter for nan.

    In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2,
                  lambda x: np.mean([i for i in x if not np.isnan(i)]),
                  min_periods=1)
    Out[164]:
    0    1.0
    1    2.0
    2    2.0
    3    2.0
    4    2.5
    5    3.0
    6    3.0
    7    2.0
    dtype: float64

Here is an alternative implementation without the list comprehension, although it also fails to populate the first entries of the output with np.nan:

    means = df.groupby('var1')['value'].apply(
        lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))
