Faster way to remove outliers by groups in large pandas DataFrames

I have a relatively large DataFrame (about a million rows, hundreds of columns) and I would like to clip outliers in each column by group. By "clip outliers for each column by group" I mean: calculate the 5% and 95% quantiles of each column within the group, then clip values outside that quantile range.

Here is the setup I'm using now:

    def winsorize_series(s):
        q = s.quantile([0.05, 0.95])
        if isinstance(q, pd.Series) and len(q) == 2:
            s[s < q.iloc[0]] = q.iloc[0]
            s[s > q.iloc[1]] = q.iloc[1]
        return s

    def winsorize_df(df):
        return df.apply(winsorize_series, axis=0)

and then, with my DataFrame called features and indexed on DATE , I can do

    grouped = features.groupby(level='DATE')
    result = grouped.apply(winsorize_df)

This works, except that it is very slow, presumably due to the nested apply calls: one over the groups, and then one over each column within each group. I tried to get rid of the second apply by computing the quantiles for all columns at once, but got stuck trying to clip each column by a different threshold. Is there a faster way to do this?
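For reference, the per-column broadcasting that the question gets stuck on can be done with DataFrame.clip: quantile() on a group returns a Series of thresholds indexed by column name, and clip(..., axis=1) aligns those thresholds to the columns, removing the inner apply. A minimal sketch on hypothetical stand-in data:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `features`: rows indexed
# by DATE, a few feature columns.
dates = pd.to_datetime(['2001-01-01'] * 10 + ['2001-01-02'] * 10)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((20, 3)), index=dates,
                        columns=['f0', 'f1', 'f2'])
features.index.name = 'DATE'

def clip_group(g):
    # One quantile call per group yields per-column thresholds;
    # clip(axis=1) broadcasts each threshold to its column.
    lo = g.quantile(0.05)
    hi = g.quantile(0.95)
    return g.clip(lower=lo, upper=hi, axis=1)

result = features.groupby(level='DATE', group_keys=False).apply(clip_group)
```

This keeps one apply over the groups but replaces the per-column loop with a single vectorized clip.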

4 answers

There is a winsorize function in scipy.stats.mstats that you can use. Note, however, that it returns slightly different values than winsorize_series:

    In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
    Out[126]: 0.95000000000000007

    In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
    Out[127]: 1.0
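The difference comes from how each method picks the cutoff: pandas' quantile interpolates between order statistics, while mstats.winsorize replaces the clipped tail with an actual data point (the smallest/largest value that survives). A small demonstration on the same data:

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

s = pd.Series(range(20), dtype='float')

# pandas interpolates: the 5% quantile of 0..19 falls between 0 and 1.
lo = s.quantile(0.05)

# mstats.winsorize replaces the lowest 5% of points (here, one point)
# with the smallest surviving value, which is an actual data point.
w = mstats.winsorize(s, limits=[0.05, 0.05])
```

So winsorize_series caps the low end at the interpolated 0.95, while mstats.winsorize caps it at the data value 1.0.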

Using mstats.winsorize instead of winsorize_series can be (depending on N, M, P) ~1.5x faster:

    import numpy as np
    import pandas as pd
    from scipy.stats import mstats

    def using_mstats_df(df):
        return df.apply(using_mstats, axis=0)

    def using_mstats(s):
        return mstats.winsorize(s, limits=[0.05, 0.05])

    N, M, P = 10**5, 10, 10**2
    dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
    df = pd.DataFrame(np.random.random((N, M)), index=dates)
    df.index.names = ['DATE']
    grouped = df.groupby(level='DATE')

    In [122]: %timeit result = grouped.apply(winsorize_df)
    1 loops, best of 3: 17.8 s per loop

    In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
    1 loops, best of 3: 11.2 s per loop

I found a pretty simple way to get this to work using the transform method in pandas.

    from scipy.stats import mstats

    lower_lim, upper_lim = 0.05, 0.05  # tail fractions to clip

    def winsorize_series(group):
        return mstats.winsorize(group, limits=[lower_lim, upper_lim])

    grouped = features.groupby(level='DATE')
    result = grouped.transform(winsorize_series)

A good way to approach this is with vectorization, and for that I like to use np.where.

    import pandas as pd
    import numpy as np
    from scipy.stats import mstats

    data = pd.Series(range(20), dtype='float')

    def WinsorizeCustom(data):
        quantiles = data.quantile([0.05, 0.95])
        q_05 = quantiles.loc[0.05]
        q_95 = quantiles.loc[0.95]
        out = np.where(data.values <= q_05, q_05,
                       np.where(data >= q_95, q_95, data))
        return out

For comparison, I wrapped the scipy function in a function of its own:

    def WinsorizeStats(data):
        out = mstats.winsorize(data, limits=[0.05, 0.05])
        return out

But, as you can see, although my function is pretty fast, it is still far from the Scipy implementation:

    %timeit WinsorizeCustom(data)
    # 1000 loops, best of 3: 842 µs per loop

    %timeit WinsorizeStats(data)
    # 1000 loops, best of 3: 212 µs per loop

If you are interested in reading more about speeding up pandas code, I would suggest "Optimizing Pandas for Speed" and "From Python to Numpy".


Here is a solution without using scipy.stats.mstats:

    def clip_series(s, lower, upper):
        # Series.clip takes no axis argument here; clip each value to
        # the series' own quantiles.
        clipped = s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
        return clipped

    # Manage list of features to be winsorized
    feature_list = list(features.columns)
    for f in feature_list:
        features[f] = clip_series(features[f], 0.05, 0.95)
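Note that this loop clips each column over the whole frame and ignores the DATE grouping from the question. A grouped, scipy-free variant (a sketch using groupby(...).transform on hypothetical data) could look like:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame indexed by DATE.
dates = pd.to_datetime(['2001-01-01'] * 10 + ['2001-01-02'] * 10)
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.random((20, 2)), index=dates, columns=['x', 'y'])
features.index.name = 'DATE'

# transform receives each column of each group as a Series and
# returns a result with the original shape and index.
result = features.groupby(level='DATE').transform(
    lambda s: s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
)
```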
