There is a winsorize function in scipy.stats.mstats that you can use. Note, however, that it returns slightly different values than winsorize_series:
    In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
    Out[126]: 0.95000000000000007

    In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
    Out[127]: 1.0
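A likely reason for the difference, sketched below under an assumption: the value 0.95 is what you get if winsorize_series clips at the interpolated 5% quantile, whereas mstats.winsorize replaces the lowest and highest 5% of observations with the nearest remaining observed value. The helper `quantile_clip` is a hypothetical stand-in, not the actual winsorize_series from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

s = pd.Series(range(20), dtype='float')

def quantile_clip(s, lower=0.05, upper=0.95):
    # Hypothetical stand-in for winsorize_series (assumption):
    # clip at the linearly interpolated quantiles.
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

clipped = quantile_clip(s)

# mstats.winsorize works on counts: with 20 values and 5% limits it
# replaces one value at each end with its nearest retained neighbour.
wins = np.asarray(mstats.winsorize(s.to_numpy(), limits=[0.05, 0.05]))

print(clipped.iloc[0])  # interpolated quantile, about 0.95
print(wins[0])          # an actual observation from the data
```

So the quantile-based version can introduce values that never occur in the data, while mstats.winsorize only ever reuses existing observations. Which behaviour you want depends on your use case.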
Using mstats.winsorize instead of winsorize_series may be (depending on N, M, P) about 1.5x faster:
    import numpy as np
    import pandas as pd
    from scipy.stats import mstats

    def using_mstats_df(df):
        # Winsorize each column of the group independently
        return df.apply(using_mstats, axis=0)

    def using_mstats(s):
        # Replace the lowest and highest 5% of values
        return mstats.winsorize(s, limits=[0.05, 0.05])

    # N rows, M columns, P rows per date
    N, M, P = 10**5, 10, 10**2
    dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
    df = pd.DataFrame(np.random.random((N, M)), index=dates)
    df.index.names = ['DATE']
    grouped = df.groupby(level='DATE')
    In [122]: %timeit result = grouped.apply(winsorize_df)
    1 loops, best of 3: 17.8 s per loop

    In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
    1 loops, best of 3: 11.2 s per loop
unutbu