Implementing the Empirical Distribution Function of Python (ecdf)

Question

Implementing the Empirical Distribution Function of Python (ecdf)

I know statsmodels.tools.tools.ECDF , but since the calculation of the empirical cumulative distribution function (ECDF) is pretty straightforward, and I want to minimize the dependencies in my project, I want to code it manually.

In the given list() / np.array() Pandas.Series ECDF for each element can be calculated as specified on Wikipedia :

enter image description here

I have a Pandas DataFrame, dfser , below, and I want to get the ecdf of the values column. Dana and my two single-line solutions.

Is there a faster way to do this? Speed is important in my application.

 # Note that in my case indices are unique identifiers so I cannot reset them. import numpy as np import pandas as pd # all indices are unique, but there may be duplicate measurement values (that belong to different indices). dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'], 'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05, 0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12, 1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09, 3.5063E-05, 0.01732035,2.08352E-12]}, index = [123, 532, 235, 645, 747, 856, 345, 245, 845, 248, 901, 712, 162, 126, 198,748, 127,395] ) # My 1st Solution - list comprehension dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']] # My 2nd Solution - ranking dfser['rank'] = dfser['values'].rank(ascending = 0) dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser) dfser group values ecdf rank ecdf_r 123 a 2.018990e-06 0.555556 9.0 0.555556 532 b 1.121860e-07 0.333333 13.0 0.333333 235 b 8.974670e-07 0.500000 10.0 0.500000 645 a 2.912570e-06 0.611111 8.0 0.611111 747 d 1.937330e-05 0.777778 5.0 0.777778 856 c 1.788900e-04 0.944444 2.0 0.944444 345 e 1.209630e-04 0.888889 3.0 0.888889 245 e 4.276430e-07 0.444444 11.0 0.444444 845 c 3.336140e-07 0.388889 12.0 0.388889 248 a 2.083520e-12 0.111111 17.5 0.083333 901 b 1.394780e-05 0.722222 6.0 0.722222 712 d 4.282550e-08 0.277778 14.0 0.277778 162 d 9.761900e-06 0.666667 7.0 0.666667 126 c 8.517870e-09 0.222222 15.0 0.222222 198 d 1.283440e-09 0.166667 16.0 0.166667 748 e 3.506300e-05 0.833333 4.0 0.833333 127 e 1.732035e-02 1.000000 1.0 1.000000 395 a 2.083520e-12 0.111111 17.5 0.083333

+6

python numpy pandas ecdf

Rhubarb Feb 19 '14 at 17:16

source share

1 answer

CT Zhu · Accepted Answer · 2014-02-19T17:51:14+0000

Since you are already using pandas , I think it would be foolish not to use some of its functions:

 In [15]: import numpy as np from numpy import * sq=ser.value_counts() sq.sort_index().cumsum()*1./len(sq) Out[15]: 2.083520e-12 0.058824 1.283440e-09 0.117647 8.517870e-09 0.176471 4.282550e-08 0.235294 1.121860e-07 0.294118 3.336140e-07 0.352941 4.276430e-07 0.411765 8.974670e-07 0.470588 2.018990e-06 0.529412 2.912570e-06 0.588235 9.761900e-06 0.647059 1.394780e-05 0.705882 1.937330e-05 0.764706 3.506300e-05 0.823529 1.209630e-04 0.882353 1.788900e-04 0.941176 1.732035e-02 1.000000 dtype: float64

And speed comparison

 In [19]: %timeit sq.sort_index().cumsum()*1./len(sq) 1000 loops, best of 3: 344 µs per loop In [18]: %timeit ser.value_counts().sort_index().cumsum()*1./len(ser.value_counts()) 1000 loops, best of 3: 1.58 ms per loop In [17]: %timeit [sum( ser <= x)/float(len(ser)) for x in ser] 100 loops, best of 3: 3.31 ms per loop

If all values are unique, ser.value_counts() no longer required. This part is slow (selection of unique values). All you need in this case is just to sort the ser .

 In [23]: %timeit np.arange(1, ser.size+1)/float(ser.size) 10000 loops, best of 3: 11.6 µs per loop

The fastest version I can think of is to use vectorize:

 In [35]: np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size Out[35]: array([ 0.55555556, 0.33333333, 0.5 , 0.61111111, 0.77777778, 0.94444444, 0.88888889, 0.44444444, 0.38888889, 0.11111111, 0.72222222, 0.27777778, 0.66666667, 0.22222222, 0.16666667, 0.83333333, 1. , 0.11111111])

Add let see:

 In [37]: %timeit dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']] 100 loops, best of 3: 6 ms per loop In [38]: %%timeit dfser['rank'] = dfser['values'].rank(ascending = 0) dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser) 1000 loops, best of 3: 827 µs per loop In [39]: %timeit np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size 10000 loops, best of 3: 91.1 µs per loop

Implementing the Empirical Distribution Function of Python (ecdf)

More articles: