Speeding up custom aggregation functions

Question

I have a very simple setup: market data (ticks) in pandas dataframe df, for example:

    index         period  ask      bid
    00:00:00.126  42125   112.118  112.117
    00:00:00.228  42125   112.120  112.117
    00:00:00.329  42125   112.121  112.120
    00:00:00.380  42125   112.123  112.120
    00:00:00.432  42125   112.124  112.121
    00:00:00.535  41126   112.124  112.121
    00:00:00.586  41126   112.122  112.121
    00:00:00.687  41126   112.124  112.121
    00:00:01.198  41126   112.124  112.120
    00:00:01.737  41126   112.124  112.121
    00:00:02.243  41126   112.123  112.121

Now I use pandas groupby to aggregate by period:

 g=df.groupby('period')

It is easy to get the minimum and maximum prices for a period, for example.

    import numpy as np
    res=g.agg({'ask': [np.amax, np.amin]})

This is also fast enough. Now I also need the first and last price for the period. This is where the problem begins. Of course I can do:

 res=g.agg({'ask': lambda x: x[0]})

and it works mostly, but for large datasets it is very slow. Basically, the overhead for calling a Python function is just huge.

Does anyone know of a numpy function similar to np.amax that will return the first or last element of a group? I could not find one. iloc[0] does not do the trick because it is an object method, so I cannot pass it to g.agg as a function: at that stage I do not have an object to call it on (which is what the lambda is needed for).
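As far as I know numpy has no single grouped "first"/"last" reduction, but the idea can be sketched with plain array operations: sort the rows by group key with a stable sort, find where the key changes, and index the values at those boundaries. The function name `first_last_per_group` and its arguments are hypothetical, made up for this illustration.

```python
import numpy as np

def first_last_per_group(keys, values):
    """Return (unique_keys, first_values, last_values) per group.

    Hypothetical helper for illustration; rows keep their original
    relative order within each group because the sort is stable.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)
    if keys.size == 0:
        # Nothing to group: return three empty arrays.
        return keys, values, values
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    sorted_values = values[order]
    # A group starts at position 0 and wherever the sorted key changes.
    starts = np.flatnonzero(np.r_[True, sorted_keys[1:] != sorted_keys[:-1]])
    # Each group ends right before the next group starts.
    ends = np.r_[starts[1:], sorted_keys.size] - 1
    return sorted_keys[starts], sorted_values[starts], sorted_values[ends]

period = [42125] * 5 + [41126] * 6
ask = [112.118, 112.120, 112.121, 112.123, 112.124,
       112.124, 112.122, 112.124, 112.124, 112.124, 112.123]
groups, first_ask, last_ask = first_last_per_group(period, ask)
```

With a DataFrame this would be called as `first_last_per_group(df['period'].values, df['ask'].values)`, which avoids calling a Python function once per group.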

Now I'm not lazy, so I tried to write this myself in Cython.

    import numpy as np
    cimport numpy as np

    cpdef double first(np.ndarray array_series):
        return array_series[0]

But pandas will not accept this as an aggregation function because it passes a pandas Series object (pd.core.series), not an np.ndarray. (Never mind that one comes from the other; the compiler does not recognize this.)

Does anyone know how to write a Cython function that accepts a pandas Series without the overhead of Python?

+6

python numpy pandas csv cython

user5507059 Oct 30 '15 at 13:07

2 answers

An alternative is to simply resample and use OHLC (open=first, close=last, high=max, low=min):

    In [56]: df = DataFrame({'A' : np.arange(10), 'B' : pd.date_range('20130101',periods=5).tolist()*2})

    In [57]: df
    Out[57]:
       A          B
    0  0 2013-01-01
    1  1 2013-01-02
    2  2 2013-01-03
    3  3 2013-01-04
    4  4 2013-01-05
    5  5 2013-01-01
    6  6 2013-01-02
    7  7 2013-01-03
    8  8 2013-01-04
    9  9 2013-01-05

    In [58]: df.set_index('B').resample('D',how='ohlc')
    Out[58]:
                  A
               open  high  low  close
    B
    2013-01-01    0     5    0      5
    2013-01-02    1     6    1      6
    2013-01-03    2     7    2      7
    2013-01-04    3     8    3      8
    2013-01-05    4     9    4      9
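The `how=` argument shown in that session was removed from `resample` in later pandas releases. A minimal sketch of the same idea with the method-chaining form, assuming a recent pandas version; the explicit stable sort is my own addition so that "open" and "close" follow the original row order within each day:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.arange(10),
    "B": pd.date_range("20130101", periods=5).tolist() * 2,
})

# Index by timestamp; mergesort is stable, so rows sharing a date
# keep their original relative order.
series = df.set_index("B")["A"].sort_index(kind="mergesort")

# One row per day with open/high/low/close columns.
ohlc = series.resample("D").ohlc()
print(ohlc)
```

Resampling a single column returns flat `open`, `high`, `low`, `close` columns rather than the two-level header in the output above.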

+1

Jeff Oct 30 '15 at 14:16

Edchum · Accepted Answer · 2015-10-30T13:20:00+0000

IIUC, you can pass 'first' and 'last' as aggregation names:

    In [270]:
    g=df.groupby('period')
    res=g.agg({'ask': [np.amax, np.amin, 'first', 'last']})
    res

    Out[270]:
                ask
               amax     amin    first     last
    period
    41126   112.124  112.122  112.124  112.123
    42125   112.124  112.118  112.118  112.124
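For anyone who wants to reproduce this, here is a self-contained sketch built from the ask prices in the question. It uses the string names 'max' and 'min' in place of np.amax and np.amin, which is my substitution; the variable name `ticks` is also made up here. String names let pandas dispatch to its built-in grouped reductions instead of calling a Python function per group.

```python
import pandas as pd

# Ask prices and periods copied from the sample data in the question.
ticks = pd.DataFrame({
    "period": [42125] * 5 + [41126] * 6,
    "ask": [112.118, 112.120, 112.121, 112.123, 112.124,
            112.124, 112.122, 112.124, 112.124, 112.124, 112.123],
})

g = ticks.groupby("period")

# All four aggregations in one call, named by string.
res = g.agg({"ask": ["max", "min", "first", "last"]})

# The same first/last values via the dedicated groupby methods.
first_ask = g["ask"].first()
last_ask = g["ask"].last()
print(res)
```

One thing to keep in mind: the groupby `first` and `last` reductions return the first and last non-null value in each group, so they can differ from `x.iloc[0]` and `x.iloc[-1]` when the column contains NaN.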
