I have a very simple setup: market data (ticks) in a pandas DataFrame df, for example:
index         period  ask      bid
00:00:00.126  42125   112.118  112.117
00:00:00.228  42125   112.120  112.117
00:00:00.329  42125   112.121  112.120
00:00:00.380  42125   112.123  112.120
00:00:00.432  42125   112.124  112.121
00:00:00.535  41126   112.124  112.121
00:00:00.586  41126   112.122  112.121
00:00:00.687  41126   112.124  112.121
00:00:01.198  41126   112.124  112.120
00:00:01.737  41126   112.124  112.121
00:00:02.243  41126   112.123  112.121
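In case it helps anyone reproduce this, here is a minimal sketch that rebuilds the sample ticks shown above. Parsing the text with read_csv and keeping the timestamps as plain strings in the index are my assumptions for the example; the real data is loaded differently.

```python
import io

import pandas as pd

# Sample ticks copied from the example above (whitespace-separated).
raw = """index period ask bid
00:00:00.126 42125 112.118 112.117
00:00:00.228 42125 112.120 112.117
00:00:00.329 42125 112.121 112.120
00:00:00.380 42125 112.123 112.120
00:00:00.432 42125 112.124 112.121
00:00:00.535 41126 112.124 112.121
00:00:00.586 41126 112.122 112.121
00:00:00.687 41126 112.124 112.121
00:00:01.198 41126 112.124 112.120
00:00:01.737 41126 112.124 112.121
00:00:02.243 41126 112.123 112.121
"""

# The timestamp column becomes the index; it is left as strings here,
# since the grouping only depends on the 'period' column.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col="index")
print(df.head())
```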
Now I use pandas groupby to aggregate by period:
g=df.groupby('period')
It is easy to get the minimum and maximum prices for a period, for example.
import numpy as np
res=g.agg({'ask': [np.amax, np.amin]})
This is also fast enough. Now I also need the first and last price for the period. This is where the problem begins. Of course I can do:
res=g.agg({'ask': lambda x: x[0]})
and it mostly works, but for large datasets it is very slow. Basically, the overhead of calling a Python function for every group is just huge.
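To make the slowdown concrete, here is a rough timing sketch. It assumes synthetic data (the sizes and the random prices are made up, not my real ticks), uses the string "max" to stand in for np.amax, and uses x.iloc[0] inside the lambda because the synthetic frame has an integer index, where x[0] would be a label lookup. The actual timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic ticks (hypothetical sizes): 50,000 rows spread over 5,000 periods.
n_rows, n_periods = 50_000, 5_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "period": np.repeat(np.arange(n_periods), n_rows // n_periods),
    "ask": 112.0 + rng.random(n_rows),
})
g = df.groupby("period")

# Built-in reduction: dispatched to compiled code inside pandas.
t0 = time.perf_counter()
res_max = g.agg({"ask": "max"})
t_builtin = time.perf_counter() - t0

# Python lambda: called once per group, so the cost grows with the group count.
t0 = time.perf_counter()
res_first = g.agg({"ask": lambda x: x.iloc[0]})
t_lambda = time.perf_counter() - t0

print(f"built-in max: {t_builtin:.4f}s, lambda first: {t_lambda:.4f}s")
```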
Does anyone know of a numpy function, similar to np.amax, that returns the first or last element of a group? I could not find one. iloc[0] does not do the trick because it is an object method, so I cannot pass it to g.agg as a function: at that stage I do not have an object yet (which is why the lambda is required).
Now, I'm not lazy, so I tried to write this myself using Cython:
import numpy as np
cimport numpy as np

cpdef double first(np.ndarray array_series):
    return array_series[0]
But pandas will not accept this as an aggregation function because it passes a pd.core.series.Series object, not an np.ndarray. (Never mind that one comes from the other; the compiler does not recognize this.)
Does anyone know how to write a Cython function that accepts a pandas Series without incurring the Python overhead?