How to speed up a very slow pandas apply function?

Question

I have a very large pandas dataset, and at some point I need to use the following function:

    def proc_trader(data):
        data['_seq'] = np.nan
        # mark every ending of a roundtrip with its index
        data.loc[data.cumq == 0, 'tag'] = np.arange(1, (data.cumq == 0).sum() + 1)
        # backfill the roundtrip index until previous roundtrip;
        # then fill the rest with 0s (roundtrip incomplete for most recent trades)
        data['_seq'] = data['tag'].bfill().fillna(0)
        # btw, why on earth this function returns a dataframe instead of the series `data['_seq']`??
        return data['_seq']

and I call it with apply:

    reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])[['cumq']].apply(proc_trader)

Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be np.arange? There are many trader/stock (name/product id) combinations in the data, so the function is called once for each of them.

Minimum working example:

    import pandas as pd
    import numpy as np

    reshaped = pd.DataFrame({'trader': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
                             'stock': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
                             'day': [0, 1, 2, 4, 5, 10, 1],
                             'delta': [10, -10, 15, -10, -5, 5, 0],
                             'out': [1, 1, 2, 2, 2, 0, 1]})

    reshaped.sort_values(by=['trader', 'stock', 'day'], inplace=True)
    reshaped['cumq'] = reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
    reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']

performance python pandas

ℕʘʘḆḽḘ Mar 16 '16 at 19:10

1 answer

John · Accepted Answer · 2016-03-16T22:04:56+0000

Nothing fancy here, just a few tweaks in a couple of places. There is really no need to wrap this in a function, so I didn't. On this tiny sample data, it runs about twice as fast as the original.

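    reshaped.sort_values(by=['trader', 'stock', 'day'], inplace=True)
    reshaped['cumq'] = reshaped.groupby(['trader', 'stock']).delta.cumsum()

    # flag the end of every roundtrip (cumulative position back to 0) with a 1
    reshaped.loc[reshaped.cumq == 0, '_spell'] = 1
    # a per-group cumsum turns the flags into roundtrip numbers 1, 2, 3, ...
    reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].cumsum()
    # backfill each number over its roundtrip; unfinished roundtrips get 0
    reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].bfill().fillna(0)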

Result:

       day  delta  out stock trader  cumq  _spell
    0    0     10    1     a      a    10     1.0
    1    1    -10    1     a      a     0     1.0
    2    2     15    2     a      a    15     2.0
    3    4    -10    2     a      a     5     2.0
    4    5     -5    2     a      a     0     2.0
    5   10      5    0     a      a     5     0.0
    6    1      0    1     b      a     0     1.0
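If you want to convince yourself that the vectorized version gives the same labels as the per-group logic from the question, here is a minimal sketch on the sample data. The helper names `make_sample`, `spell_loop` and `spell_vectorized` are made up for illustration, and the explicit Python loop is only a stand-in for the original `apply` call, not a faithful benchmark of it.

```python
import numpy as np
import pandas as pd

KEYS = ['trader', 'stock']


def make_sample():
    # Same toy data as in the question, sorted and with the running position.
    df = pd.DataFrame({
        'trader': ['a'] * 7,
        'stock': ['a'] * 6 + ['b'],
        'day': [0, 1, 2, 4, 5, 10, 1],
        'delta': [10, -10, 15, -10, -5, 5, 0],
        'out': [1, 1, 2, 2, 2, 0, 1],
    })
    df = df.sort_values(by=KEYS + ['day'])
    df['cumq'] = df.groupby(KEYS)['delta'].cumsum()
    return df


def spell_loop(df):
    # Reference: one Python-level pass per (trader, stock) group,
    # mirroring the logic of proc_trader from the question.
    parts = []
    for _, group in df.groupby(KEYS, sort=False):
        ends = group['cumq'] == 0
        tag = pd.Series(np.nan, index=group.index)
        tag.loc[ends] = np.arange(1, ends.sum() + 1)
        parts.append(tag.bfill().fillna(0))
    return pd.concat(parts).reindex(df.index)


def spell_vectorized(df):
    # Same labels using only built-in groupby operations.
    flag = pd.Series(np.where(df['cumq'] == 0, 1.0, np.nan), index=df.index)
    grouping = [df[k] for k in KEYS]
    numbered = flag.groupby(grouping).cumsum()
    return numbered.groupby(grouping).bfill().fillna(0)


sample = make_sample()
slow = spell_loop(sample)
fast = spell_vectorized(sample)
print(fast.tolist())
print((slow == fast).all())
```

On real data you could time both functions with `timeit`; the gap should grow with the number of trader/stock groups, since the loop pays Python overhead once per group while the built-in groupby methods do not.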
