I have a very large pandas dataset, and at some point I need to use the following function:
def proc_trader(data):
    data['_seq'] = np.nan
and I call it with apply:
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader)
Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be arange? There are many trader-stock combinations in the data.
Minimum working example:
import pandas as pd
import numpy as np

reshaped = pd.DataFrame({
    'trader': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
    'stock': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
    'day': [0, 1, 2, 4, 5, 10, 1],
    'delta': [10, -10, 15, -10, -5, 5, 0],
    'out': [1, 1, 2, 2, 2, 0, 1],
})
reshaped.sort_values(by=['trader', 'stock', 'day'], inplace=True)
reshaped['cumq'] = reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']
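
To check whether the per-group overhead of apply matters more than anything inside the function, something like the following could be timed. This is only a rough sketch on synthetic data: make_frame, time_call and the group count of 2000 are made up for illustration, and a plain cumsum stands in for proc_trader. It compares one Python-level call per group (apply) against a single vectorized pass (transform) that produces the same values.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_groups, rows_per_group=5, seed=0):
    """Build a synthetic stand-in for the real data (sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    n_rows = n_groups * rows_per_group
    return pd.DataFrame({
        'trader': np.repeat(np.arange(n_groups), rows_per_group),
        'stock': 0,  # one stock per trader keeps the example small
        'delta': rng.integers(-10, 11, size=n_rows),
    })


def time_call(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


df = make_frame(n_groups=2000)
grouped = df.groupby(['trader', 'stock'])

# Number of groups, i.e. how many times apply calls the Python function.
n_groups = grouped.ngroups

# One Python-level function call per group.
applied, t_apply = time_call(
    lambda: grouped['delta'].apply(lambda s: s.cumsum())
)

# The same computation done in a single vectorized pass.
transformed, t_transform = time_call(
    lambda: grouped['delta'].transform('cumsum')
)

print(f"groups: {n_groups}")
print(f"apply:     {t_apply:.4f} s")
print(f"transform: {t_transform:.4f} s")
```

If the apply timing grows roughly in proportion to the number of groups even for such a trivial function, the cost is probably in the per-group calls rather than in arange itself.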