How to vectorize Pandas calculation based on the last x lines of data

Question


I have fairly sophisticated prediction code that uses WLS on more than 20 columns with millions of rows each. Right now I use iterrows to loop through the dates and, based on each date and the values on that date, extract slices of different sizes for the calculation. It takes several hours to run. I have simplified the code as follows:

    import pandas as pd
    import numpy as np
    from datetime import timedelta

    df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
    df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

    def calculateC(A, dte):
        if A > 0:  # based on values has different cutoff length for trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].head(depth)  # use last year same date data for basis of prediction
        return df2.B.mean()  # uses WLS in my model but for simplification replace with mean

    for index, row in df.iterrows():
        if index > 365:
            df.loc[index, 'C'] = calculateC(row.A, row.dte)

I have read that iterrows is the main culprit because it is not an efficient way to use Pandas, and that I should use vectorized methods instead. However, I can't seem to find a way to vectorize this given the conditions (dates, different window lengths, and ranges of values). Is there a way?


python pandas

desmond Jun 26 '16 at 4:48


2 answers

John · Answer 1 · 2016-12-20T03:34:10+0000

I have good news and bad news. The good news is that I have something vectorized that is about 300 times faster; the bad news is that I can't quite replicate your results. Still, I think you can use the principles here to speed up your code considerably, even if this code does not reproduce your results at the moment.

    df['result'] = np.where(
        df['A'] > 0,
        df.shift(365).rolling(10).B.mean(),
        df.shift(365).rolling(20).B.mean()
    )

The hard (slow) part of your code is this:

 df2=df[df.dte<lastyear].head(depth)

However, as long as the offset is always 365 days and your rows are daily (so 365 rows back is 365 days back), you can use code that is vectorized and much faster:

 df.shift(365).rolling(10).B.mean()

shift(365) replaces df.dte < lastyear, and rolling().mean() replaces head().mean(). It will be much faster and use less memory.
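To see what the vectorized expression computes, here is a minimal sketch that checks it against an explicit loop. The loop (`loop_value`, a hypothetical helper that is not part of the question) averages the last `depth` values of B ending 365 rows back, which is what `shift(365).rolling(depth).mean()` produces. Note that this is not identical to the question's code: `head(depth)` keeps the first rows of the filtered frame rather than the last, and `dte < lastyear` is a strict comparison, which may be part of why the results differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("AB"))
df["dte"] = pd.date_range("2014-09-01", periods=1000, freq="D")

# Vectorized: lag B by 365 rows once, compute both window lengths,
# then pick the right one per row based on the sign of A.
b_lagged = df["B"].shift(365)
vectorized = pd.Series(
    np.where(df["A"] > 0,
             b_lagged.rolling(10).mean(),
             b_lagged.rolling(20).mean()),
    index=df.index,
)

# Reference loop (illustrative only): mean of the last `depth` values of B
# ending at the row 365 positions earlier, or NaN if the window is incomplete.
def loop_value(i):
    depth = 10 if df.at[i, "A"] > 0 else 20
    end = i - 365                      # last row included in the window
    if end - depth + 1 < 0:
        return np.nan
    return df["B"].iloc[end - depth + 1 : end + 1].mean()

reference = pd.Series([loop_value(i) for i in df.index], index=df.index)

# The two agree up to floating-point rounding, including the leading NaNs.
print(np.allclose(vectorized, reference, equal_nan=True))
```

Computing both rolling means for every row looks wasteful, but each is a single pass over the column, so it is still far cheaper than filtering the frame once per row.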

And in fact, even if your dates are not completely regular, you can probably resample and get this approach to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than on rows (e.g. a shift of 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
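A rough sketch of the frequency-based shift, assuming the `dte` values are unique. `shifted_rolling_mean` is a hypothetical helper name, and the gap in the sample data is made up to show the irregular case. Calling `shift` with `freq` moves the index labels by calendar days instead of moving the data by rows, and `reindex` aligns the result back onto the original dates.

```python
import numpy as np
import pandas as pd

def shifted_rolling_mean(frame, window, days=365):
    """Rolling mean of B over `window` rows, looked up `days` calendar days back."""
    s = frame.set_index("dte")["B"]
    rolled = s.rolling(window).mean()        # row-based window
    shifted = rolled.shift(days, freq="D")   # moves the dates, not the rows
    # Align onto the original dates; a date with no row exactly
    # `days` earlier ends up as NaN.
    return shifted.reindex(s.index).to_numpy()

# Sample data with two missing days, so rows and days no longer line up.
dates = pd.date_range("2014-09-01", periods=800, freq="D")
df_gap = pd.DataFrame({"dte": dates, "B": np.arange(800, dtype=float)})
df_gap = df_gap.drop(index=[5, 6]).reset_index(drop=True)

df_gap["lagged_mean"] = shifted_rolling_mean(df_gap, window=3)
```

Whether a missing date should yield NaN or fall back to the nearest earlier row depends on the model; for the latter, something like `pd.merge_asof` might be a better fit.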

Chih-Hsu Jack Lin · Answer 2 · 2016-08-23T21:15:01+0000

I would try pandas.DataFrame.apply(func, axis=1):

    def calculateC2(row):
        if row.name > 365:  # row.name is the index of the row
            if row.A > 0:  # based on values has different cutoff length for trend prediction
                depth = 10
            else:
                depth = 20
            lastyear = row.dte - timedelta(days=365)
            df2 = df[df.dte < lastyear].B.head(depth)  # use last year same date data for basis of prediction
            print(row.name, np.mean(df2))  # uses WLS in my model but for simplification replace with mean

    df.apply(calculateC2, axis=1)
