Cumulative OLS with Python Pandas

I am using Pandas 0.8.1, and at the moment I cannot change the version. If a newer version would help solve the problem below, please note it in a comment rather than an answer. Also, this is for a research replication project, so even though re-running the regression after appending just one new data point may be dumb (the data set is large), I still have to do it. Thanks!

In Pandas, there is a rolling option for the window_type argument of pandas.ols, but it seems implicit that this requires either choosing a window size or using the whole data sample as the default. I am looking instead to use all the data in a cumulative fashion.

I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using the data available from the minimum date up to the date at index i. So the window effectively grows by one at each iteration, all data from the earliest observation is used cumulatively, and no data ever drops out of the window.

I wrote a function (below) that works with apply to accomplish this, but it is unacceptably slow. Is there instead a way to use pandas.ols to perform this kind of cumulative regression directly?

Here are some details about my data. I have a pandas.DataFrame containing an identifier column, a date column, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group on the identifier, and then perform a cumulative regression, for every time period, of the left-hand-side and right-hand-side variables.
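For concreteness, a minimal frame with this shape might look like the sketch below (the names 'Id', 'Date', 'Y', and 'X' are placeholders, not my real columns):

    from datetime import datetime
    import numpy as np
    import pandas

    # Illustrative layout only; the real identifiers, dates, and values
    # are whatever the actual data set contains.
    df = pandas.DataFrame({
        'Id':   ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': [datetime(2012, m, 1) for m in (1, 2, 3)] * 2,
        'Y':    np.random.randn(6),  # left-hand-side variable
        'X':    np.random.randn(6),  # right-hand-side variable
    })
    df = df.sort_index(by=['Id', 'Date'])  # old-pandas sort; sort_values in modern versions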

Here is a function that I can use with apply within a groupby on the identifier:

    import numpy as np
    import pandas

    def cumulative_ols(data_frame,
                       lhs_column,
                       rhs_column,
                       date_column,
                       min_obs=60):
        beta_dict = {}
        for dt in data_frame[date_column].unique():
            # Restrict to all observations up to and including this date.
            cur_df = data_frame[data_frame[date_column] <= dt]
            obs_count = cur_df[lhs_column].notnull().sum()
            if min_obs <= obs_count:
                beta = pandas.ols(
                    y=cur_df[lhs_column],
                    x=cur_df[rhs_column],
                ).beta.ix['x']
            else:
                beta = np.NaN
            beta_dict[dt] = beta

        # Assemble the betas into a one-column DataFrame indexed by date.
        beta_df = pandas.DataFrame(pandas.Series(beta_dict, name="FactorBeta"))
        beta_df.index.name = date_column
        return beta_df
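For reference, I call it along these lines (a sketch, assuming the placeholder names from above; groupby's apply forwards the extra keyword arguments to the function):

    # Sketch of the invocation, using the placeholder frame from above.
    betas = df.groupby('Id').apply(
        cumulative_ols,
        lhs_column='Y',
        rhs_column='X',
        date_column='Date',
    )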
1 answer

Following the suggestions in the comments, I wrote my own function that can be used with apply and that relies on cumsum to accumulate all the individual terms needed to express the coefficient from a univariate OLS regression in a vectorized fashion.

    import numpy as np

    def cumulative_ols(data_frame,
                       lhs_column,
                       rhs_column,
                       date_column,
                       min_obs=60,
                       ):
        """
        Function to perform a cumulative OLS on a Pandas data frame.

        It is meant to be used with `apply` after grouping the data frame
        by categories and sorting by date, so that the regression below
        applies to the time series of a single category's data and the use
        of `cumsum` will work appropriately given sorted dates. It is also
        assumed that the date conventions of the left-hand-side and
        right-hand-side variables have been arranged by the user to match
        up with any lagging conventions needed.

        This OLS is implicitly univariate and relies on the simplification
        to the formula:

            Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
            Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
            beta     ~ Cov(x,y) / Var(x)

        and the code makes a further simplification by cancelling one
        factor of (1/n).

        Notes: one easy improvement is to change the date column to a
        generic sort column, since there is no special reason the
        regressions need to be time-series specific.
        """
        # Products and squares, with missing values zeroed so cumsum works.
        data_frame["xy"] = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
        data_frame["x2"] = (data_frame[rhs_column]**2).fillna(0.0)
        # Indicators for non-null observations, to get running counts.
        data_frame["yobs"] = data_frame[lhs_column].notnull().map(int)
        data_frame["xobs"] = data_frame[rhs_column].notnull().map(int)
        data_frame["cum_yobs"] = data_frame["yobs"].cumsum()
        data_frame["cum_xobs"] = data_frame["xobs"].cumsum()
        # Running sums of the terms in the covariance/variance formulas.
        data_frame["cumsum_xy"] = data_frame["xy"].cumsum()
        data_frame["cumsum_x2"] = data_frame["x2"].cumsum()
        data_frame["cumsum_x"] = data_frame[rhs_column].fillna(0.0).cumsum()
        data_frame["cumsum_y"] = data_frame[lhs_column].fillna(0.0).cumsum()
        data_frame["cum_cov"] = (data_frame["cumsum_xy"]
                                 - (1.0/data_frame["cum_yobs"])
                                 * data_frame["cumsum_x"]*data_frame["cumsum_y"])
        data_frame["cum_x_var"] = (data_frame["cumsum_x2"]
                                   - (1.0/data_frame["cum_xobs"])
                                   * (data_frame["cumsum_x"])**2)
        data_frame["FactorBeta"] = data_frame["cum_cov"]/data_frame["cum_x_var"]
        # Mask betas until the minimum observation count is reached.
        data_frame["FactorBeta"][data_frame["cum_yobs"] < min_obs] = np.NaN
        return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
    ### End cumulative_ols

In numerous tests, I checked that this matches the output of my previous function and the output of NumPy's linalg.lstsq. I have not done full timing tests, but anecdotally, in the cases I have worked with, it is around 50x faster.
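As a sketch of that check (again with the placeholder names from above, for a single group's frame g and one cutoff date dt), the spot comparison against NumPy looks roughly like:

    import numpy as np

    # Compare the beta at cutoff `dt` with a direct least-squares fit;
    # the design matrix includes an intercept column, as pandas.ols does
    # by default. (Newer NumPy prefers an explicit rcond=None argument.)
    cur = g[g['Date'] <= dt]
    mask = cur['Y'].notnull() & cur['X'].notnull()
    A = np.column_stack([cur['X'][mask].values, np.ones(mask.sum())])
    slope = np.linalg.lstsq(A, cur['Y'][mask].values)[0][0]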
