I am using Pandas 0.8.1, and at the moment I cannot change the version. If a newer version would help with the problem below, please mention it in a comment rather than an answer. Also, this is for a research replication project, so although re-running a regression after adding just one new data point may be wasteful (if the data set is large), I still have to do it. Thanks!
In Pandas, there is a rolling option for the window_type argument of pandas.ols, but it seems to require either choosing a window size or using the entire data sample as the default. Instead, I want to use all the data in a cumulative way.
I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using the data from the minimum date up to and including the date at index i. Thus the window effectively grows by one at each iteration, all data is used cumulatively from the earliest observation, and no data ever falls out of the window.
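To pin down what "cumulative" means here, the following is a minimal sketch, not pandas.ols itself. It assumes a single regressor with an intercept and no missing values; the helper name expanding_slopes and the toy data are hypothetical. It keeps running sums so that the slope at index i uses every observation from the first one through i.

```python
def expanding_slopes(xs, ys, min_obs=2):
    """Slope of y = a + b*x fitted on observations 0..i, for every i.

    Hypothetical helper: running sums make each step O(1) instead of
    refitting the whole window from scratch.
    """
    n = sx = sy = sxx = sxy = 0.0
    slopes = []
    for x, y in zip(xs, ys):
        # Grow the window by exactly one observation; nothing is ever dropped.
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
        denom = n * sxx - sx * sx
        if n < min_obs or denom == 0:
            # Too few points (or a constant regressor) to identify a slope.
            slopes.append(float("nan"))
        else:
            slopes.append((n * sxy - sx * sy) / denom)
    return slopes


# Toy data lying exactly on y = 1 + 2*x, so each fitted slope should be 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
slopes = expanding_slopes(xs, ys)
```

Whether this agrees with pandas.ols on 0.8.1 to the precision a replication needs is untested; it only illustrates the window I am after.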
I wrote a function (below) that works with apply to accomplish this, but it is unacceptably slow. Instead, is there a way to use pandas.ols to directly perform this kind of cumulative regression?
Here are some details about my data. I have a pandas.DataFrame containing an identifier column, a date column, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group by the identifier, and then, within each group, perform a cumulative regression of the left-hand-side variable on the right-hand-side variable for every time period.
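As a rough illustration of that layout, here is a small sketch with made-up data. The column names identifier, date, lhs and rhs are hypothetical stand-ins for my real ones, and the loop only records which rows each cumulative window would contain; it does not run any regression.

```python
import pandas

# Hypothetical toy frame with the four roles described above.
df = pandas.DataFrame({
    "identifier": ["A", "A", "A", "B", "B", "B"],
    "date": pandas.to_datetime(
        ["2012-01-31", "2012-02-29", "2012-03-31"] * 2
    ),
    "lhs": [2.0, 1.0, 4.0, 1.0, 3.0, 5.0],
    "rhs": [1.0, 2.0, 3.0, 0.0, 1.0, 2.0],
})

# For each identifier, every date defines one window: all of that
# identifier's rows dated on or before it.
window_sizes = []
for ident, group in df.groupby("identifier"):
    for dt in group["date"]:
        window = group[group["date"] <= dt]
        window_sizes.append((ident, dt.strftime("%Y-%m-%d"), len(window)))
```

Each identifier therefore gets one regression per date, with the number of rows in the window growing by one each time.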
Here is the function that I can use with apply on the object grouped by identifier:
    def cumulative_ols(data_frame, lhs_column, rhs_column, date_column, min_obs=60):
        beta_dict = {}
        for dt in data_frame[date_column].unique():
            cur_df = data_frame[data_frame[date_column] <= dt]
            obs_count = cur_df[lhs_column].notnull().sum()
            if min_obs <= obs_count:
                beta = pandas.ols(
                    y=cur_df[lhs_column],
                    x=cur_df[rhs_column],
                ).beta.ix['x']