Python - OLS Regression Estimation with a Rolling Window

For my assessment, I have a dataset available at this link ( https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk ) in the following format. The third column (Y) in my dataset is my true value - this is what I want to predict (evaluate).

    time        X    Y
    0.000543    0    10
    0.000575    0    10
    0.041324    1    10
    0.041331    2    10
    0.041336    3    10
    0.04134     4    10
    ...
    9.987735    55   239
    9.987739    56   239
    9.987744    57   239
    9.987749    58   239
    9.987938    59   239

I want to do a rolling OLS regression estimation with, for example, a window of 5, and I tried it with the following script.

    # /usr/bin/python -tt
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv('estimated_pred.csv')

    model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                                   window_type='rolling', window=5, intercept=True)
    df['Y_hat'] = model.y_predict

    print(df['Y_hat'])
    print(model.summary)
    df.plot.scatter(x='X', y='Y', s=0.1)

A summary of the regression analysis is shown below.

    -------------------------Summary of Regression Analysis-------------------------
    Formula: Y ~ <X> + <intercept>
    Number of Observations:         5
    Number of Degrees of Freedom:   2
    R-squared:        -inf
    Adj R-squared:    -inf
    Rmse:           0.0000
    F-stat (1, 3):     nan, p-value:     nan
    Degrees of Freedom: model 1, resid 3
    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err                 t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 X     0.0000     0.0000                   1.97     0.1429     0.0000     0.0000
         intercept   239.0000     0.0000   14567091934632472.00     0.0000   239.0000   239.0000
    ---------------------------------End of Summary---------------------------------


I want to make a one-step-ahead prediction of Y at t+1 (i.e. to predict the next value of Y from the previous values, p(Y)t+1), and also to compute the mean squared error (MSE) of the prediction. For example, if we look at line 5, the value of X is 2 and the value of Y is 10; if the predicted value p(Y)t+1 is 6, then the squared error for that row is (10-6)^2 = 16. How can I do this with statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I cannot find a reference for it?

1 answer

Here is a brief outline of how to do a rolling OLS with statsmodels and apply it to your data; just use df = pd.read_csv('estimated_pred.csv') instead of my randomly generated df:

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    # random data
    #df = pd.DataFrame(np.random.normal(size=(500, 3)), columns=['time', 'X', 'Y'])
    df = pd.read_csv('estimated_pred.csv')
    df = df.dropna()  # drop rows with NaNs

    window = 5
    df['a'] = None   # constant
    df['b1'] = None  # beta1
    df['b2'] = None  # beta2

    for i in range(window, len(df)):
        temp = df.iloc[i-window:i, :]
        RollOLS = sm.OLS(temp.loc[:, 'Y'], sm.add_constant(temp.loc[:, ['time', 'X']])).fit()
        df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
        df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
        df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]

    # The following line gives the predicted value in each row, using the PRIOR row's estimated parameters
    df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']
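
To get the mean squared error of these one-step-ahead predictions (the (10-6)^2 part of the question), here is a minimal sketch, assuming the df and the 'predicted' column produced by the script above (the 'squared_error' column name is just illustrative):

    # Per-row squared error of the one-step-ahead prediction against the true Y.
    # Rows before the first full window have no prediction (NaN) and are skipped.
    df['squared_error'] = (df['Y'] - df['predicted'].astype(float)) ** 2
    mse = df['squared_error'].mean()  # .mean() ignores the NaN rows
    print(mse)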

I keep the constant and the betas in separate columns, but there are several ways to get to the prediction: you can use the fitted RollOLS model object and its .predict() method, or do the multiplication yourself, as in the final line above (which is easier in this case because the number of variables is fixed and known, so you can do simple column arithmetic).

Making predictions with statsmodels along the way would look like this:

    predict_x = np.random.normal(size=(20, 2))
    RollOLS.predict(sm.add_constant(predict_x))

But keep in mind that if you run the above after the loop has finished, the predictions will only use the model fitted on the last window. If you want predictions from the other models, you need to save them as you go or predict inside the for loop. Note that you can also get the in-sample fitted values with RollOLS.fittedvalues, so if you want the fitted output you can save the last element of RollOLS.fittedvalues at each iteration of the loop.
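
For instance, a minimal sketch of that last idea, reusing the loop from the script above (the 'fitted' column name is illustrative):

    df['fitted'] = None  # last in-sample fitted value of each window

    for i in range(window, len(df)):
        temp = df.iloc[i-window:i, :]
        RollOLS = sm.OLS(temp.loc[:, 'Y'], sm.add_constant(temp.loc[:, ['time', 'X']])).fit()
        # ... store RollOLS.params in 'a', 'b1', 'b2' as before ...
        # .iloc[-1] picks the fitted value for the most recent row of the window
        df.iloc[i, df.columns.get_loc('fitted')] = RollOLS.fittedvalues.iloc[-1]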


To show what the output looks like on my randomly generated data, here is the tail of my df after running the regression loop:

             time         X         Y           a         b1         b2
    495  0.662463  0.771971  0.643008  -0.0235751   0.037875  0.0907694
    496 -0.127879  1.293141  0.404959  0.00314073  0.0441054   0.113387
    497 -0.006581 -0.824247  0.226653   0.0105847  0.0439867   0.118228
    498  1.870858  0.920964  0.571535   0.0123463  0.0428359    0.11598
    499  0.724296  0.537296 -0.411965  0.00104044   0.055003   0.118953
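
As a side note, newer versions of statsmodels (0.11 and later) include a built-in rolling regression that can stand in for the removed pd.stats.ols.MovingOLS. A minimal sketch, assuming the same df with 'time', 'X' and 'Y' columns:

    from statsmodels.regression.rolling import RollingOLS
    import statsmodels.api as sm

    exog = sm.add_constant(df[['time', 'X']])           # constant + regressors
    roll = RollingOLS(df['Y'], exog, window=5).fit()    # fits one model per 5-row window
    print(roll.params.tail())                           # one row of (const, time, X) estimates per window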
