Lack of intercepts in OLS regression models in Python statsmodels

I am forecasting with rolling OLS regression windows of 100 observations, for example, over the dataset found at this link ( https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk ), which has the following format.

        time   X    Y
    0.000543   0   10
    0.000575   0   10
    0.041324   1   10
    0.041331   2   10
    0.041336   3   10
    0.04134    4   10
    ...
    9.987735  55  239
    9.987739  56  239
    9.987744  57  239
    9.987749  58  239
    9.987938  59  239

The third column (Y) in my dataset is the true value - this is what I want to predict (estimate). I want to make a prediction of Y, i.e. predict the current value of Y according to the previous 3 values of X. For this, I have the following working Python script using statsmodels .

    #!/usr/bin/python -tt
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    df = pd.read_csv('estimated_pred.csv')
    df = df.dropna()  # drop NaNs in case there are any
    window = 100
    #print(df.index)  # to print the index
    df['a'] = None   # constant (intercept)
    df['b1'] = None  # beta1
    df['b2'] = None  # beta2
    for i in range(window, len(df)):
        temp = df.iloc[i-window:i, :]
        RollOLS = sm.OLS(temp.loc[:, 'Y'],
                         sm.add_constant(temp.loc[:, ['time', 'X']],
                                         has_constant='add')).fit()
        df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
        df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
        df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]

    # one-step-ahead predicted values, row by row
    df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']
    #print(df['predicted'])
    print(temp)

Which gives me sample output in the following format.

             time   X   Y     a           b1            b2  predicted
    0    0.000543   0  10  None         None          None        NaN
    1    0.000575   0  10  None         None          None        NaN
    2    0.041324   1  10  None         None          None        NaN
    3    0.041331   2  10  None         None          None        NaN
    4    0.041336   3  10  None         None          None        NaN
    ..        ...  ..  ..   ...          ...           ...        ...
    50   0.041340   4  10    10            0   1.55431e-15        NaN
    51   0.041345   5  10    10   1.7053e-13   7.77156e-16         10
    52   0.041350   6  10    10  1.74623e-09  -7.99361e-15         10
    53   0.041354   7  10    10  6.98492e-10  -6.21725e-15         10
    ..        ...  ..  ..   ...          ...           ...        ...
    509  0.160835  38  20    20  4.88944e-09  -1.15463e-14         20
    510  0.160839  39  20    20  1.86265e-09   5.32907e-15         20
    ..        ...  ..  ..   ...          ...           ...        ...

Finally, I want to include the mean squared error ( MSE ) for the entire forecast (an OLS regression analysis summary). For example, if we look at line 5, the value of X is 2 and the value of Y is 10. If the predicted value of Y in that row were, say, 6, the squared error for that row would be (10-6)^2 , and the MSE is the average of these over all forecast rows. sm.OLS returns an instance of the class <class 'statsmodels.regression.linear_model.OLS'> , and print(RollOLS.summary()) gives output like this:

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      Y   R-squared:                        -inf
    Model:                            OLS   Adj. R-squared:                   -inf
    Method:                 Least Squares   F-statistic:                    -48.50
    Date:                Tue, 04 Jul 2017   Prob (F-statistic):               1.00
    Time:                        22:19:18   Log-Likelihood:                 2359.7
    No. Observations:                 100   AIC:                            -4713.
    Df Residuals:                      97   BIC:                            -4706.
    Df Model:                           2
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|   [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const        239.0000   2.58e-09   9.26e+10      0.000     239.000   239.000
    time        4.547e-13   2.58e-10      0.002      0.999   -5.12e-10  5.13e-10
    X          -3.886e-16    1.1e-13     -0.004      0.997   -2.19e-13  2.19e-13
    ==============================================================================
    Omnibus:                       44.322   Durbin-Watson:                   0.000
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):               86.471
    Skew:                          -1.886   Prob(JB):                     1.67e-19
    Kurtosis:                       5.556   Cond. No.                     9.72e+04
    ==============================================================================

But the value of rsquared ( print(RollOLS.rsquared) ), for example, should be between 0 and 1 instead of -inf , and this seems to be a problem with a missing intercept . If we want to print the MSE , we do print(RollOLS.mse_model) , etc., in accordance with the documentation . How can I add the intercept and print the regression statistics with correct values, as with the predicted values? What am I doing wrong here? Or is there another way to do this using scikit-learn ?
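For clarity, the forecast MSE I have in mind would be computed over the predicted column like this (a sketch; rows before the first full window have no prediction and are skipped):

    # sketch: MSE over all one-step-ahead forecasts
    valid = df['predicted'].notna()
    mse = np.mean(np.square(df.loc[valid, 'Y'] - df.loc[valid, 'predicted']))
    print(mse)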

python numpy scikit-learn statsmodels
Jul 05 '17 at 10:34
2 answers

Short answer

The value of r^2 will be +/- inf if y remains unchanged within the regression window (100 observations in your case). You can find more detail below, but the intuition is that r^2 is the fraction of the variance of y explained by X : if y has zero variance, r^2 is simply not defined.

Possible solution: try using a longer window or resample Y and X so that Y does not remain constant for many consecutive observations.
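If you do keep the rolling regression, a minimal sketch of a guard inside the question's loop could look like this (assuming the same df and window ; windows where Y has no variance are skipped so that R^2 stays defined):

    for i in range(window, len(df)):
        temp = df.iloc[i-window:i, :]
        if temp['Y'].nunique() < 2:   # Y is constant in this window: R^2 would be undefined
            continue                  # skip this window (or enlarge it)
        RollOLS = sm.OLS(temp['Y'],
                         sm.add_constant(temp[['time', 'X']], has_constant='add')).fit()
        # ... store RollOLS.params as in the question ...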

Long answer

Honestly, looking at this, I believe this is the wrong kind of dataset for regression. Here is a simple plot of the data:

[plot of the data: Y plotted against time, sitting flat at a few discrete levels]

Does a linear combination of X and time explain Y? Hmm... it doesn't look plausible. Y behaves almost like a discrete variable, so you may want to look at logistic regression instead (see the sketch below).
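For illustration, here is a minimal, self-contained sketch of a logistic fit on synthetic binary data (not the question's dataset, which has several discrete levels rather than two):

    import numpy as np
    import statsmodels.api as sm

    # synthetic example: a binary outcome driven by a single regressor
    x = np.random.uniform(0, 20, 1000)
    p = 1.0 / (1.0 + np.exp(-(x - 10.0)))   # true success probability
    y = np.random.binomial(1, p)

    logit_res = sm.Logit(y, sm.add_constant(x)).fit()
    print(logit_res.summary())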

Coming to your question: R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia:

    R^2 = 1 - SS_res / SS_tot

    where SS_tot = sum_i (y_i - y_mean)^2 is the total sum of squares
    and SS_res = sum_i (y_i - f_i)^2 is the residual sum of squares of the fit f.

In your case, it is very likely that Y is constant for stretches of more than 100 observations, so it has zero variance within the window, which produces a division by zero and hence the inf .
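You can reproduce the symptom in isolation; a minimal sketch with a constant y over a 100-observation window:

    import numpy as np
    import statsmodels.api as sm

    y = np.full(100, 239.0)                # Y constant within the window
    X = sm.add_constant(np.arange(100.0))  # intercept + trend regressor

    res = sm.OLS(y, X).fit()
    print(res.rsquared)  # nan or -inf: the total sum of squares is zero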

Therefore, I am afraid the fix you should be looking for is not in the code: you should rethink the problem and the way the data are sampled.

Jul 05 '17 at 11:50

So, I prepared this small example so that you can visualize what Poisson regression can do.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    poi_model = sm.Poisson  # statsmodels' Poisson regression model

    x = np.random.uniform(0, 20, 1000)     # random regressor values
    s = np.random.poisson(x * 0.5, 1000)   # Poisson counts whose mean grows with x

    plt.bar(x, s)
    plt.show()

This generates random Poisson counts whose mean increases with x .

Now, fitting the Poisson regression to the data goes as follows:

    my_model = poi_model(endog=s, exog=x)  # note: wrap x in sm.add_constant(x) to include an intercept
    my_model = my_model.fit()
    print(my_model.summary())

The summary displays a number of statistics, but if you want to calculate the mean squared error, you can do this:

    preds = my_model.predict()
    mse = np.mean(np.square(preds - s))

If you want to predict new values, do the following:

    my_model.predict(exog=new_value)
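To tie this back to the question, a sketch of the same idea on the asker's file (assuming the time , X , Y columns of estimated_pred.csv ; whether a Poisson model is appropriate depends on how Y is actually generated):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv('estimated_pred.csv').dropna()

    exog = sm.add_constant(df[['time', 'X']])   # include the intercept explicitly
    pois_res = sm.Poisson(endog=df['Y'], exog=exog).fit()
    print(pois_res.summary())

    preds = pois_res.predict()
    mse = np.mean(np.square(preds - df['Y']))   # mean squared error of the fitted values
    print(mse)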
Jul 05 '17 at 18:01


