Pandas Time Series Linear Regression

Question


I would like to run a regression with a time series as the predictor, and I am trying to follow the answer given in this SO question (OLS with pandas: datetime index as a predictor), but as far as I can tell it no longer works.

Am I missing something or is there a new way to do this?

 import pandas as pd
 rng = pd.date_range('1/1/2011', periods=4, freq='H')
 s = pd.Series(range(4), index = rng)
 z = s.reset_index()
 pd.ols(x=z["index"], y=z[0])

I get this error. The error is self-explanatory, but I wonder what I am missing in reproducing a solution that previously worked.

 TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]


python pandas

canyon289 May 24, '15 at 15:59


1 answer

John · Accepted Answer · 2015-05-25T05:47:38+0000

I'm not sure why pd.ols is so picky (it seems to me that you followed the example correctly). I suspect this is due to changes in the way pandas processes or stores datetime indexes, but I'm too lazy to explore further. In any case, since your datetime variable differs only in the hour, you can simply extract the hour using the dt accessor:

 pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])

However, this gives you an r-squared of 1, since y is an exact linear function of x and the model includes an intercept. You can change range to np.random.randn and then get something that looks like ordinary regression results.

 In [6]: z = pd.Series(np.random.randn(4), index = rng).reset_index()
         pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
 Out[6]:
 -------------------------Summary of Regression Analysis-------------------------

 Formula: Y ~ <x> + <intercept>

 Number of Observations:         4
 Number of Degrees of Freedom:   2

 R-squared:         0.7743
 Adj R-squared:     0.6615

 Rmse:              0.5156

 F-stat (1, 2):     6.8626, p-value:     0.1200

 Degrees of Freedom: model 1, resid 2

 -----------------------Summary of Estimated Coefficients------------------------
       Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
 --------------------------------------------------------------------------------
              x    -0.6040     0.2306      -2.62     0.1200    -1.0560    -0.1521
      intercept     0.2915     0.4314       0.68     0.5689    -0.5540     1.1370
 ---------------------------------End of Summary---------------------------------
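For readers on a pandas release without pd.ols (it was removed in later versions), the dt.hour idea can be sketched with a plain least-squares fit. This is only a minimal sketch: np.polyfit stands in for pd.ols, the helper name fit_line is made up for illustration, and the hourly range is built from a pd.Timedelta rather than the 'H' alias to sidestep alias changes between pandas versions.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: four hourly timestamps, values 0..3.
rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
z = pd.Series(range(4), index=rng).reset_index()


def fit_line(x, y):
    """Hypothetical helper: least-squares line y = slope * x + intercept, plus R^2."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    r_squared = 1.0 - residuals.var() / y.var()
    return slope, intercept, r_squared


# Use the hour of day as the numeric predictor, as in the answer.
x = z["index"].dt.hour.to_numpy(dtype=float)
y = z[0].to_numpy(dtype=float)

# y is an exact linear function of x here, so the fit is perfect.
slope, intercept, r_squared = fit_line(x, y)

# With random values the fit is no longer perfect (seed chosen arbitrarily).
y_noisy = np.random.default_rng(0).standard_normal(4)
_, _, r_squared_noisy = fit_line(x, y_noisy)
```

Note that dt.hour only works as a predictor while all timestamps fall within a single day; across days the hour wraps around to 0.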

Alternatively, you can convert the index to an integer, although I found that this doesn't work very well (I assume the integers are nanoseconds since the epoch or something like that, so they are very large and cause precision issues). Converting to an integer and dividing by a trillion or so did work, though, and gave essentially the same results as using dt.hour (i.e. the same r-squared):

 pd.ols(x=pd.to_datetime(z["index"]).astype(int)/1e12, y=z[0])
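A related way to keep the predictor small is to measure time elapsed since the first observation instead of rescaling the raw integers. This is a different route from dividing by 1e12, shown here only as a sketch; np.polyfit and np.corrcoef stand in for pd.ols, and the helper name fit is hypothetical. It also illustrates why the choice of unit does not change the r-squared: rescaling x only rescales the slope.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
values = np.random.default_rng(0).standard_normal(4)  # arbitrary seed
z = pd.Series(values, index=rng).reset_index()

# Elapsed time since the first timestamp, in two different units.
elapsed = z["index"] - z["index"].iloc[0]
x_hours = (elapsed / pd.Timedelta(hours=1)).to_numpy()
x_seconds = elapsed.dt.total_seconds().to_numpy()
y = z[0].to_numpy()


def fit(x, y):
    """Hypothetical helper: slope, intercept, and R^2 of a one-variable fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2  # valid for a single predictor
    return slope, intercept, r_squared


slope_h, intercept_h, r2_h = fit(x_hours, y)
slope_s, intercept_s, r2_s = fit(x_seconds, y)
```

The slope per hour is 3600 times the slope per second, while the intercept and r-squared agree up to floating-point error.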

Error Message Source

FWIW, it looks like this error message comes from the following:

 pd.to_datetime(z["index"]).astype(float)

Although a fairly obvious workaround is the following:

 pd.to_datetime(z["index"]).astype(int).astype(float)
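The two casts above can be tried side by side. This is a small sketch, not a statement about every pandas version: the exact exception type and message for the direct cast may differ between releases, so the code catches both TypeError and ValueError, and it uses the explicit "int64" dtype rather than int to avoid platform-dependent integer widths.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
stamps = pd.Series(rng)

# Direct datetime -> float cast: expected to be rejected by pandas.
try:
    stamps.astype(float)
    direct_cast_failed = False
except (TypeError, ValueError) as exc:
    direct_cast_failed = True
    print("direct cast failed:", exc)

# Two-step cast: datetime -> 64-bit integer -> float.
as_float = stamps.astype("int64").astype(float)
```

The resulting floats are very large epoch-based counts, which is why rescaling or subtracting the first timestamp before fitting is advisable.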
