TL;DR
The way you are calling predict returns a linear forecast in terms of the differenced endogenous variable, not a forecast of the levels of the original endogenous variable.
To change this behavior, you must call the predict method with typ='levels':
preds = fit.predict(1, 30, typ='levels')
See the documentation for ARIMAResults.predict for details.
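To make the difference concrete, here is a minimal sketch (assuming the fit object built in the step-by-step section below) contrasting the default output with the typ='levels' output:

# Minimal sketch (assumes the `fit` object built in the steps below).
# The default typ='linear' forecasts the differenced series D.count,
# while typ='levels' forecasts the original series count.
diff_preds = fit.predict(1, 30)                 # forecasts of the first differences
level_preds = fit.predict(1, 30, typ='levels')  # forecasts of the levels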
Step by step
Dataset
We load the data you provided in your MCVE:
import io
import pandas as pd

raw = io.StringIO("""date        count
2001-11-01  0.998543
2001-11-02  1.914526
2001-11-03  3.057407
2001-11-04  4.044301
2001-11-05  4.952441
2001-11-06  6.002932
2001-11-07  6.930134
2001-11-08  8.011137
2001-11-09  9.040393
2001-11-10  10.097007
2001-11-11  11.063742
2001-11-12  12.051951
2001-11-13  13.062637
2001-11-14  14.086016
2001-11-15  15.096826
2001-11-16  15.944886
2001-11-17  17.027107
2001-11-18  17.930240
2001-11-19  18.984202
2001-11-20  19.971603""")

data = pd.read_fwf(raw, parse_dates=['date'], index_col='date')
As expected, the data is autocorrelated:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(data)
Model and training
We create an ARIMA model object for a given order (p, d, q) and train it on our data using the fit method:
from statsmodels.tsa.arima_model import ARIMA

order = (2, 1, 2)
model = ARIMA(data, order, freq='D')
fit = model.fit()
It returns an ARIMAResults object, which is what we are interested in. We can check the quality of our fit:
fit.summary()

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.count   No. Observations:                   19
Model:                 ARIMA(2, 1, 2)   Log Likelihood                  25.395
Method:                       css-mle   S.D. of innovations              0.059
Date:                Fri, 18 Jan 2019   AIC                            -38.790
Time:                        07:54:36   BIC                            -33.123
Sample:                    11-02-2001   HQIC                           -37.831
                         - 11-20-2001
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             1.0001      0.014     73.731      0.000       0.973       1.027
ar.L1.D.count    -0.3971      0.295     -1.346      0.200      -0.975       0.181
ar.L2.D.count    -0.6571      0.230     -2.851      0.013      -1.109      -0.205
ma.L1.D.count     0.0892      0.208      0.429      0.674      -0.318       0.496
ma.L2.D.count     1.0000      0.640      1.563      0.140      -0.254       2.254
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -0.3022           -1.1961j            1.2336           -0.2894
AR.2           -0.3022           +1.1961j            1.2336            0.2894
MA.1           -0.0446           -0.9990j            1.0000           -0.2571
MA.2           -0.0446           +0.9990j            1.0000            0.2571
-----------------------------------------------------------------------------
And we can roughly check how the residuals are distributed:
residuals = pd.DataFrame(fit.resid, columns=['residuals'])
axe = residuals.plot(kind='kde')
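If you also want a numeric summary in addition to the density plot, a small sketch using the same residuals DataFrame is:

# Summary statistics of the residuals; the mean should be close to zero
residuals.describe()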
Forecasting
If we are satisfied with our model, we can then predict some data, either in-sample or out-of-sample.
This can be done with the predict method, which by default returns the differenced endogenous variable rather than the endogenous variable itself. To change this behavior, we must specify typ='levels':
preds = fit.predict(1, 30, typ='levels')
preds
Our forecasts are now expressed in levels, on the same scale as our training data.
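To check this visually, one possible sketch (not part of the original steps, and assuming matplotlib is available) is to plot the forecasts on top of the training data:

import matplotlib.pyplot as plt

# Overlay the level forecasts on the observed series
ax = data.plot()
preds.plot(ax=ax, style='--', label='forecast')
ax.legend()
plt.show()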
In addition, if we are interested in confidence intervals, we can use the forecast method.
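For example (a sketch with an arbitrary horizon of 10 steps), forecast returns the point forecasts together with their standard errors and confidence intervals:

# Forecast 10 steps beyond the end of the sample;
# returns point forecasts, standard errors and 95% confidence intervals
forecast, stderr, conf_int = fit.forecast(steps=10)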
String argument
It is also possible to pass strings (always use the ISO-8601 format if you want to avoid problems) or datetime objects to predict:
preds = fit.predict("2001-11-02", "2001-12-15", typ='levels')
It works as expected on StatsModels 0.9.0:
import statsmodels as sm
sm.__version__