Use scikit-learn to perform linear regression on a pandas time series

I am trying to do a simple linear regression on a pandas DataFrame using scikit-learn's linear regressor. My data is a time series, and the DataFrame has a datetime index:

                value
    2007-01-01  0.771305
    2007-02-01  0.256628
    2008-01-01  0.670920
    2008-02-01  0.098047
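If it helps to reproduce this, a frame like the one above can be rebuilt with something like the following (the column name value and the variable name data match the code further down; the numbers are just the ones shown):

    import pandas as pd

    # rebuild the sample frame: a DatetimeIndex and a single float column
    data = pd.DataFrame(
        {'value': [0.771305, 0.256628, 0.670920, 0.098047]},
        index=pd.to_datetime(['2007-01-01', '2007-02-01',
                              '2008-01-01', '2008-02-01'])
    )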

Doing something as simple as

    from sklearn import linear_model

    lr = linear_model.LinearRegression()
    lr.fit(data.index, data['value'])

does not work:

 float() argument must be a string or a number 

So I tried to create a new date column and convert it explicitly:

    data['date'] = data.index
    data['date'] = pd.to_datetime(data['date'])
    lr.fit(data['date'], data['value'])

but now I get:

 ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). 

So the regression cannot handle datetimes directly. I have seen many ways to convert integers to datetimes, but I could not find a way to convert a datetime back to an integer, for example.

What is the right way to do this?

PS: I am interested in using scikit-learn specifically because I plan to do more things with it, so statsmodels is not what I am after.

1 answer

You probably want something like the number of days since the start of the series as your predictor. Assuming everything is sorted:

    In [36]: X = (df.index - df.index[0]).days.reshape(-1, 1)

    In [37]: y = df['value'].values

    In [38]: linear_model.LinearRegression().fit(X, y)
    Out[38]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The exact units you use for the predictor do not really matter; they could be days or months. The coefficients and their interpretation will change accordingly, but you end up with the same fit. Also note that we need reshape(-1, 1) so that X is in the 2-D shape scikit-learn expects.
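Putting it together, here is a minimal end-to-end sketch assuming the frame from the question (the df, value, and new_date names are only illustrative, and the prediction at the end just shows how a new date is converted the same way as the training data):

    import pandas as pd
    from sklearn import linear_model

    # sample frame with a DatetimeIndex, as in the question
    df = pd.DataFrame(
        {'value': [0.771305, 0.256628, 0.670920, 0.098047]},
        index=pd.to_datetime(['2007-01-01', '2007-02-01',
                              '2008-01-01', '2008-02-01'])
    )

    # predictor: whole days elapsed since the first timestamp, as a 2-D column
    X = (df.index - df.index[0]).days.values.reshape(-1, 1)
    y = df['value'].values

    lr = linear_model.LinearRegression()
    lr.fit(X, y)

    # the coefficient is the estimated change in 'value' per day
    print(lr.coef_, lr.intercept_)

    # to predict for a new date, convert it the same way as the training data
    new_date = pd.Timestamp('2009-01-01')
    print(lr.predict([[(new_date - df.index[0]).days]]))

Rescaling the predictor (for example dividing by 30 to get rough months) only rescales the coefficient; the fitted values come out the same.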

