Use scikit-learn to perform linear regression on a pandas time series

I am trying to do a simple linear regression on a pandas DataFrame using scikit-learn's linear regressor. My data is a time series, and the DataFrame has a datetime index:

                value
    2007-01-01  0.771305
    2007-02-01  0.256628
    2008-01-01  0.670920
    2008-02-01  0.098047
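If it helps to reproduce this, a frame like the one above can be rebuilt with something like the following (the column name value and the variable name data match the code further down; the numbers are just the ones shown):

    import pandas as pd

    # rebuild the sample frame: a DatetimeIndex and a single float column
    data = pd.DataFrame(
        {'value': [0.771305, 0.256628, 0.670920, 0.098047]},
        index=pd.to_datetime(['2007-01-01', '2007-02-01',
                              '2008-01-01', '2008-02-01'])
    )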

Doing something as simple as

    from sklearn import linear_model

    lr = linear_model.LinearRegression()
    lr.fit(data.index, data['value'])

does not work:

 float() argument must be a string or a number 

So I tried to create a new date column and convert it explicitly:

    data['date'] = data.index
    data['date'] = pd.to_datetime(data['date'])
    lr.fit(data['date'], data['value'])

but now I get:

 ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). 

So the regression cannot handle datetimes directly. I have seen many ways to convert integers to datetimes, but I could not find a way to convert a datetime back to an integer, for example.

What is the right way to do this?

PS: I am interested in using scikit-learn specifically because I plan to do more things with it, so statsmodels is not what I am after.

1 answer

You probably want something like the number of days since the start of the series as your predictor. Assuming everything is sorted:

    In [36]: X = (df.index - df.index[0]).days.reshape(-1, 1)

    In [37]: y = df['value'].values

    In [38]: linear_model.LinearRegression().fit(X, y)
    Out[38]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The exact units you use for the predictor do not really matter; they could be days or months. The coefficients and their interpretation will change accordingly, but you end up with the same fit. Also note that we need reshape(-1, 1) so that X is in the 2-D shape scikit-learn expects.
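Putting it together, here is a minimal end-to-end sketch assuming the frame from the question (the df, value, and new_date names are only illustrative, and the prediction at the end just shows how a new date is converted the same way as the training data):

    import pandas as pd
    from sklearn import linear_model

    # sample frame with a DatetimeIndex, as in the question
    df = pd.DataFrame(
        {'value': [0.771305, 0.256628, 0.670920, 0.098047]},
        index=pd.to_datetime(['2007-01-01', '2007-02-01',
                              '2008-01-01', '2008-02-01'])
    )

    # predictor: whole days elapsed since the first timestamp, as a 2-D column
    X = (df.index - df.index[0]).days.values.reshape(-1, 1)
    y = df['value'].values

    lr = linear_model.LinearRegression()
    lr.fit(X, y)

    # the coefficient is the estimated change in 'value' per day
    print(lr.coef_, lr.intercept_)

    # to predict for a new date, convert it the same way as the training data
    new_date = pd.Timestamp('2009-01-01')
    print(lr.predict([[(new_date - df.index[0]).days]]))

Rescaling the predictor (for example dividing by 30 to get rough months) only rescales the coefficient; the fitted values come out the same.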

