Sklearn: Custom Cross Validation for Time Series Data

I am working on a machine learning problem. My dataset has a time series component, and I am using the well-known Python library scikit-learn (sklearn). The library provides many cross-validation iterators, but I don't know how to define a simple cross-validation scheme for time series. Here is an example of what I'm trying to get:

Suppose we have several periods (years), and we want to split our data set into several fragments as follows:

    data = [1, 2, 3, 4, 5, 6, 7]
    train: [1]                 test: [2] (or test: [2, 3, 4, 5, 6, 7])
    train: [1, 2]              test: [3] (or test: [3, 4, 5, 6, 7])
    train: [1, 2, 3]           test: [4] (or test: [4, 5, 6, 7])
    ...
    train: [1, 2, 3, 4, 5, 6]  test: [7]

I cannot figure out how to create such a cross-validation scheme with the sklearn tools. Perhaps I should use PredefinedSplit from sklearn.cross_validation, as follows:

    train_fraction = 0.8
    train_size = int(train_fraction * X_train.shape[0])
    validation_size = X_train.shape[0] - train_size
    cv_split = cross_validation.PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)

Result:

 train: [1, 2, 3, 4, 5] test: [6, 7] 

But still, it’s not as good as the previous data split.
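To make the limitation concrete, here is a minimal sketch of the same idea on the seven-sample toy data. It assumes a recent scikit-learn release, where PredefinedSplit is imported from sklearn.model_selection (the sklearn.cross_validation module was removed in later versions); the names n_samples and predefined_splits are illustrative only.

```python
from sklearn.model_selection import PredefinedSplit

n_samples = 7  # toy size matching the example data above (assumption)
train_fraction = 0.8
train_size = int(train_fraction * n_samples)
validation_size = n_samples - train_size

# -1 marks samples that never appear in a test fold;
# every other distinct value defines exactly one test fold.
test_fold = [-1] * train_size + [1] * validation_size
cv_split = PredefinedSplit(test_fold=test_fold)

# Collect the (train, test) index pairs as plain lists for inspection.
predefined_splits = [
    (train.tolist(), test.tolist()) for train, test in cv_split.split()
]
for train, test in predefined_splits:
    print("train:", train, "test:", test)
```

Because test_fold contains a single fold label, this produces one split only (zero-based indices 0 to 4 for training, 5 and 6 for testing), which is why it cannot reproduce the growing training window shown earlier.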

python scikit-learn cross-validation
2 answers

You can build the required cross-validation splits yourself, without a dedicated sklearn iterator, and pass them in through the cv argument. Here is an example:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.feature_selection import RFECV

    # Generate some data.
    N = 10
    X_train = np.random.randn(N, 3)
    y_train = np.random.randn(N)

    # Define the splits.
    idxs = np.arange(N)
    cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N)]

    # Create the RFE object and compute a cross-validated score.
    svr = SVR(kernel="linear")
    rfecv = RFECV(estimator=svr, step=1, cv=cv_splits)
    rfecv.fit(X_train, y_train)
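The list comprehension above tests on everything after the training window. A small generator can cover both variants from the question (a single next step, or all remaining steps). This is only a sketch; the function name expanding_window_splits and its single_step flag are hypothetical and not part of sklearn.

```python
import numpy as np

def expanding_window_splits(n_samples, single_step=True):
    """Yield (train_idx, test_idx) pairs for an expanding training window.

    single_step=True  -> test on the next sample only
    single_step=False -> test on all remaining samples
    """
    idxs = np.arange(n_samples)
    for i in range(1, n_samples):
        test = idxs[i:i + 1] if single_step else idxs[i:]
        yield idxs[:i], test

# Materialize the splits so they can be inspected or reused.
one_step = list(expanding_window_splits(4))
all_rest = list(expanding_window_splits(4, single_step=False))

for train, test in one_step:
    print("train:", train, "test:", test)
```

Wrapping the generator in list() keeps the splits reusable, so the same list can be passed as cv to several estimators or search objects.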

Since this question was asked, a time series splitter has been added to the library: http://scikit-learn.org/stable/modules/cross_validation.html#time-series-split

Example from the documentation:

    >>> import numpy as np
    >>> from sklearn.model_selection import TimeSeriesSplit
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> tscv = TimeSeriesSplit(n_splits=3)
    >>> print(tscv)
    TimeSeriesSplit(n_splits=3)
    >>> for train, test in tscv.split(X):
    ...     print("%s %s" % (train, test))
    [0 1 2] [3]
    [0 1 2 3] [4]
    [0 1 2 3 4] [5]
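To get the exact one-step-ahead scheme from the question, it should be enough to set n_splits to one less than the number of samples, so that each test fold holds a single sample. A minimal sketch, assuming the seven-element toy data from the question (the variable ts_splits is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.array([1, 2, 3, 4, 5, 6, 7])

# With n_splits = n_samples - 1, every test fold contains one sample
# and the training window grows by one sample per split.
tscv = TimeSeriesSplit(n_splits=len(data) - 1)

# Map the index arrays back to the data values for readability.
ts_splits = [
    (data[train].tolist(), data[test].tolist())
    for train, test in tscv.split(data)
]
for train, test in ts_splits:
    print("train:", train, "test:", test)
```

The variant that tests on all remaining samples is not what TimeSeriesSplit produces here, so for that case a hand-built list of index pairs remains the simpler route.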