I am trying to solve a machine learning problem. I have a dataset with a time-series component. For this problem I am using the well-known Python library sklearn. The library provides many cross-validation iterators, as well as several iterators for defining a custom cross-validation. The problem is that I don't know how to define a simple cross-validation for time series. Here is an example of what I'm trying to get:
Suppose we have several periods (years), and we want to split our data set into several fragments as follows:
data = [1, 2, 3, 4, 5, 6, 7]
train: [1]                  test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2]               test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3]            test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6]   test: [7]
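To make the target concrete, here is a minimal plain-Python sketch of the split I have in mind. It does not use sklearn at all; the function name `expanding_window_splits` and the `single_step` flag are hypothetical names I made up for illustration.

```python
def expanding_window_splits(data, single_step=True):
    """Yield (train, test) pairs; the training window grows by one period each time.

    Hypothetical helper for illustration only, not part of sklearn.
    """
    for i in range(1, len(data)):
        train = data[:i]
        # Test on the next period only, or on all remaining periods.
        test = data[i:i + 1] if single_step else data[i:]
        yield train, test


data = [1, 2, 3, 4, 5, 6, 7]
splits = list(expanding_window_splits(data))
for train, test in splits:
    print("train:", train, "test:", test)
```

Passing `single_step=False` gives the alternative variant, where each test set contains every period after the training window.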
I cannot figure out how to create such a cross-validation using the sklearn tools. I could probably use PredefinedSplit from sklearn.cross_validation as follows:
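from sklearn import cross_validation

train_fraction = 0.8
train_size = int(train_fraction * X_train.shape[0])
validation_size = X_train.shape[0] - train_size
# -1 keeps a sample in the training set for every split; 1 assigns it to the single test fold
cv_split = cross_validation.PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)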
Result:
train: [1, 2, 3, 4, 5] test: [6, 7]
But still, it's not as good as the previous data split.
python scikit-learn cross-validation
Demyanov