How to create a custom cross-validation generator in scikit-learn?

I have an imbalanced dataset, so I use a resampling strategy that I apply only to my training data. I would like to use scikit-learn classes such as GridSearchCV or cross_val_score to explore or cross-validate some parameters of my estimator (e.g. SVC). However, I see that you pass either the number of CV folds or a standard cross-validation generator.

I would like to create my own CV generator that takes the stratified 5 folds, resamples only the training data (4 folds), lets scikit-learn search the grid of my estimator's parameters, and scores using the remaining fold for validation.

Thanks in advance.

+10
4 answers

A cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.

So, for 10-fold cross-validation, your custom cross-validation generator should contain 10 elements, each of which contains a tuple with two elements:

  • An array of indices for the training subset for that run, covering 90% of your data.
  • An array of indices for the test subset for that run, covering the remaining 10% of the data.

I worked on a similar problem, for which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf, which has the column cvLabel holding the cross-validation fold labels. I construct the custom cross-validation iterator myCViterator as follows:

 myCViterator = []
 for i in range(nFolds):
     trainIndices = myDf[ myDf['cvLabel']!=i ].index.values.astype(int)
     testIndices = myDf[ myDf['cvLabel']==i ].index.values.astype(int)
     myCViterator.append( (trainIndices, testIndices) )
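Such a list of (train, test) index tuples can then be passed directly as the cv argument. A minimal usage sketch with a recent scikit-learn, assuming X and y are the feature matrix and targets built from myDf:

 from sklearn import svm
 from sklearn.model_selection import cross_val_score

 clf = svm.SVC(C=1)
 scores = cross_val_score(clf, X, y, cv=myCViterator)

Note that this assumes myDf has a default integer index, since the stored index values are used as positional indices into X and y.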
+12

Scikit-learn provides a workaround for this with its LabelKFold iterator:

LabelKFold is a variation of k-fold which ensures that the same label does not appear in both the testing and training sets. This is necessary, for example, if you obtained data from different subjects and want to avoid over-fitting (i.e., learning person-specific features) by testing and training on different subjects.

To use this iterator in the oversampling case, first create a column in your dataframe (e.g. cv_label) that stores the index value of each row:

 df['cv_label'] = df.index 

You can then apply your oversampling, making sure the cv_label column is copied along with each oversampled row. The column will then contain duplicate values for the oversampled data. You can keep a separate series or list of these labels for further processing:

 cv_labels = df['cv_label'] 

Keep in mind that you will need to remove this column from your data frame before running your cross-validator / classifier.
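Putting these steps together, a rough sketch assuming a hypothetical binary target column and naive random oversampling (substitute your own resampling strategy):

 import pandas as pd

 # duplicate minority-class rows; cv_label travels along with each copied row
 minority = df[df['target'] == 1]
 df = pd.concat([df, minority.sample(frac=1.0, replace=True)])

 cv_labels = df['cv_label']                        # duplicated labels for duplicated rows
 features = df.drop(['cv_label', 'target'], axis=1)
 labels = df['target']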

After splitting your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run whichever cross-validation function you need:

 from sklearn import svm, cross_validation
 from sklearn.cross_validation import LabelKFold

 clf = svm.SVC(C=1)
 lkf = LabelKFold(cv_labels, n_folds=5)
 predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
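Note that LabelKFold was removed in later scikit-learn releases; GroupKFold in sklearn.model_selection is the equivalent, taking the labels as a groups argument at split time. A rough sketch of the same call in the newer API:

 from sklearn.svm import SVC
 from sklearn.model_selection import GroupKFold, cross_val_predict

 clf = SVC(C=1)
 predicted = cross_val_predict(clf, features, labels,
                               groups=cv_labels, cv=GroupKFold(n_splits=5))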
+8

I had a similar problem and this quick hack works for me:

 import numpy as np
 from sklearn.model_selection import StratifiedKFold

 class UpsampleStratifiedKFold:
     def __init__(self, n_splits=3):
         self.n_splits = n_splits

     def split(self, X, y, groups=None):
         for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
             # positions of negative and positive samples within the training fold
             nix = np.where(y[rx]==0)[0]
             pix = np.where(y[rx]==1)[0]
             # upsample positives with replacement to match the number of negatives
             pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
             ix = np.append(nix, pixu)
             rxm = rx[ix]
             yield rxm, tx

     def get_n_splits(self, X, y, groups=None):
         return self.n_splits

This upsamples (with replacement) the positive class, assumed to be the minority, to produce a balanced (k-1)-fold training set, but leaves the k-th test fold imbalanced. This works well with sklearn.model_selection.GridSearchCV and other similar classes that require a CV generator.
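A minimal usage sketch, assuming X and y are numpy arrays with binary 0/1 labels (the parameter grid is just an example):

 from sklearn.model_selection import GridSearchCV
 from sklearn.svm import SVC

 search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]},
                       cv=UpsampleStratifiedKFold(n_splits=5))
 search.fit(X, y)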

+5
 import numpy as np

 class own_custom_CrossValidator:  # like those in sklearn/model_selection/_split.py
     def __init__(self):  # could take e.g. coordinates, meter
         pass
         # self.coordinates = coordinates
         # self.meter = meter

     def split(self, X, y=None, groups=None):
         # signature kept for compatibility with cross_val_predict, cross_val_score
         for i in range(len(X)):
             indices = np.arange(len(X))
             # degenerate example: yields the full index range as both train and test
             yield indices, indices
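Such an object can be used by materializing its split generator into an iterable of (train, test) tuples; a hypothetical call (clf, X, y assumed):

 from sklearn.model_selection import cross_val_score

 scores = cross_val_score(clf, X, y, cv=list(own_custom_CrossValidator().split(X)))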
-1
