Scikit-Learn provides a workaround for this with its LabelKFold iterator:
LabelKFold is a variation of k-fold which ensures that the same label does not appear in both the testing and training sets. This is necessary, for example, if you obtained data from different subjects and want to avoid overfitting (i.e., learning the characteristics of a particular person) by testing and training on different subjects.
To use this iterator with oversampling, first create a column in your data frame (e.g. cv_label ) that stores the index value of each row:
df['cv_label'] = df.index
You can then apply your oversampling, making sure the cv_label column is copied along with the rest of each row (a sketch follows below). After oversampling, this column will contain duplicate values for the duplicated rows. You can create a separate series or list of these labels for further processing:
cv_labels = df['cv_label']
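For concreteness, here is a minimal sketch of a naive random-oversampling step that preserves cv_label; it would run before the cv_labels line above. The target column name, minority class, and sampling amount are illustrative assumptions, not part of the original answer.

import pandas as pd

# Assumptions for illustration: 'target' is the class column and class 1
# is the minority class being oversampled.
minority = df[df['target'] == 1]

# Sampling whole rows with replacement copies every column, including
# cv_label, so each duplicate keeps the index value of its source row.
extra = minority.sample(n=len(minority), replace=True, random_state=0)
df = pd.concat([df, extra], ignore_index=True)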
Keep in mind that you will need to remove this column from your data frame before running your cross-validator / classifier.
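A short sketch of that cleanup together with the feature/label split (the 'target' column name is again an assumption carried over from the example above):

# Separate the class labels and drop both helper columns from the
# feature matrix before cross-validation.
labels = df['target']
features = df.drop(['cv_label', 'target'], axis=1)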
After splitting your data into features (not including cv_label ) and labels, you create the LabelKFold iterator and run whichever cross-validation function you need:
from sklearn import svm
from sklearn.cross_validation import LabelKFold, cross_val_predict

clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_val_predict(clf, features, labels, cv=lkf)
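Note that the snippet above targets scikit-learn 0.17, where LabelKFold lives in sklearn.cross_validation and takes the labels in its constructor. In scikit-learn 0.18 and later the same functionality is called GroupKFold, and the group labels are passed via the groups parameter instead; a rough equivalent:

from sklearn.model_selection import GroupKFold, cross_val_predict

gkf = GroupKFold(n_splits=5)
predicted = cross_val_predict(clf, features, labels, cv=gkf, groups=cv_labels)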