Scikit-learn: undersampling unbalanced data for cross-validation

How can I create random folds for cross-validation in scikit-learn?

Imagine that we have 20 samples of one class and 80 of another, and that we need to generate N train/test splits, each training set of size 30, such that every training set contains 50% of class 1 and 50% of class 2.

I found this discussion ( https://github.com/scikit-learn/scikit-learn/issues/1362 ), but I do not understand how to actually get the folds from it. Ideally, I think I need a function like this:

    cfolds = sklearn.cross_validation.imaginaryfunction(
        list(itertools.repeat(1, 20)) + list(itertools.repeat(2, 80)),
        n_iter=100, test_size=0.70)

What am I missing?

2 answers

There is no built-in way to do cross-validation with undersampled folds in scikit-learn, but there are two workarounds:

1. Use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) so that the class distribution in each fold reflects the distribution of the data, and then compensate for the imbalance inside the classifier with the class_weight parameter, which can either be set to 'auto' (called 'balanced' in newer releases) to weight classes inversely proportional to their frequencies, or be given a dictionary of explicit per-class weights. A sketch of this approach follows the list.

2. Write your own cross-validation procedure that undersamples the majority class; this is fairly simple to do with pandas (see the second sketch after this list).
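A minimal sketch of the first workaround, assuming a current scikit-learn release (StratifiedKFold now lives in sklearn.model_selection; older versions had it in sklearn.cross_validation) and using LogisticRegression purely as a placeholder classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(100, 5)              # toy features
    y = np.array([1] * 20 + [2] * 80)       # 20 samples of class 1, 80 of class 2

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        # Each fold keeps the original 20/80 class ratio; the imbalance is
        # handled inside the classifier via class_weight rather than by resampling.
        clf = LogisticRegression(class_weight='balanced')
        clf.fit(X[train_idx], y[train_idx])
        print(clf.score(X[test_idx], y[test_idx]))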
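And a minimal sketch of the second workaround with pandas; the DataFrame, the 'label' column and the balanced_splits helper are illustrative names, not anything from the original post. Each iteration draws 15 rows per class for training (30 in total) and uses the remaining 70 rows as the test set:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'label': [1] * 20 + [2] * 80,
                       'feature': np.random.rand(100)})

    def balanced_splits(df, n_iter=100, n_per_class=15, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_iter):
            # Draw the same number of rows from each class for the training set...
            train = (df.groupby('label', group_keys=False)
                       .apply(lambda g: g.sample(n=n_per_class, random_state=rng)))
            # ...and use everything else as the test set.
            test = df.drop(train.index)
            yield train.index.values, test.index.values

    for train_idx, test_idx in balanced_splits(df, n_iter=3):
        print(len(train_idx), len(test_idx))   # 30 train rows, 70 test rows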


Stratified CV is a good choice, but you can do something even simpler:

  • Randomly sample the data belonging to class 1 (you need to pick 15 of the 20 samples)
  • Do the same for class 2 (15 of the 80)
  • Repeat 100 times, or as many times as you need.

That's all: fast and efficient! A NumPy sketch of this recipe is shown below.
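A sketch of the recipe above in plain NumPy, with the 15/20 and 15/80 numbers taken from the question; balanced_subsamples is just an illustrative helper name:

    import numpy as np

    y = np.array([1] * 20 + [2] * 80)

    def balanced_subsamples(y, n_per_class=15, n_iter=100, seed=0):
        rng = np.random.RandomState(seed)
        class1 = np.where(y == 1)[0]
        class2 = np.where(y == 2)[0]
        for _ in range(n_iter):
            # Draw 15 training indices from each class without replacement;
            # everything left over becomes the test set.
            train = np.concatenate([rng.choice(class1, n_per_class, replace=False),
                                    rng.choice(class2, n_per_class, replace=False)])
            test = np.setdiff1d(np.arange(len(y)), train)
            yield train, test

    for train_idx, test_idx in balanced_subsamples(y, n_iter=3):
        print(len(train_idx), len(test_idx))   # -> 30 70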

