Scikit-learn: undersampling unbalanced data for cross-validation

How can I create random folds for cross-validation in scikit-learn?

Imagine that we have 20 samples of one class and 80 of another, and that we need to generate N train/test splits, each training set of size 30, such that every training set contains 50% of class 1 and 50% of class 2.

I found this discussion ( https://github.com/scikit-learn/scikit-learn/issues/1362 ), but I do not understand how to actually get the folds from it. Ideally, I think I need a function like this:

    cfolds = sklearn.cross_validation.imaginaryfunction(
        list(itertools.repeat(1, 20)) + list(itertools.repeat(2, 80)),
        n_iter=100, test_size=0.70)

What am I missing?

2 answers

There is no built-in way to do cross-validation with undersampled folds in scikit-learn, but there are two workarounds:

1. Use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) so that the class distribution in each fold reflects the distribution of the data, and then compensate for the imbalance inside the classifier with the class_weight parameter, which can either be set to 'auto' (called 'balanced' in newer releases) to weight classes inversely proportional to their frequencies, or be given a dictionary of explicit per-class weights. A sketch of this approach follows the list.

2. Write your own cross-validation procedure that undersamples the majority class; this is fairly simple to do with pandas (see the second sketch after this list).
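A minimal sketch of the first workaround, assuming a current scikit-learn release (StratifiedKFold now lives in sklearn.model_selection; older versions had it in sklearn.cross_validation) and using LogisticRegression purely as a placeholder classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(100, 5)              # toy features
    y = np.array([1] * 20 + [2] * 80)       # 20 samples of class 1, 80 of class 2

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        # Each fold keeps the original 20/80 class ratio; the imbalance is
        # handled inside the classifier via class_weight rather than by resampling.
        clf = LogisticRegression(class_weight='balanced')
        clf.fit(X[train_idx], y[train_idx])
        print(clf.score(X[test_idx], y[test_idx]))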
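And a minimal sketch of the second workaround with pandas; the DataFrame, the 'label' column and the balanced_splits helper are illustrative names, not anything from the original post. Each iteration draws 15 rows per class for training (30 in total) and uses the remaining 70 rows as the test set:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'label': [1] * 20 + [2] * 80,
                       'feature': np.random.rand(100)})

    def balanced_splits(df, n_iter=100, n_per_class=15, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_iter):
            # Draw the same number of rows from each class for the training set...
            train = (df.groupby('label', group_keys=False)
                       .apply(lambda g: g.sample(n=n_per_class, random_state=rng)))
            # ...and use everything else as the test set.
            test = df.drop(train.index)
            yield train.index.values, test.index.values

    for train_idx, test_idx in balanced_splits(df, n_iter=3):
        print(len(train_idx), len(test_idx))   # 30 train rows, 70 test rows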


Stratified CV is a good choice, but you can do something even simpler:

  • Randomly sample the data belonging to class 1 (you need to pick 15 of the 20 samples)
  • Do the same for class 2 (15 of the 80)
  • Repeat 100 times, or as many times as you need.

That's all: fast and efficient! A NumPy sketch of this recipe is shown below.
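A sketch of the recipe above in plain NumPy, with the 15/20 and 15/80 numbers taken from the question; balanced_subsamples is just an illustrative helper name:

    import numpy as np

    y = np.array([1] * 20 + [2] * 80)

    def balanced_subsamples(y, n_per_class=15, n_iter=100, seed=0):
        rng = np.random.RandomState(seed)
        class1 = np.where(y == 1)[0]
        class2 = np.where(y == 2)[0]
        for _ in range(n_iter):
            # Draw 15 training indices from each class without replacement;
            # everything left over becomes the test set.
            train = np.concatenate([rng.choice(class1, n_per_class, replace=False),
                                    rng.choice(class2, n_per_class, replace=False)])
            test = np.setdiff1d(np.arange(len(y)), train)
            yield train, test

    for train_idx, test_idx in balanced_subsamples(y, n_iter=3):
        print(len(train_idx), len(test_idx))   # -> 30 70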

