Trying to balance my dataset with sample_weight in scikit-learn

I use a RandomForest for classification and I have an unbalanced dataset: 5830 samples of class "no" and 1006 of class "yes". I am trying to balance my dataset with class_weight and sample_weight, but I can't.

My code is:

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    import numpy as np

    X_train, X_test, y_train, y_test = train_test_split(arrX, y, test_size=0.25)
    cw = 'auto'  # 'balanced' in newer scikit-learn versions
    clf = RandomForestClassifier(class_weight=cw)
    param_grid = {'n_estimators': [10, 50, 100, 200, 300],
                  'max_features': ['auto', 'sqrt', 'log2']}
    sw = np.array([1 if i == 0 else 8 for i in y_train])  # weight "yes" samples 8x
    # note: newer scikit-learn versions take sample_weight in fit() instead of fit_params
    CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                          fit_params={'sample_weight': sw})
    CV_clf.fit(X_train, y_train)

But my TPR, FPR, and ROC ratios do not improve when I use class_weight and sample_weight.

Why? Am I doing something wrong?

However, if I use a function called balanced_subsample, my ratios improve a lot:

    def balanced_subsample(x, y, subsample_size):
        # collect the samples of each class and find the size of the smallest class
        class_xs = []
        min_elems = None
        for yi in np.unique(y):
            elems = x[(y == yi)]
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]
        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems * subsample_size)
        # draw use_elems samples from every class
        xs = []
        ys = []
        for ci, this_xs in class_xs:
            if len(this_xs) > use_elems:
                np.random.shuffle(this_xs)
            x_ = this_xs[:use_elems]
            y_ = np.empty(use_elems)
            y_.fill(ci)
            xs.append(x_)
            ys.append(y_)
        xs = np.concatenate(xs)
        ys = np.concatenate(ys)
        return xs, ys

My new code is:

    X_train_subsampled, y_train_subsampled = balanced_subsample(arrX, y, 0.5)
    X_train, X_test, y_train, y_test = train_test_split(X_train_subsampled,
                                                        y_train_subsampled,
                                                        test_size=0.25)
    cw = 'auto'
    clf = RandomForestClassifier(class_weight=cw)
    param_grid = {'n_estimators': [10, 50, 100, 200, 300],
                  'max_features': ['auto', 'sqrt', 'log2']}
    sw = np.array([1 if i == 0 else 8 for i in y_train])
    CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                          fit_params={'sample_weight': sw})
    CV_clf.fit(X_train, y_train)

Thanks!

2 answers

This is not a complete answer, but I hope it helps to get there.

First, a few comments:
  • To debug this problem, it is often useful to have deterministic behavior. You can pass a random_state parameter to RandomForestClassifier and to the other scikit-learn objects that have inherent randomness, so that you get the same result on every run. You will also need:

    import numpy as np
    np.random.seed(0)  # any fixed seed value works
    import random
    random.seed(0)

to make your balanced_subsample function behave the same on every run.

  • Don't do a grid search on n_estimators: more trees are always better in a random forest.
  • Note that sample_weight and class_weight serve a similar purpose: the actual sample weights will be sample_weight * the weights derived from class_weight (see the sketch after this list).
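Here is a minimal sketch of that interplay on toy data (the 8x weight mirrors the question; everything else is made up for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 4)
    y = (rng.rand(200) < 0.15).astype(int)  # imbalanced toy labels

    # Weighting class 1 via class_weight...
    clf_cw = RandomForestClassifier(class_weight={0: 1, 1: 8}, random_state=0)
    clf_cw.fit(X, y)

    # ...should weight the samples the same way as an explicit sample_weight.
    clf_sw = RandomForestClassifier(random_state=0)
    clf_sw.fit(X, y, sample_weight=np.where(y == 1, 8.0, 1.0))

    # Using both at once multiplies them: each class-1 sample would count 8 * 8 = 64 times.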

Could you try:

  • Using subsample_size=1 in your balanced_subsample function. Unless there is a particular reason not to, it is better to compare results on a similar number of samples.
  • Using your subsampling strategy with class_weight and sample_weight both set to None, as in the sketch after this list.
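For the second suggestion, a sketch of the experiment, reusing arrX, y, and balanced_subsample from the question (the random_state values are arbitrary):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X_bal, y_bal = balanced_subsample(arrX, y, subsample_size=1)
    X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal,
                                                        test_size=0.25,
                                                        random_state=0)
    # No class_weight and no sample_weight: the subsampling already balanced the classes.
    clf = RandomForestClassifier(n_estimators=300, random_state=0)  # plenty of trees, per the advice above
    clf.fit(X_train, y_train)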

EDIT: Reading your comment again, I realize that your results are not so surprising! You get a better (higher) TPR, but also a worse (higher) FPR.
It just means that your classifier tries hard to get the class 1 samples right, and in doing so makes more false positives (while also getting more of them right, of course!). This trend will continue if you keep increasing the class/sample weights in the same direction.
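If it helps to quantify this, both rates can be read off the confusion matrix (a sketch; clf, X_test, and y_test stand for your fitted classifier and test split):

    from sklearn.metrics import confusion_matrix

    # For a binary problem, ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    tpr = tp / (tp + fn)  # rises as you weight class 1 more heavily...
    fpr = fp / (fp + tn)  # ...but so does this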


There is the imbalanced-learn API, which helps with over- and under-sampling data; it may be useful in this situation. You can pass your training set to one of its methods, and it will give you back the oversampled data. See a simple example below.

    from imblearn.over_sampling import RandomOverSampler

    ros = RandomOverSampler(random_state=1)
    x_oversampled, y_oversampled = ros.fit_sample(orig_x_data, orig_y_data)
    # note: newer imbalanced-learn versions name this method fit_resample

Here is the API link: http://contrib.scikit-learn.org/imbalanced-learn/api.html
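One caveat: apply the oversampling only to the training split, so that duplicated minority samples cannot leak into the test set. A sketch using the question's variable names:

    from imblearn.over_sampling import RandomOverSampler
    from sklearn.model_selection import train_test_split

    # split first, then oversample only the training portion
    X_train, X_test, y_train, y_test = train_test_split(arrX, y, test_size=0.25,
                                                        random_state=1)
    ros = RandomOverSampler(random_state=1)
    X_train_os, y_train_os = ros.fit_sample(X_train, y_train)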

Hope this helps!

