I am using RandomForest for classification, and my dataset is unbalanced: 5830 'no' samples versus 1006 'yes' samples. I am trying to compensate for the imbalance with class_weight and sample_weight, but I cannot get it to work.
My code is:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
from sklearn.grid_search import GridSearchCV             # sklearn.model_selection in newer versions

X_train, X_test, y_train, y_test = train_test_split(arrX, y, test_size=0.25)

cw = 'auto'   # renamed to 'balanced' in later scikit-learn versions
clf = RandomForestClassifier(class_weight=cw)

param_grid = {
    'n_estimators': [10, 50, 100, 200, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
}

# give the minority class (label 1) eight times the weight of the majority class (label 0)
sw = np.array([1 if i == 0 else 8 for i in y_train])

CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      fit_params={'sample_weight': sw})
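The fitting and evaluation code is not shown above; roughly, I do something like the following (a minimal sketch, with the metric computation reconstructed from the ratios I mention below, not exact code from my project):

from sklearn.metrics import confusion_matrix, roc_auc_score

# fit the grid search on the (weighted) training data
CV_clf.fit(X_train, y_train)

# hard predictions and predicted probabilities on the held-out test set
y_pred = CV_clf.predict(X_test)
y_proba = CV_clf.predict_proba(X_test)[:, 1]

# TPR/FPR from the confusion matrix, ROC AUC from the probabilities
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TPR:', tp / float(tp + fn))
print('FPR:', fp / float(fp + tn))
print('ROC AUC:', roc_auc_score(y_test, y_proba))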
However, my TPR, FPR, and ROC ratios do not improve when I use class_weight and sample_weight.
Why? Am I doing something wrong?
However, if I use a function called balanced_subsample, my ratios improve a lot:
def balanced_subsample(x, y, subsample_size):
    class_xs = []
    min_elems = None

    # group the samples by class and track the size of the smallest class
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # optionally keep only a fraction of the minority-class size
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # draw the same number of samples from every class
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
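As a quick sanity check (toy data with the same imbalance as my real set, purely illustrative), the function returns equally sized classes:

import numpy as np

# toy arrays mirroring the 5830/1006 class counts (illustrative only)
X_demo = np.random.rand(6836, 4)
y_demo = np.array([0] * 5830 + [1] * 1006)

X_bal, y_bal = balanced_subsample(X_demo, y_demo, 1.0)
print(np.bincount(y_bal.astype(int)))   # [1006 1006] -> both classes capped at the minority size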
My new code is:
X_train_subsampled, y_train_subsampled = balanced_subsample(arrX, y, 0.5)

X_train, X_test, y_train, y_test = train_test_split(X_train_subsampled, y_train_subsampled,
                                                    test_size=0.25)

cw = 'auto'
clf = RandomForestClassifier(class_weight=cw)

param_grid = {
    'n_estimators': [10, 50, 100, 200, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
}

sw = np.array([1 if i == 0 else 8 for i in y_train])

CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      fit_params={'sample_weight': sw})
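After this I fit the search exactly as before; I also check that the subsampled training labels are roughly balanced (a small sanity check, not part of my original code):

# the subsampled training labels should now contain roughly equal class counts
print(np.bincount(y_train.astype(int)))

# fit on the balanced training split, then inspect the chosen parameters
CV_clf.fit(X_train, y_train)
print(CV_clf.best_params_)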
Thanks.