Using scikit Random Forest sample_weights

I was trying to figure out how to use sample_weight with scikit-learn's Random Forest, and I cannot explain some of the results I am seeing. Essentially, I need it to balance a classification problem with unbalanced classes.

In particular, I expected that if I used a sample_weights array of all 1's, I would get the same result as with sample_weight=None. In addition, I expected that any array of equal weights (i.e. all 1's, all 10's, or all 0.8's...) would give the same result. Perhaps my intuition about the weights is wrong in this case.

Here is the code:

    import numpy as np
    from sklearn import ensemble, metrics, datasets

    # Create a synthetic dataset with unbalanced classes (roughly 90% / 10%)
    X, y = datasets.make_classification(
        n_samples=10000, n_features=20, n_informative=4, n_redundant=2,
        n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=[0.9],
        flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0,
        shuffle=True, random_state=0)

    model = ensemble.RandomForestClassifier()

    w0 = 1  # weight associated with 0's
    w1 = 1  # weight associated with 1's

    # I should split into train and validation sets, but for the sake of
    # understanding sample_weights I'll skip this step
    model.fit(X, y, sample_weight=np.array([w0 if r == 0 else w1 for r in y]))

    preds = model.predict(X)
    probas = model.predict_proba(X)

    ACC = metrics.accuracy_score(y, preds)
    precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
    fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
    ROC = metrics.auc(fpr, tpr)
    cm = metrics.confusion_matrix(y, preds)

    print("ACCURACY:", ACC)
    print("ROC:", ROC)
    print("F1 Score:", metrics.f1_score(y, preds))
    print("TP:", cm[1, 1], cm[1, 1] / (cm.sum() + 0.0))
    print("FP:", cm[0, 1], cm[0, 1] / (cm.sum() + 0.0))
    print("Precision:", cm[1, 1] / (cm[1, 1] + cm[0, 1] * 1.0))
    print("Recall:", cm[1, 1] / (cm[1, 1] + cm[1, 0] * 1.0))
  • With w0=w1=1 I get, for example, F1=0.9456 .
  • With w0=w1=10 I get, for example, F1=0.9569 .
  • With sample_weight=None I get, for example, F1=0.9474.
1 answer

When using the Random Forest algorithm, there is, as the name suggests, some randomness to it.

You get a different F1 score on each run because the Random Forest algorithm (RFA) trains each decision tree on a random subset of your data, and then averages over all of the trees. Therefore, I am not surprised that you get similar, but not identical, F1 scores across your runs.
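As a quick sanity check (a minimal sketch of my own, not from the original post, assuming that fixing random_state on the classifier pins down the bootstrap and feature sampling), you can rerun the fit with sample_weight=None, all 1's, and all 10's, and compare:

    import numpy as np
    from sklearn import datasets, ensemble, metrics

    X, y = datasets.make_classification(n_samples=10000, n_features=20,
                                        weights=[0.9], random_state=0)

    # Fix random_state so every fit draws the same bootstrap samples and
    # feature subsets; any remaining difference would come from the weights.
    for w in (None, np.ones(len(y)), 10.0 * np.ones(len(y))):
        model = ensemble.RandomForestClassifier(random_state=1)
        model.fit(X, y, sample_weight=w)
        print(metrics.f1_score(y, model.predict(X)))

If uniform weights really are equivalent to no weights, the three printed F1 scores should come out identical once the randomness is controlled.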

I have tried balancing weights before. You can balance the weights according to the size of each class in the population. For example, if you have two classes:

    Class A: 5 members
    Class B: 2 members

you can balance the weights by assigning 2/7 to each Class A member and 5/7 to each Class B member. However, this is just an idea for a starting point; how you weight your classes will depend on your problem.
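A minimal sketch of that scheme (my own illustration, not from the original post, assuming exactly two classes: each sample gets the other class's share of the population as its weight):

    import numpy as np

    y = np.array([0] * 5 + [1] * 2)  # Class A (label 0): 5 members, Class B (label 1): 2 members

    counts = np.bincount(y).astype(float)  # [5., 2.]
    total = counts.sum()                   # 7.0

    # Weight each sample by the share of the other class:
    # Class A members get 2/7, Class B members get 5/7.
    sample_weight = (total - counts[y]) / total
    print(sample_weight)

Newer scikit-learn versions can also apply a similar inverse-frequency weighting for you via the class_weight='balanced' option on RandomForestClassifier.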

