Scikit - changing the threshold to create multiple confusion matrices

I am building a classifier that goes through Lending Club loan data and selects the best X loans. I trained a random forest and created the usual ROC curves, confusion matrices, etc.

The confusion matrix takes the classifier's predictions as an argument (the majority vote of the trees in the forest). However, I want to print several confusion matrices at different thresholds, to see what happens if I pick the best 10% of loans, the best 20%, and so on.

I know from reading other questions that changing the threshold is often a bad idea, but is there another way to see the confusion matrix for these situations? (question A)

If I do go ahead and change the threshold, should I assume the best way is to call predict_proba, threshold the probabilities manually, and pass the result to confusion_matrix? (question B)
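For concreteness, the manual approach described here looks roughly like the sketch below (synthetic data from make_classification stands in for the loan data, and the 0.3 threshold is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Probability of the positive class for each test sample
proba = clf.predict_proba(X_test)[:, 1]

# Apply a manual threshold, then build the confusion matrix from it
threshold = 0.3
y_pred = (proba >= threshold).astype(int)
print(confusion_matrix(y_test, y_pred))
```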

1 answer

A. In your case, changing the threshold is permissible and perhaps even necessary. The default threshold is 50%, but from a business point of view even a 15% probability of default may be enough to reject an application.

In fact, credit scoring systems commonly apply different cutoffs for different product terms or customer segments after predicting the probability of default with a single general model (see, for example, chapter 9 of Naeem Siddiqi's "Credit Risk Scorecards").

B. There are two convenient ways to apply a threshold at an arbitrary alpha instead of 50%:

  • Yes: call predict_proba and threshold at alpha manually, or use a wrapper class (see the code below). Use this if you want to try multiple thresholds without refitting the model.
  • Before fitting the model, set class_weight to {0: alpha, 1: 1 - alpha}.
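A minimal sketch of the class_weight variant, again on synthetic data. Note that this reweights the classes during training rather than literally moving the decision threshold, so it only approximates a fixed-threshold cut:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

alpha = 0.3  # intended cutoff; the weighting biases the forest toward the positive class
weighted = RandomForestClassifier(
    class_weight={0: alpha, 1: 1 - alpha},
    random_state=1,
).fit(X_train, y_train)

print(confusion_matrix(y_test, weighted.predict(X_test)))
```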

Here is sample code for the wrapper:

 import numpy as np
 from sklearn.datasets import make_classification
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import confusion_matrix
 from sklearn.base import BaseEstimator, ClassifierMixin

 X, y = make_classification(random_state=1)
 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

 class CustomThreshold(BaseEstimator, ClassifierMixin):
     """Custom threshold wrapper for binary classification."""
     def __init__(self, base, threshold=0.5):
         self.base = base
         self.threshold = threshold

     def fit(self, *args, **kwargs):
         self.base.fit(*args, **kwargs)
         return self

     def predict(self, X):
         return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

 rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
 clf = [CustomThreshold(rf, threshold) for threshold in [0.3, 0.5, 0.7]]

 for model in clf:
     print(confusion_matrix(y_test, model.predict(X_test)))

 # Sanity checks: 0.5 matches the base predictions; a lower threshold
 # selects more positives, a higher one fewer.
 assert (clf[1].predict(X_test) == clf[1].base.predict(X_test)).all()
 assert sum(clf[0].predict(X_test)) > sum(clf[0].base.predict(X_test))
 assert sum(clf[2].predict(X_test)) < sum(clf[2].base.predict(X_test))

It will output 3 confusion matrices for different threshold values:

 [[13  1]
  [ 2  9]]
 [[14  0]
  [ 3  8]]
 [[14  0]
  [ 4  7]]
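Finally, since the question asks for the "best 10%, 20%, ..." of loans rather than a fixed probability cutoff, you can derive the cutoff from a quantile of the predicted probabilities instead of choosing alpha by hand. A sketch on the same kind of synthetic data (with ties, slightly more than the target fraction may be selected):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]
for top_fraction in [0.1, 0.2, 0.5]:
    # Threshold at the (1 - fraction) quantile so roughly the top X% are selected
    cutoff = np.quantile(proba, 1 - top_fraction)
    selected = (proba >= cutoff).astype(int)
    print(f"top {top_fraction:.0%}:")
    print(confusion_matrix(y_test, selected))
```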
