Sklearn LogisticRegression and changing the default threshold for classification

I am using LogisticRegression from the sklearn package and have a quick classification question. I built a ROC curve for my classifier, and it turns out that the optimal threshold for my training data is about 0.25. I assume that the default threshold for making predictions is 0.5. How can I change this default so I can check what accuracy my model gets under 10-fold cross-validation? Basically, I want my model to predict "1" whenever the predicted probability is greater than 0.25, not 0.5. I have looked through all the documentation and can't find anything.

Thanks in advance for your help.

4 answers

This is not a built-in feature. You can "add" it by wrapping the LogisticRegression class in a class of your own, with a threshold attribute that you use inside a custom predict() method.
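A minimal sketch of such a wrapper, assuming binary 0/1 labels (the class name ThresholdedClassifier and everything about it is illustrative, not an sklearn API):

    from sklearn.linear_model import LogisticRegression

    class ThresholdedClassifier:
        """Hypothetical wrapper: delegates fitting to an inner binary
        classifier and applies a custom probability threshold in predict()."""

        def __init__(self, clf, threshold=0.5):
            self.clf = clf              # any estimator with predict_proba()
            self.threshold = threshold

        def fit(self, X, y):
            self.clf.fit(X, y)
            return self

        def predict(self, X):
            # Label "1" whenever the positive-class probability reaches the threshold
            return (self.clf.predict_proba(X)[:, 1] >= self.threshold).astype(int)

    model = ThresholdedClassifier(LogisticRegression(), threshold=0.25)

Note that sklearn's own cross-validation utilities expect estimators to implement get_params()/set_params(), so for 10-fold cross-validation it is simpler to threshold predict_proba() by hand inside your own loop.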

However, some warnings:

  • The default threshold is actually 0. LogisticRegression.decision_function() returns the signed distance to the separating hyperplane. predict_proba() passes that distance through the logistic (sigmoid) function and thresholds the resulting probability at 0.5, which gives the same labels but is more expensive to compute (see the sketch after this list).
  • By choosing an "optimal" threshold this way, you are using post-hoc information, which biases your test set (that is, your hold-out or test set no longer provides an unbiased estimate of out-of-sample error). You may therefore introduce additional overfitting, unless you select the threshold inside the cross-validation loop, on your training set only, and then use both it and the trained classifier on your test set.
  • Consider using class_weight if you have an unbalanced problem, rather than manually setting the threshold. This should push the classifier's hyperplane farther away from the class of serious interest.
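To make the first point concrete, here is a small check (the dataset is synthetic and illustrative) showing that thresholding decision_function() at 0 and predict_proba() at 0.5 both reproduce predict():

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_classes=2, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # The logistic function maps a signed distance of 0 to a probability of 0.5,
    # so all three of these produce identical labels.
    from_distance = (clf.decision_function(X) > 0).astype(int)
    from_proba = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)
    assert (from_distance == from_proba).all()
    assert (from_distance == clf.predict(X)).all()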

I would like to give a practical answer:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score, roc_auc_score)

    X, y = make_classification(n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
                               n_features=20, n_samples=1000, random_state=10)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                        random_state=42)

    clf = LogisticRegression(class_weight="balanced")
    clf.fit(X_train, y_train)

    # Predict "1" whenever the positive-class probability exceeds 0.25
    THRESHOLD = 0.25
    preds = np.where(clf.predict_proba(X_test)[:, 1] > THRESHOLD, 1, 0)

    pd.DataFrame(data=[accuracy_score(y_test, preds), recall_score(y_test, preds),
                       precision_score(y_test, preds), roc_auc_score(y_test, preds)],
                 index=["accuracy", "recall", "precision", "roc_auc_score"])

By changing THRESHOLD to 0.25, you may find that recall and precision decrease. However, if you remove the class_weight argument, accuracy increases while recall falls. See the accepted answer above.
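Tying this back to the question's 10-fold cross-validation and the second warning in the accepted answer, here is a sketch, with illustrative names, that selects the threshold on each training fold only, using Youden's J statistic on the ROC curve as one common definition of the "optimal" threshold:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_curve
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                               n_samples=1000, random_state=10)

    fold_scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        clf = LogisticRegression().fit(X[train_idx], y[train_idx])

        # Choose the threshold on the *training* fold only
        fpr, tpr, thresholds = roc_curve(
            y[train_idx], clf.predict_proba(X[train_idx])[:, 1])
        threshold = thresholds[np.argmax(tpr - fpr)]  # Youden's J

        # Apply the trained classifier and its threshold to the held-out fold
        preds = np.where(clf.predict_proba(X[test_idx])[:, 1] > threshold, 1, 0)
        fold_scores.append(accuracy_score(y[test_idx], preds))

    print(np.mean(fold_scores))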


Special case: one-dimensional logistic regression

In the one-feature case, the value separating the region where a sample X is labeled 1 from the region where it is labeled 0 can be computed in closed form:

    import numpy as np
    from scipy.special import logit

    # clf is a LogisticRegression fitted on a single feature
    thresh = 0.1
    val = (logit(thresh) - clf.intercept_) / clf.coef_[0]

Predictions can then be computed directly from the raw feature:

    preds = np.where(X > val, 1, 0)
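As a quick sanity check, assuming clf was fitted on a single feature X of shape (n, 1) with a positive coefficient (for a negative coefficient the inequality flips), both routes should agree:

    # Thresholding the probability and comparing the raw feature against val
    # should produce identical labels
    from_proba = np.where(clf.predict_proba(X)[:, 1] > thresh, 1, 0)
    from_feature = np.where(X > val, 1, 0).ravel()
    assert (from_proba == from_feature).all()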

For completeness, I would like to mention another elegant way to generate predictions from scikit-learn's probability estimates, using binarize:

    import numpy as np
    from sklearn.preprocessing import binarize

    THRESHOLD = 0.25

    # These probabilities would come from logistic_regression.predict_proba()
    y_logistic_prob = np.random.uniform(size=10)

    # binarize expects a 2-D array, hence the reshape/ravel round trip
    predictions = binarize(y_logistic_prob.reshape(-1, 1),
                           threshold=THRESHOLD).ravel()

I also agree with the caveats Andreus raises in the accepted answer, especially the second and third. Be sure to keep them in mind.

