Scikit-learn confusion matrix

I can't tell whether I have set up my binary classification problem correctly. I designated the positive class as 1 and the negative class as 0. However, I understand that by default scikit-learn uses class 0 as the positive class in its confusion matrix (the opposite of how I set it up), and this confuses me. In scikit-learn's default layout, is the top row the positive or the negative class? Suppose the output of the confusion matrix is:

confusion_matrix(y_test, preds)
[[30  5]
 [ 2 42]]

How would this map onto the confusion matrix? In scikit-learn, are the actual instances the rows or the columns?

            prediction                            prediction
             0     1                               1     0
           ----- -----                           ----- -----
        0 | TN  | FP              (OR)        1 | TP  | FP
actual     ----- -----                 actual    ----- -----
        1 | FN  | TP                          0 | FN  | TN
4 answers

scikit-learn sorts the labels in ascending order, so 0 is the first row/column and 1 is the second:

>>> from sklearn.metrics import confusion_matrix as cm
>>> y_test = [1, 0, 0]
>>> y_pred = [1, 0, 0]
>>> cm(y_test, y_pred)
array([[2, 0],
       [0, 1]])
>>> y_pred = [4, 0, 0]
>>> y_test = [4, 0, 0]
>>> cm(y_test, y_pred)
array([[2, 0],
       [0, 1]])
>>> y_test = [-2, 0, 0]
>>> y_pred = [-2, 0, 0]
>>> cm(y_test, y_pred)
array([[1, 0],
       [0, 2]])

This is stated in the docs:

labels: array, shape = [n_classes], optional. A list of labels to index the matrix. This may be used to reorder or select a subset of labels. If none are given, those that appear at least once in y_true or y_pred are used in sorted order.

So you can change this behavior by passing labels when calling confusion_matrix:

>>> y_test = [1, 0, 0]
>>> y_pred = [1, 0, 0]
>>> cm(y_test, y_pred)
array([[2, 0],
       [0, 1]])
>>> cm(y_test, y_pred, labels=[1, 0])
array([[1, 0],
       [0, 2]])

And actual/predicted are arranged as in your diagrams: predictions are shown in the columns and actual values in the rows (see also the labeled sketch after the list below):

>>> y_test = [5, 5, 5, 0, 0, 0]
>>> y_pred = [5, 0, 0, 0, 0, 0]
>>> cm(y_test, y_pred)
array([[3, 0],
       [2, 1]])
  • true: 0, predicted: 0 (value: 3, position [0, 0])
  • true: 5, predicted: 0 (value: 2, position [1, 0])
  • true: 0, predicted: 5 (value: 0, position [0, 1])
  • true: 5, predicted: 5 (value: 1, position [1, 1])
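
To make that orientation explicit, here is a small sketch of my own (not from the original answer, and assuming pandas is installed) that wraps the same matrix in a pandas DataFrame with labeled rows and columns:

import pandas as pd
from sklearn.metrics import confusion_matrix

y_test = [5, 5, 5, 0, 0, 0]
y_pred = [5, 0, 0, 0, 0, 0]

# Rows are actual labels, columns are predicted labels, both in ascending order.
labels = [0, 5]
df = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=labels),
                  index=['actual 0', 'actual 5'],
                  columns=['pred 0', 'pred 5'])
print(df)
#           pred 0  pred 5
# actual 0       3       0
# actual 5       2       1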

Short answer: in binary classification, when using the labels argument,

 confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0], labels=[0,1]).ravel() 

the class labels 0 and 1 are treated as Negative and Positive, respectively. This comes from their order in the list, not from alphanumeric ordering.
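
For example, a minimal sketch (my addition, not part of the original answer) that unpacks the flattened counts into named variables in that order:

from sklearn.metrics import confusion_matrix

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0], labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 0 2 1 1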


Verification: consider imbalanced class labels, for example (the imbalance makes the counts easier to tell apart):

>>> y_true = [0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0]
>>> y_pred = [0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
>>> table = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

this will give you the confusion counts, flattened as follows:

>>> table
array([12,  1,  2,  1])

which corresponds to:

              Actual
             1       0
         ___________________
pred  1 | TP=1   | FP=1  |
      0 | FN=2   | TN=12 |

where FN=2 means there were 2 cases in which the model predicted a sample as negative (i.e. 0) while the actual label was positive (i.e. 1), so the False Negative count is 2.

Similarly, TN=12: in 12 cases the model correctly predicted the negative class (0), so the True Negative count is 12.

So everything is consistent with sklearn treating the first entry of labels=[0,1] as the negative class. Here 0, the first label, represents the negative class.
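
To underline that it really is the order of the labels list that decides this, here is a short sketch of my own (not from the original answer) that reverses labels on the same data, which reverses the roles in the flattened output:

from sklearn.metrics import confusion_matrix

y_true = [0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0]
y_pred = [0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]

# First label is treated as the negative class: order is tn, fp, fn, tp.
print(confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel())  # [12  1  2  1]

# Reversing the list puts 1 first, so the same counts come out as tp, fn, fp, tn.
print(confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel())  # [ 1  2  1 12]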


Following the example from Wikipedia: if a classification system has been trained to distinguish between cats and non-cats, the confusion matrix summarizes the results of testing the algorithm for further inspection. Assuming a sample of 27 animals - 8 cats and 19 non-cats - the resulting confusion matrix could look like the table below:

[Image: Wikipedia's cat/non-cat confusion matrix for this sample - 5 cats correctly predicted as cats, 2 non-cats predicted as cats, 3 cats predicted as non-cats, 17 non-cats correctly predicted as non-cats]

With sklearn

If you want to preserve the structure of the Wikipedia confusion matrix, pass the predicted values first and then the actual classes:

from sklearn.metrics import confusion_matrix

y_true = [0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,0]
y_pred = [0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0]

confusion_matrix(y_pred, y_true, labels=[1, 0])
Out[1]:
array([[ 5,  2],
       [ 3, 17]], dtype=int64)

Another way with a pandas crosstab

import numpy as np
import pandas as pd

true = pd.Categorical(list(np.where(np.array(y_true) == 1, 'cat', 'non-cat')),
                      categories=['cat', 'non-cat'])
pred = pd.Categorical(list(np.where(np.array(y_pred) == 1, 'cat', 'non-cat')),
                      categories=['cat', 'non-cat'])

pd.crosstab(pred, true, rownames=['pred'], colnames=['Actual'],
            margins=False, margins_name="Total")
Out[2]:
Actual   cat  non-cat
pred
cat        5        2
non-cat    3       17

I hope this helps.


Supporting answer:

When reading the confusion matrix values produced by sklearn.metrics, remember that the order of the values is

[[True Negative   False Positive]
 [False Negative  True Positive]]

If you misread the values, say taking TP for TN, your accuracy and AUC-ROC will more or less coincide, but your precision, recall, sensitivity and F1-score will take a hit, and you will end up with completely different metrics. This will lead you to misjudge your model's performance.

Be sure to clearly establish what 1 and 0 mean in your model. This largely dictates how you interpret the results of the confusion matrix.
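
Here is a minimal sketch (my own numbers, not from the answer) of the point above: swapping TP and TN when reading the matrix leaves accuracy unchanged but shifts precision, recall and F1:

# Correct reading of a matrix [[tn, fp], [fn, tp]] (hypothetical counts).
tn, fp, fn, tp = 90, 5, 10, 20

def metrics(tn, fp, fn, tp):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tn, fp, fn, tp))  # correct reading: accuracy 0.88, precision 0.80, recall ~0.67
print(metrics(tp, fp, fn, tn))  # TP/TN swapped:   accuracy 0.88, precision ~0.95, recall 0.90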

Experience:

I worked on fraud prediction (supervised binary classification), where fraud was labeled 1 and non-fraud 0. My model was trained on an upsampled, perfectly balanced dataset, so while testing over time the values of the confusion matrix did not seem suspicious when I assumed my results were in the order [TP FP] [FN TN].

Later, when I had to run an additional test on a new, imbalanced test set, I realized that the order I had assumed for the confusion matrix was wrong and differed from the one sklearn gives on its documentation page, which states the order as tn, fp, fn, tp. Plugging in the correct order made me realize the gross error and how it had affected my assessment of the model's performance.
