Python: how to normalize the confusion matrix?

I computed the confusion matrix for my classifier using the confusion_matrix() method from the sklearn package. The diagonal elements of the confusion matrix are the numbers of points for which the predicted label equals the true label, while the off-diagonal elements are those that are mislabeled by the classifier.

I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1, so that I can read the percentage of correctly classified samples off the matrix.

I found several methods for normalizing a matrix (normalizing rows and columns), but I don't know much about the math and am not sure whether this is the right approach. Can anyone help?

+18
python scikit-learn matrix normalization confusion-matrix
7 answers

I assume that M[i,j] means "elements of true class i were classified as class j". If it is the other way around, you will need to transpose everything I say. I will also use the following matrix for concrete examples:

    1 2 3
    4 5 6
    7 8 9

There are two things you can do:

Finding how each class gets classified

The first thing you can ask is what percentage of the elements of true class i get classified as each class. To do this, take the row corresponding to i and divide each element by the sum of that row. In our example, objects of class 2 are classified as class 1 four times, as class 2 five times, and as class 3 six times. To find the percentages, we simply divide everything by the sum 4 + 5 + 6 = 15:

    4/15 of the class 2 objects are classified as class 1
    5/15 of the class 2 objects are classified as class 2
    6/15 of the class 2 objects are classified as class 3
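As a quick sketch of this row-wise normalization in NumPy (M below is just the example matrix from above; nothing here is specific to scikit-learn):

    import numpy as np

    M = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

    # Divide each row by its sum; keepdims keeps the sums as a column vector
    # so the division broadcasts across each row.
    row_normalized = M / M.sum(axis=1, keepdims=True)
    print(row_normalized[1])  # approx. [0.2667, 0.3333, 0.4], i.e. 4/15, 5/15, 6/15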

Finding which classes are responsible for each classification

The second thing you can do is look at each prediction from your classifier and ask how many of those predictions come from each true class. This is similar to the other case, but uses columns instead of rows. In our example, the classifier returns "1" once when the true class is 1, 4 times when the true class is 2, and 7 times when the true class is 3. To find the percentages, we divide by the sum 1 + 4 + 7 = 12:

    1/12 of the objects classified as class 1 were from class 1
    4/12 of the objects classified as class 1 were from class 2
    7/12 of the objects classified as class 1 were from class 3
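And a matching sketch for the column-wise version (same assumed example matrix):

    import numpy as np

    M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

    # Divide each column by its sum; axis=0 sums down the columns.
    col_normalized = M / M.sum(axis=0, keepdims=True)
    print(col_normalized[:, 0])  # approx. [0.0833, 0.3333, 0.5833], i.e. 1/12, 4/12, 7/12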


Of course, both of the methods I described apply only to a single row or column at a time, and I am not sure it would be a good idea to actually modify your confusion matrix into this form. However, they should give the percentages you are looking for.

+8

Suppose that

    >>> from sklearn.metrics import confusion_matrix
    >>> y_true = [0, 0, 1, 1, 2, 0, 1]
    >>> y_pred = [0, 1, 0, 1, 2, 2, 1]
    >>> C = confusion_matrix(y_true, y_pred)
    >>> C
    array([[1, 1, 1],
           [1, 2, 0],
           [0, 0, 1]])

Then, to find out what fraction of the samples in each class received their correct label, you can do:

    >>> C / C.astype(float).sum(axis=1)
    array([[ 0.33333333,  0.33333333,  1.        ],
           [ 0.33333333,  0.66666667,  0.        ],
           [ 0.        ,  0.        ,  1.        ]])

The diagonal contains the required values. Another way to compute this is to realize that what you are computing is the recall per class:

    >>> from sklearn.metrics import precision_recall_fscore_support
    >>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
    >>> recall
    array([ 0.33333333,  0.66666667,  1.        ])

Similarly, if you divide by the sum along axis=0, you get the precision (the fraction of class-k predictions that have ground truth label k):

    >>> C / C.astype(float).sum(axis=0)
    array([[ 0.5       ,  0.33333333,  0.5       ],
           [ 0.5       ,  0.66666667,  0.        ],
           [ 0.        ,  0.        ,  0.5       ]])
    >>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
    >>> prec
    array([ 0.5       ,  0.66666667,  0.5       ])
+20

The matrix output by sklearn's confusion_matrix() is such that

C_{i,j} is equal to the number of observations known to be in group i but predicted to be in group j,

so to get the percentages for each class (often called specificity and sensitivity in binary classification), you need to normalize by row: replace each element in a row by itself divided by the sum of the elements of that row.
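A minimal sketch of that row-wise normalization (the y_true / y_pred arrays are toy placeholders for illustration):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 2, 0, 1]  # toy labels; use your own
    y_pred = [0, 1, 0, 1, 2, 2, 1]

    C = confusion_matrix(y_true, y_pred)
    # Divide each element by its row sum; each row of C_normalized sums to 1.
    C_normalized = C.astype(float) / C.sum(axis=1)[:, np.newaxis]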

Note that sklearn has a summary function that computes metrics from the confusion matrix: classification_report. It reports precision and recall rather than specificity and sensitivity, but those are often regarded as more informative in general (especially for imbalanced multi-class classification).
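For example, reusing the toy labels from the sketch above:

    from sklearn.metrics import classification_report

    y_true = [0, 0, 1, 1, 2, 0, 1]
    y_pred = [0, 1, 0, 1, 2, 2, 1]

    # Prints per-class precision, recall, f1-score and support.
    print(classification_report(y_true, y_pred))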

+7

From the sklearn documentation (plot example):

 cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 

where cm is the confusion matrix provided by sklearn.

+7

There is a plotting library, scikit-plot, that works on top of scikit-learn. It is based on matplotlib, which must already be installed in order to proceed.

 pip install scikit-plot 

Now just set the normalize parameter to True:

    import scikitplot as skplt

    skplt.metrics.plot_confusion_matrix(Y_TRUE, Y_PRED, normalize=True)
0

Using Seaborn, you can easily print a normalized and pretty confusion matrix with a heat map:


    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    cm = confusion_matrix(y_test, y_pred)
    # Normalise each row so it sums to 1
    cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show(block=False)
0

I think the easiest way to do this is:

    import sklearn.metrics

    c = sklearn.metrics.confusion_matrix(y, y_pred)
    normed_c = (c.T / c.astype(float).sum(axis=1)).T
0
