How to calculate precision, recall, accuracy and F1-score for the multiclass case with scikit-learn?

I am working on a sentiment analysis problem, and the data looks like this:

    label  instances
    5      1190
    4      838
    3      239
    1      204
    2      127

So my data is unbalanced, since 1190 instances are labelled 5. For classification I am using scikit-learn's SVC. The problem is that I do not know how to handle my unbalanced data correctly so that the precision, recall, accuracy and F1-score for the multiclass case are computed accurately. So I tried the following approaches:

Firstly:

    wclf = SVC(kernel='linear', C=1, class_weight={1: 10})
    wclf.fit(X, y)
    weighted_prediction = wclf.predict(X_test)

    print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
    print 'F1 score:', f1_score(y_test, weighted_prediction, average='weighted')
    print 'Recall:', recall_score(y_test, weighted_prediction, average='weighted')
    print 'Precision:', precision_score(y_test, weighted_prediction, average='weighted')
    print '\n classification report:\n', classification_report(y_test, weighted_prediction)
    print '\n confusion matrix:\n', confusion_matrix(y_test, weighted_prediction)

Secondly:

    auto_wclf = SVC(kernel='linear', C=1, class_weight='auto')
    auto_wclf.fit(X, y)
    auto_weighted_prediction = auto_wclf.predict(X_test)

    print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
    print 'F1 score:', f1_score(y_test, auto_weighted_prediction, average='weighted')
    print 'Recall:', recall_score(y_test, auto_weighted_prediction, average='weighted')
    print 'Precision:', precision_score(y_test, auto_weighted_prediction, average='weighted')
    print '\n classification report:\n', classification_report(y_test, auto_weighted_prediction)
    print '\n confusion matrix:\n', confusion_matrix(y_test, auto_weighted_prediction)

Third:

    clf = SVC(kernel='linear', C=1)
    clf.fit(X, y)
    prediction = clf.predict(X_test)

    from sklearn.metrics import precision_score, \
        recall_score, confusion_matrix, classification_report, \
        accuracy_score, f1_score

    print 'Accuracy:', accuracy_score(y_test, prediction)
    print 'F1 score:', f1_score(y_test, prediction)
    print 'Recall:', recall_score(y_test, prediction)
    print 'Precision:', precision_score(y_test, prediction)
    print '\n classification report:\n', classification_report(y_test, prediction)
    print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)

which prints, among other things:

    F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
      sample_weight=sample_weight)
    (the same DeprecationWarning is raised twice more, from classification.py:1172 and classification.py:1082)
    0.930416613529

However, I get warnings like this:

 /usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1" 

How can I handle my unbalanced data correctly so that the classifier metrics are computed the right way?

+84
python scikit-learn artificial-intelligence machine-learning nlp
Jul 15 '15 at 4:17
4 answers

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics, bear with me ;).

Class weight

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will differ simply because the classifier is different.

Basically, in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during training, the classifier will make extra efforts to classify the classes with high weights properly.
How they do that is algorithm-specific. If you want details about how it works for SVC and the documentation does not make sense to you, feel free to mention it.
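
As a minimal sketch of what that looks like in practice (the weight values here are purely illustrative, not tuned for the question's data):

    from sklearn.svm import SVC

    # Hypothetical weights: make errors on the rare classes (1 and 2) cost
    # more during training. These numbers are only for illustration.
    weighted_clf = SVC(kernel='linear', C=1,
                       class_weight={1: 5, 2: 5, 3: 2, 4: 1, 5: 1})

    # 'balanced' asks scikit-learn to derive weights inversely proportional
    # to class frequencies ('auto', used in the question, is the older
    # spelling of the same idea).
    balanced_clf = SVC(kernel='linear', C=1, class_weight='balanced')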

Metrics

Once you have a classifier, you want to know how well it performs. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score ...

Usually, when the class distribution is unbalanced, accuracy is considered a poor choice, because it gives high scores to models that simply predict the most frequent class.
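
As a quick, made-up illustration of that point, a "classifier" that always predicts the majority class can still look decent on accuracy:

    from sklearn.metrics import accuracy_score, f1_score

    # Toy labels skewed towards class 5, and a model that always predicts 5.
    y_true = [5, 5, 5, 5, 5, 5, 5, 5, 4, 1]
    y_pred = [5] * len(y_true)

    print(accuracy_score(y_true, y_pred))             # 0.8: looks respectable
    # Much lower, since the minority classes are completely missed.
    # (scikit-learn will warn that precision is ill-defined for classes never predicted)
    print(f1_score(y_true, y_pred, average='macro'))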

I will not describe all of these metrics in detail, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this printed classification report, they are defined for each class. They rely on concepts such as true positives or false negatives, which require defining which class is the positive one.

                 precision    recall  f1-score   support

              0       0.65      1.00      0.79        17
              1       0.57      0.75      0.65        16
              2       0.33      0.06      0.10        17

    avg / total       0.52      0.60      0.51        50

Warning

 F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1". 

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the classification report above, how do you output one global number for the f1-score? You could:

  • Take the average of the f1-score for each class: that is the avg / total result above. This is also called macro averaging.
  • Compute the f1-score using the global counts of true positives / false negatives, etc. (you sum the numbers of true positives / false negatives for each class). Aka micro averaging.
  • Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class is in the computation.

These are 3 of the options in scikit-learn, and the warning is there to say you have to pick one. So you have to specify an average argument for the score method.
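
A minimal sketch of the three choices (toy labels here; in the question these would be y_test and the SVC's predictions):

    from sklearn.metrics import f1_score

    # Illustrative labels only.
    y_test = [5, 5, 5, 4, 4, 3, 2, 1]
    prediction = [5, 5, 4, 4, 4, 3, 2, 1]

    print(f1_score(y_test, prediction, average='macro'))     # unweighted mean of per-class F1
    print(f1_score(y_test, prediction, average='micro'))     # global counts of TP/FP/FN
    print(f1_score(y_test, prediction, average='weighted'))  # per-class F1 weighted by support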

Which option you choose depends on how you want to measure the performance of the classifier: for instance, macro averaging does not take class imbalance into account, and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 will get more importance.

The way these metric arguments are specified is not very clear in scikit-learn right now; according to the docs it will get better in version 0.18. They are removing some non-obvious default behaviour, and they are issuing warnings so that developers notice it.

Scoring

The last thing I want to mention (feel free to skip it if you are aware of it) is that scores are only meaningful if they are computed on data the classifier has never seen. This is extremely important, since any score you get on data that was used in fitting the classifier is completely irrelevant.

Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.

    from sklearn.datasets import make_classification
    from sklearn.cross_validation import StratifiedShuffleSplit
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

    # We use a utility to generate artificial classification data.
    X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
    sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
    svc = SVC(kernel='linear', C=1)  # the classifier being evaluated (missing from the original snippet)

    for train_idx, test_idx in sss:
        X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
        svc.fit(X_train, y_train)
        y_pred = svc.predict(X_test)
        print(f1_score(y_test, y_pred, average="macro"))
        print(precision_score(y_test, y_pred, average="macro"))
        print(recall_score(y_test, y_pred, average="macro"))

Hope this helps.

+125
Jul 22 '15 at 23:44

There are a lot of very detailed answers here, but I don't think you are answering the right questions. As I understand it, there are two concerns:

  • How to score a multiclass problem?
  • How do I work with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with both multiclass problems and single-class problems. For example:

    from sklearn.metrics import precision_recall_fscore_support as score

    predicted = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 5]
    y_test = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 1]

    precision, recall, fscore, support = score(y_test, predicted)

    print('precision: {}'.format(precision))
    print('recall: {}'.format(recall))
    print('fscore: {}'.format(fscore))
    print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

    | Label | Precision | Recall | FScore | Support |
    |-------|-----------|--------|--------|---------|
    | 1     | 94%       | 83%    | 0.88   | 204     |
    | 2     | 71%       | 50%    | 0.54   | 127     |
    | ...   | ...       | ...    | ...    | ...     |
    | 4     | 80%       | 98%    | 0.89   | 838     |
    | 5     | 93%       | 81%    | 0.91   | 1190    |

Then...

2.

... you can tell whether the unbalanced data is even a problem. If the scores for the less represented classes (classes 1 and 2) are lower than for the classes with more training samples (classes 4 and 5), then you know the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the data you want to predict on has the same class distribution, your unbalanced training data is a good representative of that data, and hence the imbalance is a good thing.

+57
Jul 23 '15 at 12:35

Posed question

Answering the question "which metric should be used for multi-class classification with unbalanced data": the macro-F1-measure. Macro precision and macro recall can also be used, but they are not as easily interpretable as for binary classification, they are already incorporated into the F-measure, and excess metrics complicate comparing methods, tuning parameters, and so on.

Micro-averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, the micro-averaged metrics will still show good results.
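
A made-up sketch of that sensitivity: a model that only gets the dominant class right still earns a high micro-averaged F1, while the macro average exposes it:

    from sklearn.metrics import f1_score

    # 9 examples of the dominant class 5, one each of classes 1-3.
    y_true = [5] * 9 + [1, 2, 3]
    # The model nails class 5 and labels everything else as 5 too.
    y_pred = [5] * 12

    # (scikit-learn will warn that precision is ill-defined for classes never predicted)
    print(f1_score(y_true, y_pred, average='micro'))  # high: dominated by class 5
    print(f1_score(y_true, y_pred, average='macro'))  # low: classes 1-3 pull it down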

Weighted averaging is not well suited for unbalanced data either, because it weights by the counts of labels. Moreover, it is hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey, which I strongly recommend looking through:

Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.

Application Specific Question

However, returning to your task, I would investigate 2 topics:

  • metrics that are commonly used for your specific task - this lets you (a) compare your method with others and understand whether you are doing something wrong, and (b) not research this on your own and reuse someone else's findings;
  • the cost of the different errors of your method - for instance, the use case of your application may rely on 4- and 5-star reviews only - in this case, a good metric should count only these 2 labels.

Commonly used metrics. As I can infer after looking through the literature, there are 2 main evaluation metrics:

  • Accuracy, which is used, e.g., in

Yu, April and Daryl Chang. "Predicting Yelp Business Multi-Class Sentiments."

( link ) - note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.

( link )

  • MSE (or, less frequently, Mean Absolute Error, MAE) - see, for instance,

Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS N 224 (2010).

( link ) - they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. No. EPFL-CONF-200899. 2014.

( link ) - they use scikit-learn for evaluation and baseline approaches and state that their code is available; however, I cannot find it, so if you need it, write to the authors; the work is quite new and seems to be written in Python.

The cost of different errors. If you care more about avoiding gross blunders, e.g., assigning a 1-star rating to a 5-star review or something like that, look at MSE; if the difference matters but is not that important, try MAE, since it does not square the difference; otherwise stay with accuracy.
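
As a rough sketch of what these error-based views measure, on hypothetical star ratings:

    from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

    # Hypothetical true and predicted star ratings, for illustration only.
    y_true = [5, 5, 4, 3, 2, 1]
    y_pred = [4, 5, 5, 3, 1, 3]

    print(mean_squared_error(y_true, y_pred))   # punishes the 1-vs-3 blunder quadratically
    print(mean_absolute_error(y_true, y_pred))  # linear penalty for each star of distance
    print(accuracy_score(y_true, y_pred))       # only counts exact matches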

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
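
For instance, a minimal, untuned sketch of treating the 1-5 labels as a regression target and rounding back to stars (random toy data standing in for the real features):

    import numpy as np
    from sklearn.svm import SVR

    # Toy feature matrix and 1-5 star targets; use your real X, y in practice.
    X = np.random.rand(20, 4)
    y = np.random.randint(1, 6, size=20)

    svr = SVR(kernel='linear', C=1)
    svr.fit(X, y)

    # Round and clip the continuous output back onto the 1-5 star scale.
    raw = svr.predict(X)
    stars = np.clip(np.rint(raw), 1, 5).astype(int)
    print(stars)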

+15
Jul 22 '15 at 17:53

First of all, it is a little harder to tell whether your data is unbalanced just from instance counts. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know. So it is always better to use all of your available knowledge and choose its status wisely.

Okay, what if it really is unbalanced?
Once again - look at your data. Sometimes you can find one or two observations multiplied by hundreds of times. Sometimes it is useful to create such fake single-class observations.
If all the data is clean, the next step is to use class weights in the prediction model.
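
One simple way to create such duplicated minority-class observations is naive oversampling, e.g. with scikit-learn's resample utility (a sketch on made-up data; the class picked here is purely illustrative):

    import numpy as np
    from sklearn.utils import resample

    # Toy data: class 2 is the under-represented one here.
    X = np.random.rand(30, 4)
    y = np.array([5] * 20 + [4] * 7 + [2] * 3)

    # Duplicate minority-class rows (with replacement) up to a target count.
    X_min, y_min = X[y == 2], y[y == 2]
    X_up, y_up = resample(X_min, y_min, replace=True, n_samples=10, random_state=0)

    X_balanced = np.vstack([X[y != 2], X_up])
    y_balanced = np.concatenate([y[y != 2], y_up])
    print(np.bincount(y_balanced))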

What about multiclass metrics?
In my experience, none of those metrics are commonly used. There are two main reasons.
First: it is always better to work with probabilities than with hard predictions (because otherwise, how could you distinguish models that predict 0.9 and 0.6 if they both assign the same class?).
And second: it is much easier to compare your prediction models and build new ones depending on only one good metric.
From my experience I would recommend logloss or MSE (mean squared error).
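
A sketch of scoring with log loss, assuming you switch the SVC to probability estimates via probability=True (toy data only, not the questioner's features):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import log_loss

    # Toy stand-in data; probability=True makes SVC expose predict_proba
    # (it fits an internal calibration step, so it is slower to train).
    X = np.random.rand(40, 4)
    y = np.array([1, 2, 3, 4, 5] * 8)

    clf = SVC(kernel='linear', C=1, probability=True)
    clf.fit(X, y)

    # On real data, compute this on a held-out test set, not the training set.
    proba = clf.predict_proba(X)
    print(log_loss(y, proba, labels=clf.classes_))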

How to fix the sklearn warnings?
Simply (as yangjie noticed) overwrite the average parameter with one of these values: 'micro' (compute metrics globally), 'macro' (compute metrics for each label) or 'weighted' (the same as macro, but with automatic weights).

 f1_score(y_test, prediction, average='weighted') 

All your warnings came from calling the metric functions without an explicit average value; the old default is deprecated, and the new default, 'binary', is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!

Edit:
I found another answer recommending regression approaches (e.g. SVR), which I cannot agree with. As far as I remember, there is no such thing as multiclass regression. Yes, there is multi-output regression, which is very different, and yes, in some cases it is possible to switch between regression and classification (if the classes are somehow sorted), but it is pretty rare.

What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

After that, you can compute the arithmetic or geometric mean of the predictions, and most of the time you will get an even better result.

 final_prediction = (KNNprediction * RFprediction) ** 0.5 
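
A hedged sketch of that blending with predicted probabilities (RandomForest and KNeighbors on toy data; model names and parameters are illustrative, not tuned):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data standing in for the real features and 1-5 labels.
    X = np.random.rand(50, 4)
    y = np.array([1, 2, 3, 4, 5] * 10)

    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

    # Geometric mean of the two probability estimates, then pick the best class.
    blended = np.sqrt(rf.predict_proba(X) * knn.predict_proba(X))
    final_prediction = rf.classes_[np.argmax(blended, axis=1)]
    print(final_prediction[:10])
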
+12
Jul 22 '15 at 8:54