What is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit-learn?

Can someone explain (perhaps with an example) the difference between OneVsRestClassifier and MultiOutputClassifier in scikit-learn?

From the documentation, I understand that we use:

  • OneVsRestClassifier - when we want to do multiclass or multilabel classification, and the strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes. (This is pretty clear, and it means that the multiclass/multilabel classification problem is broken down into multiple binary classification problems.)
  • MultiOutputClassifier - when we want to do multioutput classification (what is that?), and the strategy consists of fitting one classifier per target (what does target mean there?)

I have already used OneVsRestClassifier for multilabel classification, and I understand how it works, but then I found MultiOutputClassifier and I can't understand how it works differently from OneVsRestClassifier.

python scikit-learn classification multilabel-classification
1 answer

Multiclass Classification

To better illustrate the differences, suppose your goal is to classify SO questions into n_classes different, mutually exclusive classes. For simplicity, we will only consider four classes in this example: 'Python' , 'Java' , 'C++' and 'Other language' . Suppose you have a dataset formed by just six SO questions, and the class labels of those questions are stored in an array y as follows:

    import numpy as np

    y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])

The situation described above is usually referred to as multiclass classification (also known as multinomial classification). To fit a classifier and validate the model through the scikit-learn library, you need to convert the text class labels into numeric labels. For that you can use LabelEncoder :

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    y_numeric = le.fit_transform(y)

Here's how the labels for your dataset are encoded:

    In [220]: y_numeric
    Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)

where these numbers denote the indices of the following array:

    In [221]: le.classes_
    Out[221]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S14')
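As a side note, LabelEncoder can invert the mapping at any point, which is handy for turning predictions back into readable labels:

    # Recover the original string labels from the numeric ones:
    le.inverse_transform(y_numeric)
    # -> array(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'], ...)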

An important special case is when there are only two classes, i.e. n_classes = 2 . This is commonly referred to as binary classification .
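To connect this to the classifier in question, here is a minimal sketch of OneVsRestClassifier on the multiclass problem above. The feature matrix X is made up purely for illustration (the example above only defines labels), and the choice of LogisticRegression as the base estimator is arbitrary:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(6, 5)  # six questions, five made-up numeric features

    # One binary classifier is fitted per class; prediction picks the
    # class whose classifier is most confident.
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, y_numeric)
    len(clf.estimators_)  # 4 -- one binary classifier per class
    clf.predict(X)        # one class index (0..3) per sample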

Multilabel Classification

Suppose you want to perform this multiclass classification using n_classes binary classifiers, where n_classes is the number of different classes. Each of these binary classifiers decides whether an item belongs to a specific class or not. In this case you cannot encode the class labels as integers from 0 to n_classes - 1 ; instead, you need to create a 2D indicator matrix. Consider that sample n belongs to class k . Then entry [n, k] of the indicator matrix is 1 , and the remaining elements in row n are 0 . It is important to note that if the classes are not mutually exclusive, a row can contain several 1 's. This approach is called multilabel classification, and it can be easily implemented through MultiLabelBinarizer :

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()
    y_indicator = mlb.fit_transform(y[:, None])

The indicator matrix looks like this:

    In [225]: y_indicator
    Out[225]:
    array([[0, 1, 0, 0],
           [1, 0, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 1],
           [1, 0, 0, 0],
           [0, 0, 0, 1]])

and the column numbers where the 1 's appear are actually indices into this array:

    In [226]: mlb.classes_
    Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
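OneVsRestClassifier accepts this indicator matrix directly, which is exactly how it supports multilabel problems: each of the per-class binary classifiers votes independently, so several classes can be predicted for the same sample. A minimal sketch, reusing the made-up X from the earlier snippet:

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # With a 2D indicator target, prediction also returns an indicator
    # matrix rather than a single class per sample.
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, y_indicator)
    clf.predict(X)  # shape (6, 4): one 0/1 indicator row per sample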

Multioutput Classification

What if you want to classify a given SO question according to two different criteria at the same time, for example, language and application? In this case, you intend to do multioutput classification. For simplicity, I will only consider three application classes, namely 'Computer Vision' , 'Speech Recognition' and 'Other Application' . The label array of your dataset should be two-dimensional:

    y2 = np.asarray([['Java', 'Computer Vision'],
                     ['C++', 'Speech Recognition'],
                     ['Other language', 'Computer Vision'],
                     ['Python', 'Other Application'],
                     ['C++', 'Speech Recognition'],
                     ['Python', 'Computer Vision']])

Again, we need to convert the text class labels into numeric labels. As far as I know, this functionality is not yet implemented in scikit-learn, so you will need to write your own code. This thread describes some clever ways to do that, but for the purposes of this post the following one-liner is sufficient:

    # Encode each label column separately, then stack the results side by side
    # (a list is used because np.vstack no longer accepts a bare generator):
    y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T

The encoded labels look like this:

    In [229]: y_multi
    Out[229]:
    array([[1, 0],
           [0, 2],
           [2, 0],
           [3, 1],
           [0, 2],
           [3, 0]], dtype=int64)

And the meaning of the values in each column can be inferred from the following arrays:

    In [230]: le.fit(y2[:, 0]).classes_
    Out[230]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S18')

    In [231]: le.fit(y2[:, 1]).classes_
    Out[231]: array(['Computer Vision', 'Other Application', 'Speech Recognition'], dtype='|S18')
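This 2D label array is what MultiOutputClassifier is designed for: it fits one (possibly multiclass) classifier per target column. A minimal sketch, again with the made-up X from above; the RandomForestClassifier base estimator is an arbitrary choice:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multioutput import MultiOutputClassifier

    # One full classifier per target column: one for language,
    # one for application.
    clf = MultiOutputClassifier(RandomForestClassifier()).fit(X, y_multi)
    len(clf.estimators_)  # 2 -- one classifier per target column
    clf.predict(X)        # shape (6, 2): a (language, application) pair per sample

So, in short: OneVsRestClassifier breaks a single multiclass or multilabel problem into one binary classifier per class, while MultiOutputClassifier fits one classifier per output column of a 2D target.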
