Multiclass Classification
To better illustrate the differences, suppose your goal is to classify SO questions into n_classes different, mutually exclusive classes. For simplicity, we will look at only four classes in this example: 'Python', 'Java', 'C++' and 'Other language'. Suppose you have a dataset formed by only six SO questions, and the class labels of these questions are stored in an array y as follows:
import numpy as np

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
The situation described above is usually called multiclass classification (also known as multinomial classification). To fit a classifier and validate the model with scikit-learn, you need to convert the text class labels to numeric labels. For this you can use LabelEncoder:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_numeric = le.fit_transform(y)
Here's how the labels for your dataset are encoded:
In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)
where these numbers denote indices into the following array:
In [221]: le.classes_
Out[221]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S14')
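The mapping is reversible: LabelEncoder also provides inverse_transform, which maps the integer codes back to the original strings. A minimal sketch using the same data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
le = LabelEncoder()
y_numeric = le.fit_transform(y)  # classes are sorted alphabetically, then coded 0..3

# inverse_transform recovers the original string labels from the codes
recovered = le.inverse_transform(y_numeric)
print(list(recovered))
```

This is handy after prediction, when the classifier returns numeric labels that you want to display as class names.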
An important special case is the presence of only two classes, i.e. n_classes = 2 . This is commonly referred to as binary classification .
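For the binary case the same machinery applies; the encoder simply produces codes 0 and 1. A small sketch with a hypothetical binary task (is the question about Python or not?):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary labels: 'Python' vs. 'Not Python'
y_bin = ['Python', 'Not Python', 'Python', 'Python', 'Not Python']

le = LabelEncoder()
codes = le.fit_transform(y_bin)  # two classes -> codes 0 and 1
print(list(codes))
```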
Multilabel Classification
Suppose you want to perform this classification using n_classes binary classifiers, one per class, where n_classes is the number of different classes. Each of these binary classifiers decides whether a sample belongs to its class or not. In this case, you cannot encode the class labels as integers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. If sample n has class k, then entry [n, k] of the indicator matrix is 1, and the remaining entries in row n are 0. It is important to note that if the classes are not mutually exclusive, a row can contain several 1's. This approach is called multilabel classification and can be easily implemented through MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])
The indicator matrix looks like this:
In [225]: y_indicator
Out[225]:
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
and the column numbers where the 1's appear are the indices into this array:
In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
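As with LabelEncoder, the indicator matrix can be decoded again: MultiLabelBinarizer's inverse_transform returns one tuple of labels per row (here each tuple has length 1, because our classes are mutually exclusive). A short sketch:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])

# y[:, None] turns the 1-D string array into a column of one-label lists;
# passing the strings directly would split them into characters
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])

# each row of the indicator matrix decodes back to a tuple of labels
decoded = mlb.inverse_transform(y_indicator)
print(decoded)
```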
Multioutput Classification
What if you want to classify a specific SO question according to two different criteria at the same time, for example, language and application? In this case, you intend to do multioutput classification. For simplicity, I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be two-dimensional:
y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])
Again, we need to convert the text class labels to numeric labels. As far as I know, this functionality is not yet implemented in scikit-learn, so you will need to write your own code. This thread describes some clever ways to do it, but for the purposes of this post, the following one-liner is sufficient:
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T
The encoded labels look like this:
In [229]: y_multi
Out[229]:
array([[1, 0],
       [0, 2],
       [2, 0],
       [3, 1],
       [0, 2],
       [3, 0]], dtype=int64)
And the meaning of the values in each column can be inferred from the following arrays:
In [230]: le.fit(y2[:, 0]).classes_
Out[230]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S18')

In [231]: le.fit(y2[:, 1]).classes_
Out[231]: array(['Computer Vision', 'Other Application', 'Speech Recognition'], dtype='|S18')
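One caveat with the one-liner above: it refits the same LabelEncoder for each column, so at the end le only remembers the mapping of the last column. A sketch of a variant that keeps one encoder per output column, so each mapping stays available for decoding predictions later (the variable names here are just illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# one encoder per output column, kept around for later inverse_transform
encoders = [LabelEncoder().fit(y2[:, i]) for i in range(y2.shape[1])]
y_multi = np.vstack([enc.transform(y2[:, i])
                     for i, enc in enumerate(encoders)]).T

# decode one row of codes, e.g. [1, 0] -> ['Java', 'Computer Vision']
decoded = [enc.inverse_transform([code])[0]
           for enc, code in zip(encoders, y_multi[0])]
print(decoded)
```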