Multiclass Classification
To better illustrate the differences, suppose your goal is to classify SO questions into n_classes different, mutually exclusive classes. For simplicity, we will look at only four classes in this example: 'Python', 'Java', 'C++' and 'Other language'. Suppose you have a dataset formed by only six SO questions, and the class labels of these questions are stored in an array y as follows:
import numpy as np

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
The situation described above is usually called multiclass classification (also known as multinomial classification). To fit a classifier and validate the model with scikit-learn, you need to convert the text class labels to numeric labels. For this you can use LabelEncoder:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_numeric = le.fit_transform(y)
Here's how the labels for your dataset are encoded:
In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)
where these numbers denote indices into the following array:
In [221]: le.classes_
Out[221]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S14')
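The mapping is reversible: LabelEncoder also provides inverse_transform, which maps the integer codes back to the original strings. A minimal sketch using the same data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
le = LabelEncoder()
y_numeric = le.fit_transform(y)  # classes are sorted alphabetically, then coded 0..3

# inverse_transform recovers the original string labels from the codes
recovered = le.inverse_transform(y_numeric)
print(list(recovered))
```

This is handy after prediction, when the classifier returns numeric labels that you want to display as class names.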
An important special case is the presence of only two classes, i.e. n_classes = 2 . This is commonly referred to as binary classification .
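For the binary case the same machinery applies; the encoder simply produces codes 0 and 1. A small sketch with a hypothetical binary task (is the question about Python or not?):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary labels: 'Python' vs. 'Not Python'
y_bin = ['Python', 'Not Python', 'Python', 'Python', 'Not Python']

le = LabelEncoder()
codes = le.fit_transform(y_bin)  # two classes -> codes 0 and 1
print(list(codes))
```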
Multilabel Classification
Suppose you want to perform this classification using n_classes binary classifiers, one per class, where n_classes is the number of different classes. Each of these binary classifiers decides whether a sample belongs to its class or not. In this case, you cannot encode the class labels as integers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. If sample n has class k, then entry [n, k] of the indicator matrix is 1, and the remaining entries in row n are 0. It is important to note that if the classes are not mutually exclusive, a row can contain several 1's. This approach is called multilabel classification and can be easily implemented through MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])
The indicator matrix looks like this:
In [225]: y_indicator
Out[225]:
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
and the column numbers where the 1's appear are the indices into this array:
In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
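As with LabelEncoder, the indicator matrix can be decoded again: MultiLabelBinarizer's inverse_transform returns one tuple of labels per row (here each tuple has length 1, because our classes are mutually exclusive). A short sketch:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])

# y[:, None] turns the 1-D string array into a column of one-label lists;
# passing the strings directly would split them into characters
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])

# each row of the indicator matrix decodes back to a tuple of labels
decoded = mlb.inverse_transform(y_indicator)
print(decoded)
```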
Multioutput Classification
What if you want to classify a specific SO question according to two different criteria at the same time, for example, language and application? In this case, you intend to do multioutput classification. For simplicity, I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be two-dimensional:
y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])
Again, we need to convert the text class labels to numeric labels. As far as I know, this functionality is not yet implemented in scikit-learn, so you will need to write your own code. This thread describes some clever ways to do it, but for the purposes of this post, the following one-liner is sufficient:
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T
The encoded labels look like this:
In [229]: y_multi
Out[229]:
array([[1, 0],
       [0, 2],
       [2, 0],
       [3, 1],
       [0, 2],
       [3, 0]], dtype=int64)
And the meaning of the values in each column can be inferred from the following arrays:
In [230]: le.fit(y2[:, 0]).classes_
Out[230]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S18')

In [231]: le.fit(y2[:, 1]).classes_
Out[231]: array(['Computer Vision', 'Other Application', 'Speech Recognition'], dtype='|S18')
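One caveat with the one-liner above: it refits the same LabelEncoder for each column, so at the end le only remembers the mapping of the last column. A sketch of a variant that keeps one encoder per output column, so each mapping stays available for decoding predictions later (the variable names here are just illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# one encoder per output column, kept around for later inverse_transform
encoders = [LabelEncoder().fit(y2[:, i]) for i in range(y2.shape[1])]
y_multi = np.vstack([enc.transform(y2[:, i])
                     for i, enc in enumerate(encoders)]).T

# decode one row of codes, e.g. [1, 0] -> ['Java', 'Computer Vision']
decoded = [enc.inverse_transform([code])[0]
           for enc, code in zip(encoders, y_multi[0])]
print(decoded)
```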