The following code:
    from sklearn.preprocessing import LabelBinarizer

    lb = LabelBinarizer()
    lb.fit_transform(['yes', 'no', 'no', 'yes'])

returns:

    array([[1],
           [0],
           [0],
           [1]])
However, I would like there to be one column for each class:
    array([[1, 0],
           [0, 1],
           [0, 1],
           [1, 0]])
(I need the data in this format so that I can pass it to a neural network that uses the softmax function at the output layer.)
If there are more than 2 classes, LabelBinarizer behaves as desired:
    from sklearn.preprocessing import LabelBinarizer

    lb = LabelBinarizer()
    lb.fit_transform(['yes', 'no', 'no', 'yes', 'maybe'])

returns:

    array([[0, 0, 1],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]])

Above: one column for each class.
Is there an easy way to achieve the same (1 column per class) when there are 2 classes?
Edit: Based on yangjie's answer, I wrote a class that wraps LabelBinarizer to get the behavior described above: http://pastebin.com/UEL2dP62
    import numpy as np
    from sklearn.preprocessing import LabelBinarizer


    class LabelBinarizer2:
        """Wrap LabelBinarizer so that two-class input also yields two columns."""

        def __init__(self):
            self.lb = LabelBinarizer()

        def fit(self, X):
            # Convert X to array
            X = np.array(X)
            # Fit X using the LabelBinarizer object
            self.lb.fit(X)
            # Save the classes
            self.classes_ = self.lb.classes_
            # Return self so that calls can be chained, as scikit-learn estimators do
            return self

        def fit_transform(self, X):
            # Convert X to array
            X = np.array(X)
            # Fit + transform X using the LabelBinarizer object
            Xlb = self.lb.fit_transform(X)
            # Save the classes
            self.classes_ = self.lb.classes_
            if len(self.classes_) == 2:
                # Binary case: LabelBinarizer returns a single column (1 for classes_[1]).
                # Append its complement, so column 0 is classes_[1] and column 1 is classes_[0].
                Xlb = np.hstack((Xlb, 1 - Xlb))
            return Xlb

        def transform(self, X):
            # Convert X to array
            X = np.array(X)
            # Transform X using the LabelBinarizer object
            Xlb = self.lb.transform(X)
            if len(self.classes_) == 2:
                # Same two-column expansion as in fit_transform
                Xlb = np.hstack((Xlb, 1 - Xlb))
            return Xlb

        def inverse_transform(self, Xlb):
            # Convert Xlb to array
            Xlb = np.array(Xlb)
            if len(self.classes_) == 2:
                # The first column is the original LabelBinarizer output
                X = self.lb.inverse_transform(Xlb[:, 0])
            else:
                X = self.lb.inverse_transform(Xlb)
            return X
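For comparison, here is a minimal NumPy-only sketch of the same target format (one column per class, regardless of how many classes there are). The helper name `one_hot` is made up for illustration and is not part of scikit-learn. Note that its columns follow the sorted class order, so for the two-class example `'no'` comes first, which is the reverse of the column order produced by the wrapper above.

```python
import numpy as np


def one_hot(labels):
    """Return (classes, matrix) with one column per class, even for two classes.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    labels = np.asarray(labels)
    # np.unique gives the sorted classes and, for each label, its index among them
    classes, idx = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i
    return classes, np.eye(len(classes), dtype=int)[idx]


classes, Y = one_hot(['yes', 'no', 'no', 'yes'])
print(classes)  # sorted classes: 'no' first, then 'yes'
print(Y)        # shape (4, 2): one column per class

# The labels can be recovered from the position of the 1 in each row
print(classes[Y.argmax(axis=1)])
```

With three or more classes this gives the same layout as the LabelBinarizer output shown earlier, since both order the columns by sorted class label.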
Edit 2: It turns out that yangjie also wrote a new version of LabelBinarizer. Amazing!