What is the difference between sklearn LabelEncoder and pd.get_dummies?

I wanted to know the difference between sklearn's LabelEncoder and pandas' get_dummies. Why choose LabelEncoder over get_dummies? What is the advantage of using one over the other? The disadvantages?

As far as I understand, if I have a class

ClassA = ["Apple", "Ball", "Cat"] encoder = [1, 2, 3] 

and

 dummy = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] 

Am I getting this wrong?

2 answers

These are just convenience functions that fit naturally into the way these two libraries, respectively, tend to do things. The first "condenses" the information by replacing the categories with integers, and the second "expands" the dimensions, allowing (possibly) more convenient access.


sklearn.preprocessing.LabelEncoder simply maps the data from whatever domain it comes from onto the integers 0, ..., k - 1, where k is the number of classes.

So for example

 ["paris", "paris", "tokyo", "amsterdam"] 

can be

 [0, 0, 1, 2] 
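For concreteness, here is a minimal sketch of that example with scikit-learn (variable names are illustrative; note that LabelEncoder sorts its classes, so the exact integers can differ from the illustration above):

 from sklearn.preprocessing import LabelEncoder

 cities = ["paris", "paris", "tokyo", "amsterdam"]
 le = LabelEncoder()
 codes = le.fit_transform(cities)    # array([1, 1, 2, 0]); classes are sorted alphabetically
 print(le.classes_)                  # ['amsterdam' 'paris' 'tokyo']
 print(le.inverse_transform(codes))  # recovers the original strings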

pandas.get_dummies also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the distinct values in the Series, and whose entries are 0 or 1 depending on what the original value was. So, for example, the same

 ["paris", "paris", "tokyo", "amsterdam"] 

will become a DataFrame with columns

 ["amsterdam", "paris", "tokyo"] 

and whose "paris" column would be the Series

 [1, 1, 0, 0] 
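A corresponding sketch with pandas (the dtype=int argument is only there to force an explicit 0/1 output; recent pandas versions otherwise return booleans):

 import pandas as pd

 cities = pd.Series(["paris", "paris", "tokyo", "amsterdam"])
 dummies = pd.get_dummies(cities, dtype=int)
 print(dummies)
 #    amsterdam  paris  tokyo
 # 0          0      1      0
 # 1          0      1      0
 # 2          0      0      1
 # 3          1      0      0
 print(dummies["paris"].tolist())    # [1, 1, 0, 0]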

The main advantage of the first method is that it saves space. Conversely, encoding things as integers can give the impression (to you or to some machine learning algorithm) that the ordering means something. Is amsterdam closer to tokyo than to paris just because of the integer encoding? Probably not. The one-hot representation is a little clearer on that point.


pandas.get_dummies is one-hot encoding, whereas sklearn.preprocessing.LabelEncoder is integer encoding: 0, 1, 2, 3, 4, ...

One-hot encoding is usually more suitable for machine learning, because the resulting labels are independent of each other; for example, the code 2 does not mean twice as much as the code 1.

If the training set and test set have a different number of categories for the same feature, refer to "Store the same dummy variable in the training and testing data" for two solutions.
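One common way to handle that (a hedged sketch, not taken from the linked question; the data is made up) is to reindex the test-set dummies against the training-set columns:

 import pandas as pd

 train = pd.Series(["paris", "tokyo", "amsterdam"])
 test = pd.Series(["paris", "berlin"])   # "berlin" never appears in training

 train_dummies = pd.get_dummies(train, dtype=int)
 test_dummies = pd.get_dummies(test, dtype=int)

 # Missing training columns are filled with 0; unseen categories ("berlin") are dropped.
 test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)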

