How to use OneHotEncoder output in sklearn?

Question

I have a Pandas DataFrame with 2 categorical variables, an ID variable and a target variable (for classification). I was able to convert the categorical values using OneHotEncoder. This results in a sparse matrix.

    from sklearn.preprocessing import OneHotEncoder

    ohe = OneHotEncoder()
    # First I remapped the string values in the categorical variables to
    # integers, as OneHotEncoder needs integers as input
    # ... remapping code ...
    ohe.fit(df[['col_a', 'col_b']])
    ohe.transform(df[['col_a', 'col_b']])

But I have no idea how to use this sparse matrix with DecisionTreeClassifier, especially when I want to add some other non-categorical variables to my DataFrame later. Thanks!

EDIT: In response to miraculixx's comment, I also tried DataFrameMapper from sklearn-pandas:

    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        ('id_col', None),
        ('target_col', None),
        (['col_a'], OneHotEncoder()),
        (['col_b'], OneHotEncoder())
    ])
    t = mapper.fit_transform(df)

But then I get this error:

    TypeError: there is no supported conversion for types: (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64'))

+6

python pandas scikit-learn classification

Bert carremans Jul 21 '16 at 21:28

2 answers

Take a look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py

The problem is that you are not passing the sparse matrix to xx.fit(); you are passing the raw data.
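As a rough sketch of what that could look like: the sparse output of OneHotEncoder can be given to DecisionTreeClassifier.fit() directly, and extra numeric columns can be stacked next to it with scipy.sparse.hstack. The toy values and the `num_col` column are made up for illustration; only `col_a`, `col_b` and `target_col` come from the question. It also assumes a scikit-learn version whose OneHotEncoder accepts string categories (0.20 or later), so no integer remapping is needed.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; only the column names col_a, col_b and target_col
# come from the question, the values and num_col are invented.
df = pd.DataFrame({
    "col_a": ["x", "y", "x", "z"],
    "col_b": ["u", "u", "v", "v"],
    "num_col": [1.0, 2.5, 0.3, 4.2],  # stand-in for a non-categorical feature
    "target_col": [0, 1, 0, 1],
})

# One-hot encode the categorical columns; the result is a scipy sparse matrix.
ohe = OneHotEncoder(handle_unknown="ignore")
X_cat = ohe.fit_transform(df[["col_a", "col_b"]])

# Put the numeric column next to the one-hot columns without densifying.
X_num = sparse.csr_matrix(df[["num_col"]].to_numpy())
X = sparse.hstack([X_cat, X_num], format="csr")

# The tree is fitted on the encoded matrix, not on the raw DataFrame.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df["target_col"])
pred = clf.predict(X)
```

The ID column is deliberately left out of `X`, since an identifier is not usually a meaningful feature. At prediction time, new data would go through `ohe.transform(...)` (not `fit_transform`) before being stacked the same way.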

+1

Merlin Jul 22 '16 at 6:46

Guiem bosch · Accepted Answer · 2016-07-22T06:16:37+0000

I see that you are already using Pandas, so why not use its get_dummies function?

    import pandas as pd

    df = pd.DataFrame([['rick', 'young'], ['phil', 'old'], ['john', 'teenager']],
                      columns=['name', 'age-group'])

result

       name age-group
    0  rick     young
    1  phil       old
    2  john  teenager

Now encode it using get_dummies:

 pd.get_dummies(df)

result

       name_john  name_phil  name_rick  age-group_old  age-group_teenager  \
    0          0          0          1              0                   0
    1          0          1          0              1                   0
    2          1          0          0              0                   1

       age-group_young
    0                1
    1                0
    2                0

And you can actually use the new Pandas DataFrame in your Sklearn DecisionTreeClassifier.
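For example, a minimal sketch of that last step might look like the following. The labels in `y` are hypothetical, since the example DataFrame above has no target column, and `dtype=int` is passed only so the dummy columns are 0/1 integers as in the output shown (newer pandas versions default to booleans).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame(
    [["rick", "young"], ["phil", "old"], ["john", "teenager"]],
    columns=["name", "age-group"],
)
y = [0, 1, 0]  # hypothetical class labels, one per row

# One indicator column per category value, as 0/1 integers.
X = pd.get_dummies(df, dtype=int)

# The dummy-encoded DataFrame can be passed to the classifier as-is.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```

One thing to keep in mind with this approach: get_dummies derives its columns from whatever values are present, so data seen later needs to be reindexed to the same columns as the training frame before calling `predict`.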
