Pandas: get_dummies versus categorical

Question

I have a dataset that contains several columns with categorical data.

I use the Categorical function to replace categorical values with numeric ones.

data[column] = pd.Categorical(data[column]).codes

I recently came across the pandas.get_dummies function. Are they interchangeable? Is there any advantage to using one over the other?

python pandas categorical-data dummy-data

sapo_cosmico Mar 23 '15 at 10:50

1 answer

Alexander · Answer 1 · 2015-03-23T23:41:09+0000

Why are you converting the categorical data to integers? If the goal is to save memory, I don't believe it does:

    df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
    df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6 entries, 0 to 5
    Data columns (total 1 columns):
    cat    6 non-null category
    dtypes: category(1)
    memory usage: 78.0 bytes

    >>> df2.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6 entries, 0 to 5
    Data columns (total 1 columns):
    cat    6 non-null int64
    dtypes: int64(1)
    memory usage: 96.0 bytes
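
The difference is easier to see on a larger column. Here is a minimal sketch, assuming a made-up column of 100,000 labels drawn from three values (the size, seed, and variable names are illustrative and not from the question). It compares the same data stored as plain Python strings, as a category, and as int64 codes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 labels drawn from three distinct values.
rng = np.random.default_rng(0)
labels = rng.choice(['a', 'b', 'c'], size=100_000)

as_object = pd.Series(labels, dtype=object)
as_category = as_object.astype('category')
as_int64 = pd.Series(as_category.cat.codes, dtype='int64')

# deep=True also counts the Python string objects behind an object column.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
int64_bytes = as_int64.memory_usage(index=False, deep=True)

print(as_category.cat.codes.dtype)  # the codes use a small integer dtype
print(object_bytes, category_bytes, int64_bytes)
```

With so few distinct values the category column stores its codes in a narrow integer type plus one small lookup table, so copying the codes into an int64 column should make things larger, not smaller.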

Categorical codes are integers, one per unique value in the column, so the result is still a single column. In contrast, get_dummies returns a new column for each unique value, and the entry in each of those columns indicates whether the row has that value.

    >>> pd.get_dummies(df)
    Out[30]:
       cat_a  cat_b  cat_c
    0      1      0      0
    1      1      0      0
    2      1      0      0
    3      0      1      0
    4      0      1      0
    5      0      0      1
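
To put the two encodings side by side, here is a small sketch on the same six-row frame. The `dtype=int` argument is my own choice so the indicators come out as 0/1; recent pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})

# One integer column: each label is replaced by its position in the categories.
codes = df['cat'].cat.codes

# One indicator column per label; dtype=int asks for 0/1 rather than booleans.
dummies = pd.get_dummies(df, dtype=int)

print(codes.tolist())
print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # each row has exactly one indicator set
```

So the two are not interchangeable. Integer codes keep the data in one column but imply an ordering (c > b > a) that a model may treat as numeric, while indicator columns avoid that at the cost of one column per distinct value.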

To get the codes directly, you can use:

    df['codes'] = df['cat'].cat.codes
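
The codes only mean something together with the list of categories they index into. Here is a minimal sketch of the round trip on the same example frame; the `categories` and `restored` names are mine.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df['codes'] = df['cat'].cat.codes

# categories is the lookup table: code i corresponds to categories[i].
categories = df['cat'].cat.categories

# Rebuild the labels from the codes alone.
restored = pd.Categorical.from_codes(df['codes'], categories=categories)
print(list(restored))
```

If you overwrite the original column with its codes, as in the question, keep the categories somewhere, or you will not be able to map the integers back to labels.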
