The most elegant way to get back from pandas.get_dummies

From a data frame with numerical and nominal data:

    >>> import pandas as pd
    >>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
    ...      'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
    ...      'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
    >>> df = pd.DataFrame.from_dict(d)
    >>> df
       Budget   m   qj
    0      39  M1  q23
    1      15  M2   q4
    2      13  M7   q9
    3      53  M1  q23
    4      82  M2  q23
    5      70  M1   q9

get_dummies converts a categorical variable into dummy / indicator variables:

    >>> df_dummies = pd.get_dummies(df)
    >>> df_dummies
       Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
    0      39     1     0     0       1      0      0
    1      15     0     1     0       0      1      0
    2      13     0     0     1       0      0      1
    3      53     1     0     0       1      0      0
    4      82     0     1     0       1      0      0
    5      70     1     0     0       0      0      1

What is the most elegant back_from_dummies function to get from df_dummies back to df?

    >>> (back_from_dummies(df_dummies) == df).all()
    Budget    True
    m         True
    qj        True
    dtype: bool
3 answers

idxmax will make this pretty easy.
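To see the core idea in isolation, here is a minimal sketch (the column names and values are copied from the question's df_dummies): idxmax(axis=1) returns, for each row, the label of the column with the largest value, which for 0/1 dummies is exactly the column holding the 1.

    import pandas as pd

    # One block of dummy columns, values copied from the question's df_dummies
    dummy_block = pd.DataFrame({'m_M1': [1, 0, 0, 1, 0, 1],
                                'm_M2': [0, 1, 0, 0, 1, 0],
                                'm_M7': [0, 0, 1, 0, 0, 0]})

    # For each row, idxmax returns the label of the column holding the 1
    print(dummy_block.idxmax(axis=1).tolist())
    # ['m_M1', 'm_M2', 'm_M7', 'm_M1', 'm_M2', 'm_M1']

Building on that, the full helper: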

    from itertools import groupby

    import pandas as pd


    def back_from_dummies(df):
        result_series = {}

        # Find dummy columns and build (category, dummy_column) pairs
        dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]

        # Find non-dummy columns (those without a "_")
        non_dummy_cols = [col for col in df.columns if "_" not in col]

        # For each group of dummy columns, use idxmax to recover the value
        for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):
            # Select the columns belonging to this category
            dummy_df = df[[col[1] for col in cols]]
            # Find the column holding the max (the 1) in each row
            max_columns = dummy_df.idxmax(axis=1)
            # Strip the "category_" prefix, leaving only the value
            result_series[dummy] = max_columns.apply(lambda item: item.split("_")[1])

        # Copy non-dummy columns over
        for col in non_dummy_cols:
            result_series[col] = df[col]

        # Return a DataFrame built from the resulting series
        return pd.DataFrame(result_series)

    (back_from_dummies(df_dummies) == df).all()
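One design note on this helper: itertools.groupby only groups consecutive items with equal keys, so it relies on the dummy columns of each original column sitting next to each other, which pd.get_dummies guarantees. A small sketch (with a made-up column order, not real get_dummies output) of what would happen otherwise:

    from itertools import groupby

    # Prefixes deliberately not adjacent, unlike real get_dummies output
    cols = ['m_M1', 'qj_q23', 'm_M2']
    pairs = [(c.split('_')[0], c) for c in cols]

    print([key for key, _ in groupby(pairs, lambda item: item[0])])
    # ['m', 'qj', 'm'] -- 'm' appears twice, so each 'm' group is incomplete;
    # sorting the pairs by prefix first would avoid this.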

Another approach is to first split the columns:

    In [11]: from collections import defaultdict
             import numpy as np  # needed below for np.argmax

             pos = defaultdict(list)
             vals = defaultdict(list)

    In [12]: for i, c in enumerate(df_dummies.columns):
                 if "_" in c:
                     k, v = c.split("_", 1)
                     pos[k].append(i)
                     vals[k].append(v)
                 else:
                     pos["_"].append(i)

    In [13]: pos
    Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

    In [14]: vals
    Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This lets us slice out a sub-frame for each original (dummied) column:

    In [15]: df_dummies.iloc[:, pos["m"]]
    Out[15]:
       m_M1  m_M2  m_M7
    0     1     0     0
    1     0     1     0
    2     0     0     1
    3     1     0     0
    4     0     1     0
    5     1     0     0

Now we can use numpy argmax:

    In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
    Out[16]: array([0, 1, 2, 0, 1, 0])

Note: pandas' idxmax returns the column label; here we want the integer position instead, so that we can build a Categorical from codes.
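A quick sketch of the difference on a tiny 0/1 block (made up here just for illustration): idxmax gives labels, np.argmax gives integer positions, and Categorical.from_codes needs the positions.

    import numpy as np
    import pandas as pd

    block = pd.DataFrame({'m_M1': [1, 0], 'm_M2': [0, 1]})

    print(block.idxmax(axis=1).tolist())    # ['m_M1', 'm_M2']  (labels)
    print(np.argmax(block.values, axis=1))  # [0 1]             (integer positions)

With the positions in hand, from_codes maps them back to the category values: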

    In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
    Out[17]:
    [M1, M2, M7, M1, M2, M1]
    Categories (3, object): [M1, M2, M7]

Now we can put it all together:

    In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                                                             vals[k])
                                for k in vals})

    In [22]: df
    Out[22]:
        m   qj
    0  M1  q23
    1  M2   q4
    2  M7   q9
    3  M1  q23
    4  M2  q23
    5  M1   q9

and add back the non-dummy columns:

    In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

    In [24]: df
    Out[24]:
        m   qj  Budget
    0  M1  q23      39
    1  M2   q4      15
    2  M7   q9      13
    3  M1  q23      53
    4  M2  q23      82
    5  M1   q9      70

As a function:

    import numpy as np
    import pandas as pd
    from collections import defaultdict


    def reverse_dummy(df_dummies):
        pos = defaultdict(list)
        vals = defaultdict(list)

        # Record the position of each column and, for dummy columns,
        # the original column name and value encoded in the header.
        for i, c in enumerate(df_dummies.columns):
            if "_" in c:
                k, v = c.split("_", 1)
                pos[k].append(i)
                vals[k].append(v)
            else:
                pos["_"].append(i)

        # Rebuild each original column as a Categorical from the argmax codes.
        df = pd.DataFrame({k: pd.Categorical.from_codes(
                                  np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                                  vals[k])
                           for k in vals})

        # Copy over the columns that were never dummified.
        df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
        return df

    In [31]: reverse_dummy(df_dummies)
    Out[31]:
        m   qj  Budget
    0  M1  q23      39
    1  M2   q4      15
    2  M7   q9      13
    3  M1  q23      53
    4  M2  q23      82
    5  M1   q9      70
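A quick round-trip check, assuming df and df_dummies from the question are still in scope (a sketch, not part of the original answer). The columns have to be reordered before comparing, since reverse_dummy does not preserve the original column order; the rebuilt m and qj come back as Categorical rather than object, but == still compares by value:

    # Reorder the reconstructed columns to match df before comparing
    print((reverse_dummy(df_dummies)[df.columns] == df).all().all())  # True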

Like @David, I found that idxmax will do most of the work for you. However, I don't think there is a fully reliable way to guarantee a problem-free conversion back, since in some cases it can be hard to tell which columns are dummies and which are not. This is made much easier by using a separator that is unlikely to occur in your data by accident. A single _ often appears in multi-word column names, so I use __ (a double underscore) as the separator; I have never come across that in a column name in the wild.
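For example, a made-up frame (not from the question) showing how a plain _ separator becomes ambiguous:

    import pandas as pd

    df = pd.DataFrame({'unit_price': [39, 15], 'm': ['M1', 'M2']})
    print(pd.get_dummies(df).columns.tolist())
    # ['unit_price', 'm_M1', 'm_M2']
    # With '_' as the separator, 'unit_price' looks like a dummy for a column
    # called 'unit'; with prefix_sep='__' the dummy columns are unambiguous.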

Also note that pd.get_dummies moves all the dummy columns to the end, so the original column order cannot be recovered.

Here is an example of my approach. Dummy columns are recognized as those containing sep. Groups of dummy columns are picked out with df.filter, which lets us match column names against a regular expression (matching on the part of the name before sep; there are other ways to do this step).

The rename step strips the beginning of each column name (for example m__), so what remains is the value. idxmax then retrieves, for each row, the name of the column that holds the 1. That undoes pd.get_dummies for one of the original columns; we then concatenate the recovered columns together with other_cols, the columns that were never turned into dummies. A single-column version of this pipeline is sketched below, before the full session.
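Here is that isolated sketch. It rebuilds the same small frame so it runs on its own, and the regex is anchored to the m__ prefix, which is slightly stricter than the loop in the full session below:

    import pandas as pd

    df = pd.DataFrame({'m': ['M1', 'M2', 'M7', 'M1', 'M2', 'M1'],
                       'qj': ['q23', 'q4', 'q9', 'q23', 'q23', 'q9'],
                       'Budget': [39, 15, 13, 53, 82, 70]})
    sep = '__'
    dummies = pd.get_dummies(df, prefix_sep=sep)

    m_back = (dummies.filter(regex='^m' + sep)                          # m__M1, m__M2, m__M7
                     .rename(columns=lambda name: name.split(sep)[1])   # -> M1, M2, M7
                     .idxmax(axis=1))                                    # label of the 1 in each row

    print(m_back.tolist())
    # ['M1', 'M2', 'M7', 'M1', 'M2', 'M1']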

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame.from_dict({'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
       ...:                              'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
       ...:                              'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}})

    In [3]: df
    Out[3]:
       Budget   m   qj
    0      39  M1  q23
    1      15  M2   q4
    2      13  M7   q9
    3      53  M1  q23
    4      82  M2  q23
    5      70  M1   q9

    In [4]: sep = '__'

    In [5]: dummies = pd.get_dummies(df, prefix_sep=sep)

    In [6]: dummies
    Out[6]:
       Budget  m__M1  m__M2  m__M7  qj__q23  qj__q4  qj__q9
    0      39      1      0      0        1       0       0
    1      15      0      1      0        0       1       0
    2      13      0      0      1        0       0       1
    3      53      1      0      0        1       0       0
    4      82      0      1      0        1       0       0
    5      70      1      0      0        0       0       1

    In [7]: dfs = []
       ...:
       ...: dummy_cols = list(set(col.split(sep)[0] for col in dummies.columns if sep in col))
       ...: other_cols = [col for col in dummies.columns if sep not in col]
       ...:
       ...: for col in dummy_cols:
       ...:     dfs.append(dummies.filter(regex=col).rename(columns=lambda name: name.split(sep)[1]).idxmax(axis=1))
       ...:
       ...: df = pd.concat(dfs + [dummies[other_cols]], axis=1)
       ...: df.columns = dummy_cols + other_cols
       ...: df
    Out[7]:
        qj   m  Budget
    0  q23  M1      39
    1   q4  M2      15
    2   q9  M7      13
    3  q23  M1      53
    4  q23  M2      82
    5   q9  M1      70
