Python pandas: merge loses categorical columns

Question

I work with large DataFrames of categorical data, and I found that when I use pandas.merge on two DataFrames, any categorical columns are automatically upcast to a larger dtype. (This can significantly increase RAM consumption.) A simple example to illustrate:

EDIT: made a more suitable example

    import pandas
    import numpy

    df1 = pandas.DataFrame(
        {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
         'value1': pandas.Categorical(numpy.random.randint(0, 2, 20))})
    df2 = pandas.DataFrame(
        {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1],
         'value2': pandas.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

    result = pandas.merge(df1, df2, on="ID")
    result.dtypes

    Out []:
    ID         int32
    value1     int64
    value2    object
    dtype: object

I would like value1 and value2 to remain categorical in the resulting DataFrame. The conversion of string labels to object dtype can be especially expensive.
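To give a rough sense of the memory at stake, here is a minimal sketch that compares the footprint of the same string labels stored as object dtype and as category dtype. The data is hypothetical (100,000 labels drawn from three strings), not the frames above, and the actual ratio will depend on the number of rows and of distinct labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 labels drawn from three distinct strings.
rng = np.random.default_rng(0)
labels = rng.choice(["a", "b", "c"], size=100_000)

as_object = pd.Series(labels, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects behind an object column.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:   {:,} bytes".format(object_bytes))
print("category: {:,} bytes".format(category_bytes))
```

The categorical column stores one small integer code per row plus a single copy of each label, which is why losing the dtype in a merge can matter for large frames.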

Judging from https://github.com/pydata/pandas/issues/8938, this may be the intended behavior. Is there any way to avoid it?

python merge join pandas categorical-data

epicurus Mar 26 '15 at 13:59

3 answers

Jeff · Answer 1 · 2015-03-26T22:30:03+0000

I may be missing the goal, but the intent is that the user converts to category (or not) as needed. I think that in this particular case it could be done automatically. Honestly, though, the categorical conversion would have to happen at the end anyway, so doing it inside the merge would not really save you anything.

    In [57]: result = pandas.merge(df1, df2, on="ID")

    In [58]: result['value1'] = result['value1'].astype('category')

    In [59]: result['value2'] = result['value2'].astype('category')

    In [60]: result
    Out[60]:
        ID value1 value2
    0    5      0      c
    1    5      1      c
    2    3      0      a
    3    3      1      a
    4    6      0      c
    5    6      0      c
    6    7      0      a
    7    7      1      a
    8    0      1      c
    9    0      1      c
    10   4      1      b
    11   4      1      b
    12   8      0      b
    13   8      1      b
    14   2      1      a
    15   2      1      a
    16   9      0      a
    17   9      1      a
    18   1      0      b
    19   1      1      b

    In [61]: result.dtypes
    Out[61]:
    ID           int64
    value1    category
    value2    category
    dtype: object

    In [62]: result.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 3 columns):
    ID        20 non-null int64
    value1    20 non-null category
    value2    20 non-null category
    dtypes: category(2), int64(1)
    memory usage: 400.0 bytes

unutbu · Answer 2 · 2015-03-26T14:24:02+0000

As a workaround, you can convert the categorical columns to their integer codes and keep the mapping from columns to categories in a dict. For instance,

    def decat(df):
        """
        Convert categorical columns to (integer) codes; return the categories
        in catmap
        """
        catmap = dict()
        for col, dtype in df.dtypes.items():
            if isinstance(dtype, pd.CategoricalDtype):
                c = df[col].cat
                catmap[col] = c.categories
                df[col] = c.codes  # note: this modifies df in place
        return df, catmap

    In [304]: df
    Out[304]:
       ID value2
    0   5      c
    1   3      a
    2   6      c
    3   7      a
    4   0      c
    5   4      b
    6   8      b
    7   2      a
    8   9      a
    9   1      b

    In [305]: df, catmap = decat(df)

    In [306]: df
    Out[306]:
       ID  value2
    0   5       2
    1   3       0
    2   6       2
    3   7       0
    4   0       2
    5   4       1
    6   8       1
    7   2       0
    8   9       0
    9   1       1

    In [307]: catmap
    Out[307]: {'value2': Index([u'a', u'b', u'c'], dtype='object')}

Now you can merge as usual, since merging integer-valued columns is not a problem.

Afterwards, you can reconstitute the categorical columns using the data in catmap:

    def recat(df, catmap):
        """
        Use catmap to reconstitute columns in df to categorical dtype
        """
        for col, categories in catmap.items():
            # Rebuild from the stored codes so that categories that do not
            # occur in the merged data are kept as well
            df[col] = pd.Categorical.from_codes(df[col], categories=categories)
        return df

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(
        {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
                        dtype='int32'),
         'value1': pd.Categorical(np.random.randint(0, 2, 20))})
    df2 = pd.DataFrame(
        {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1], dtype='int32'),
         'value2': pd.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

    def decat(df):
        """
        Convert categorical columns to (integer) codes; return the categories
        in catmap
        """
        catmap = dict()
        for col, dtype in df.dtypes.items():
            if isinstance(dtype, pd.CategoricalDtype):
                c = df[col].cat
                catmap[col] = c.categories
                df[col] = c.codes  # note: this modifies df in place
        return df, catmap

    def recat(df, catmap):
        """
        Use catmap to reconstitute columns in df to categorical dtype
        """
        for col, categories in catmap.items():
            # Rebuild from the stored codes so that categories that do not
            # occur in the merged data are kept as well
            df[col] = pd.Categorical.from_codes(df[col], categories=categories)
        return df

    def mergecat(left, right, *args, **kwargs):
        # Merge on integer codes, then restore the categorical columns
        left, left_catmap = decat(left)
        right, right_catmap = decat(right)
        left_catmap.update(right_catmap)
        result = pd.merge(left, right, *args, **kwargs)
        return recat(result, left_catmap)

    result = mergecat(df1, df2, on='ID')
    result.info()

gives

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 3 columns):
    ID        20 non-null int32
    value1    20 non-null category
    value2    20 non-null category
    dtypes: category(2), int32(1)
    memory usage: 320.0 bytes
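The decat/recat idea above can also be sketched so that it leaves the input frames untouched and tolerates joins that introduce missing rows. This is only a sketch under stated assumptions, not the answer's own code: the `skip` parameter, the column-clash check, and the shortened `df2` are illustrative additions. It assumes `on` is a column name or a list of names, that non-key column names do not overlap between the two frames, and that the key columns are left as they are, since codes from two different categoricals are not comparable.

```python
import numpy as np
import pandas as pd


def decat(df, skip=()):
    """Copy df, replacing categorical columns (except those in skip) by their
    integer codes. Returns the copy and a dict of column -> CategoricalDtype."""
    out = df.copy()
    catmap = {}
    for col, dtype in df.dtypes.items():
        if col not in skip and isinstance(dtype, pd.CategoricalDtype):
            catmap[col] = dtype
            out[col] = df[col].cat.codes
    return out, catmap


def recat(df, catmap):
    """Rebuild the categorical columns of df from their integer codes."""
    for col, dtype in catmap.items():
        # Unmatched rows arrive as NaN; -1 is the code for a missing value.
        codes = df[col].fillna(-1).astype("int64").to_numpy()
        df[col] = pd.Categorical.from_codes(codes, dtype=dtype)
    return df


def mergecat(left, right, on, **kwargs):
    keys = [on] if isinstance(on, str) else list(on)
    clash = (set(left.columns) & set(right.columns)) - set(keys)
    if clash:
        raise ValueError("non-key columns in both frames: %s" % sorted(clash))
    left, left_map = decat(left, skip=keys)
    right, right_map = decat(right, skip=keys)
    result = pd.merge(left, right, on=on, **kwargs)
    return recat(result, {**left_map, **right_map})


rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    {"ID": [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
     "value1": pd.Categorical(rng.integers(0, 2, 20))})
# Shortened on purpose: IDs 9 and 1 are absent, so a left join yields NaN.
df2 = pd.DataFrame(
    {"ID": [5, 3, 6, 7, 0, 4, 8, 2],
     "value2": pd.Categorical(["c", "a", "c", "a", "c", "b", "b", "a"])})

result = mergecat(df1, df2, on="ID", how="left")
print(result.dtypes)
```

Storing the whole CategoricalDtype rather than only the categories also carries the `ordered` flag through the merge.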

Mike claffey · Answer 3 · 2016-02-06T20:57:32+0000

Here is a code snippet for restoring the category metadata:

    import pandas

    def copy_category_metadata(df_with_categories, df_without_categories):
        for col_name, dtype in df_with_categories.dtypes.items():
            if str(dtype) == "category":
                if col_name in df_without_categories.columns:
                    if str(df_without_categories[col_name].dtype) == "category":
                        print("{} - Already a category".format(col_name))
                    else:
                        print("{} - Making a category".format(col_name))
                        # make the column into a Categorical using the other dataframe metadata
                        df_without_categories[col_name] = pandas.Categorical(
                            df_without_categories[col_name],
                            categories=df_with_categories[col_name].cat.categories,
                            ordered=df_with_categories[col_name].cat.ordered)

Usage example:

    dfA  # some data frame with categories
    dfB  # another data frame
    df_merged = dfA.merge(dfB)  # merge result, no categories
    copy_category_metadata(dfA, df_merged)
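A shorter variant of the same idea, offered as a sketch rather than a replacement: casting with the source column's dtype object restores categories and ordering in one step. The helper name `restore_categories` and the example frames are hypothetical, and the loss of the categorical dtype is simulated here with an explicit cast to object.

```python
import pandas as pd


def restore_categories(source, target):
    """Cast columns of target to the categorical dtype they have in source."""
    for col in target.columns.intersection(source.columns):
        dtype = source[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            # Values that are not among the categories would become NaN.
            target[col] = target[col].astype(dtype)
    return target


dfA = pd.DataFrame({
    "ID": [1, 2, 3],
    "grade": pd.Categorical(["s", "l", "m"],
                            categories=["s", "m", "l"], ordered=True),
})
dfB = pd.DataFrame({"ID": [3, 1, 2], "price": [7.5, 2.0, 4.0]})

# Simulate a merge result whose categorical column was turned into object.
df_merged = dfA.merge(dfB, on="ID").astype({"grade": object})
restore_categories(dfA, df_merged)
print(df_merged.dtypes)
```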
