Python pandas: merge loses categorical columns

I work with large DataFrames of categorical data, and I found that when I use pandas.merge on two DataFrames, any categorical columns are automatically upcast to a larger data type. (This can significantly increase RAM consumption.) A simple example to illustrate:

EDIT: made a more suitable example

    import pandas
    import numpy

    df1 = pandas.DataFrame(
        {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
         'value1': pandas.Categorical(numpy.random.randint(0, 2, 20))})
    df2 = pandas.DataFrame(
        {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1],
         'value2': pandas.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

    result = pandas.merge(df1, df2, on="ID")
    result.dtypes

    Out []:
    ID         int32
    value1     int64
    value2    object
    dtype: object

I would like value1 and value2 to remain categorical in the resulting DataFrame. Converting string labels to object dtype can be especially expensive.
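The memory cost mentioned above can be checked roughly as follows. This is only a sketch: the column size, the labels, and the variable names are made up for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

# Illustrative data: 100,000 string labels drawn from three values.
rng = np.random.default_rng(0)
labels = rng.choice(['a', 'b', 'c'], size=100_000)

as_object = pd.Series(labels, dtype=object)    # one Python string per row
as_category = as_object.astype('category')     # small integer codes plus 3 categories

# deep=True also counts the memory held by the string objects themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
```

The categorical column stores one small integer code per row plus a single copy of each label, which is why losing the dtype in a merge can matter for large frames.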

Judging from https://github.com/pydata/pandas/issues/8938, this may be intended behavior. Is there any way to avoid it?

3 answers

I may be missing your goal, but the intention is for the user to convert to category (or not) as needed. I think that in this specific case it could be done automatically. Honestly, the categorical conversions would be done at the end anyway, so doing them inside the merge would not really save you anything.

    In [57]: result = pandas.merge(df1, df2, on="ID")

    In [58]: result['value1'] = result['value1'].astype('category')

    In [59]: result['value2'] = result['value2'].astype('category')

    In [60]: result
    Out[60]:
        ID value1 value2
    0    5      0      c
    1    5      1      c
    2    3      0      a
    3    3      1      a
    4    6      0      c
    5    6      0      c
    6    7      0      a
    7    7      1      a
    8    0      1      c
    9    0      1      c
    10   4      1      b
    11   4      1      b
    12   8      0      b
    13   8      1      b
    14   2      1      a
    15   2      1      a
    16   9      0      a
    17   9      1      a
    18   1      0      b
    19   1      1      b

    In [61]: result.dtypes
    Out[61]:
    ID           int64
    value1    category
    value2    category
    dtype: object

    In [62]: result.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 3 columns):
    ID        20 non-null int64
    value1    20 non-null category
    value2    20 non-null category
    dtypes: category(2), int64(1)
    memory usage: 400.0 bytes
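One caveat with a plain astype('category') is that the categories are inferred again from the values present in the result, so unused categories and any ordering are not carried over. A minimal sketch of the difference, assuming the original frame is still available; the small frames here are invented for illustration, and the object column only simulates a merge result that has lost its dtype:

```python
import pandas as pd

# Original frame with an ordered categorical that has an unused category 'b'.
df2 = pd.DataFrame({
    'ID': [0, 1, 2],
    'value2': pd.Categorical(['c', 'a', 'c'],
                             categories=['a', 'b', 'c'], ordered=True),
})

# Simulated merge result whose column has fallen back to object dtype.
result = pd.DataFrame({
    'ID': [0, 0, 2],
    'value2': pd.Series(['c', 'c', 'c'], dtype=object),
})

# Categories inferred from the values that happen to be present.
inferred = result['value2'].astype('category')

# Categories and ordering taken from the original column's dtype.
restored = result['value2'].astype(df2['value2'].dtype)

print(list(inferred.cat.categories), inferred.cat.ordered)
print(list(restored.cat.categories), restored.cat.ordered)
```

Passing the original column's dtype keeps the result comparable with the source frame. Note that values missing from those categories would become NaN.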

As a workaround, you can convert the categorical columns to integer code values and save the mapping from columns to categories in a dict. For instance,

    def decat(df):
        """
        Convert categorical columns to (integer) codes; return the categories in catmap
        """
        catmap = dict()
        for col, dtype in df.dtypes.iteritems():
            if com.is_categorical_dtype(dtype):
                c = df[col].cat
                catmap[col] = c.categories
                df[col] = c.codes
        return df, catmap

    In [304]: df
    Out[304]:
       ID value2
    0   5      c
    1   3      a
    2   6      c
    3   7      a
    4   0      c
    5   4      b
    6   8      b
    7   2      a
    8   9      a
    9   1      b

    In [305]: df, catmap = decat(df)

    In [306]: df
    Out[306]:
       ID  value2
    0   5       2
    1   3       0
    2   6       2
    3   7       0
    4   0       2
    5   4       1
    6   8       1
    7   2       0
    8   9       0
    9   1       1

    In [307]: catmap
    Out[307]: {'value2': Index([u'a', u'b', u'c'], dtype='object')}

Now you can merge as usual, since there is no problem merging integer-valued columns.

Later, you can reconstitute the categorical columns using the data in catmap:

    def recat(df, catmap):
        """
        Use catmap to reconstitute columns in df to categorical dtype
        """
        for col, categories in catmap.iteritems():
            df[col] = pd.Categorical(categories[df[col]])
            df[col].cat.categories = categories
        return df

    import numpy as np
    import pandas as pd
    import pandas.core.common as com

    df1 = pd.DataFrame(
        {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
                        dtype='int32'),
         'value1': pd.Categorical(np.random.randint(0, 2, 20))})
    df2 = pd.DataFrame(
        {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1], dtype='int32'),
         'value2': pd.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

    def decat(df):
        """
        Convert categorical columns to (integer) codes; return the categories in catmap
        """
        catmap = dict()
        for col, dtype in df.dtypes.iteritems():
            if com.is_categorical_dtype(dtype):
                c = df[col].cat
                catmap[col] = c.categories
                df[col] = c.codes
        return df, catmap

    def recat(df, catmap):
        """
        Use catmap to reconstitute columns in df to categorical dtype
        """
        for col, categories in catmap.iteritems():
            df[col] = pd.Categorical(categories[df[col]])
            df[col].cat.categories = categories
        return df

    def mergecat(left, right, *args, **kwargs):
        left, left_catmap = decat(left)
        right, right_catmap = decat(right)
        left_catmap.update(right_catmap)
        result = pd.merge(left, right, *args, **kwargs)
        return recat(result, left_catmap)

    result = mergecat(df1, df2, on='ID')
    result.info()

gives

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 3 columns):
    ID        20 non-null int32
    value1    20 non-null category
    value2    20 non-null category
    dtypes: category(2), int32(1)
    memory usage: 320.0 bytes
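The script above was written against an older pandas release. Calls such as iteritems, pandas.core.common.is_categorical_dtype and assigning to .cat.categories may no longer be available in recent versions. The following is a hedged adaptation of the same decat/merge/recat idea, not the answer's original code. It assumes a recent pandas, that the categorical columns are not merge keys, and that column names do not collide between the two frames (otherwise the merge suffixes would rename them). It stores each column's CategoricalDtype and rebuilds the column with pd.Categorical.from_codes.

```python
import numpy as np
import pandas as pd

def decat(df):
    """Return a copy of df with categorical columns replaced by integer codes,
    plus a dict mapping each such column name to its CategoricalDtype."""
    df = df.copy()  # leave the caller's frame untouched
    catmap = {}
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            catmap[col] = dtype
            df[col] = df[col].cat.codes  # missing values become code -1
    return df, catmap

def recat(df, catmap):
    """Rebuild categorical columns in df from their integer codes."""
    for col, dtype in catmap.items():
        # Rows without a match (e.g. after an outer merge) hold NaN codes;
        # map them to -1, which from_codes treats as a missing value.
        codes = df[col].fillna(-1).astype('int64').to_numpy()
        df[col] = pd.Categorical.from_codes(codes, dtype=dtype)
    return df

def mergecat(left, right, *args, **kwargs):
    """Merge on integer codes, then restore the categorical dtypes."""
    left, left_catmap = decat(left)
    right, right_catmap = decat(right)
    result = pd.merge(left, right, *args, **kwargs)
    return recat(result, {**left_catmap, **right_catmap})

df1 = pd.DataFrame(
    {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
     'value1': pd.Categorical(np.random.default_rng(0).integers(0, 2, 20))})
df2 = pd.DataFrame(
    {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1],
     'value2': pd.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

result = mergecat(df1, df2, on='ID')
print(result.dtypes)
```

Using from_codes with the stored dtype avoids looking labels up by position and keeps unused categories and the ordered flag.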

Here is the code snippet for recovering category metadata:

    import pandas

    def copy_category_metadata(df_with_categories, df_without_categories):
        for col_name, dtype in df_with_categories.dtypes.items():
            if str(dtype) == "category":
                if col_name in df_without_categories.columns:
                    if str(df_without_categories[col_name].dtype) == "category":
                        print("{} - Already a category".format(col_name))
                    else:
                        print("{} - Making a category".format(col_name))
                        # make the column into a Categorical using the other dataframe metadata
                        df_without_categories[col_name] = pandas.Categorical(
                            df_without_categories[col_name],
                            categories=df_with_categories[col_name].cat.categories,
                            ordered=df_with_categories[col_name].cat.ordered)

Usage example:

    dfA  # some data frame with categories
    dfB  # another data frame
    df_merged = dfA.merge(dfB)  # merge result, no categories
    copy_category_metadata(dfA, df_merged)

Source: https://habr.com/ru/post/1216192/

