Combinatorial explosion when merging a list of DataFrames in pandas

I am trying to combine a series of DataFrames in pandas. I have a list of DataFrames, dfs, and a list of their respective labels, labels, and I want to merge all the dfs into one DataFrame such that columns shared between DataFrames get a suffix taken from the corresponding entry in labels, i.e.:

    def mymerge(dfs, labels):
        labels_dict = dict([(d, l) for d, l in zip(dfs, labels)])
        merged_df = reduce(lambda x, y: pandas.merge(x, y,
                                                     suffixes=[labels_dict[x],
                                                               labels_dict[y]]),
                           dfs)
        return merged_df

When I try to do this, I get an error message:

    pandas.tools.merge.MergeError: Combinatorial explosion! (boom)

I am trying to do a series of merges where each merge grows the result by no more than N columns, where N is the number of columns in the "next" df in the list. The final DataFrame should have as many columns as all the dfs' columns added together, so growth should be additive, not combinatorial.

The behavior I'm looking for is: merge the dfs on the column names that are specified (e.g. given by on=) or on the dfs' indexes. Take the union of the non-common column names (as in an outer join). If a column appears in multiple dfs, optionally overwrite it. Looking more at the docs, it seems update might be the closest fit, although when I try join='outer', it raises an exception saying that it is not implemented.

EDIT :

Here is my attempt at implementing this, which does not handle suffixes, but illustrates the kind of merge I'm looking for:

    def my_merge(dfs_list, on):
        """ list of dfs, columns to merge on. """
        my_df = dfs_list[0]
        for right_df in dfs_list[1:]:
            # Only take the columns from the right df that are not
            # already in the combined df (i.e. new), or that are part
            # of the columns to join on
            new_noncommon_cols = [c for c in right_df
                                  if (c not in my_df.columns) or
                                     (c in on)]
            my_df = pandas.merge(my_df,
                                 right_df[new_noncommon_cols],
                                 left_index=True,
                                 right_index=True,
                                 how="outer",
                                 on=on)
        return my_df

The idea is that the merge happens on the indexes of each of the dfs. New columns are added outer-join style, but columns that are shared (and not part of the index) are used in the join via the on= keyword.

Example:

    df1 = pandas.DataFrame([{"employee": "bob", "gender": "male", "bob_id1": "a"},
                            {"employee": "john", "gender": "male", "john_id1": "x"}])
    df1 = df1.set_index("employee")
    df2 = pandas.DataFrame([{"employee": "mary", "gender": "female", "mary_id1": "c"},
                            {"employee": "bob", "gender": "male", "bob_id2": "b"}])
    df2 = df2.set_index("employee")
    df3 = pandas.DataFrame([{"employee": "mary", "gender": "female", "mary_id2": "d"}])
    df3 = df3.set_index("employee")
    merged = my_merge([df1, df2, df3], on=["gender"])
    print "MERGED: "
    print merged

A twist on this would be one where the columns that are common get an optional suffix per df, based on a set of labels, but that is less important. Can the merge operation described above be done more elegantly in pandas, or does it already exist as a built-in?

python numpy scipy pandas
2 answers

The result of your method:

    In [29]: merged
    Out[29]:
             bob_id1  gender john_id1 bob_id2 mary_id1 mary_id2
    employee
    bob            a    male      NaN       b      NaN      NaN
    john         NaN    male        x     NaN      NaN      NaN
    mary         NaN  female      NaN     NaN        c        d

Solution with pandas built-in df.combine_first :

    In [28]: reduce(lambda x, y: x.combine_first(y), [df1, df2, df3])
    Out[28]:
             bob_id1 bob_id2  gender john_id1 mary_id1 mary_id2
    employee
    bob            a       b    male      NaN      NaN      NaN
    john         NaN     NaN    male        x      NaN      NaN
    mary         NaN     NaN  female      NaN        c        d

To add a suffix to the columns of each frame, I would suggest renaming the columns before calling combine_first .
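For instance, a minimal sketch of that renaming approach (the two frames, the labels list, and the `_d1`/`_d2` suffixes here are made-up illustrations, not the question's data):

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing an index and an "id" column
df1 = pd.DataFrame({"gender": ["male"], "id": ["a"]}, index=["bob"])
df2 = pd.DataFrame({"gender": ["male"], "id": ["b"]}, index=["bob"])
labels = ["_d1", "_d2"]

# Suffix each frame's non-key columns with its label before combining,
# so shared columns stay distinguishable after combine_first
renamed = [
    df.rename(columns=lambda c, suf=suf: c if c == "gender" else c + suf)
    for df, suf in zip([df1, df2], labels)
]
combined = reduce(lambda x, y: x.combine_first(y), renamed)
# combined now has the columns gender, id_d1, id_d2
```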

Alternatively, you might want to look at an operation like pd.concat([df1, df2, df3], keys=['d1', 'd2', 'd3'], axis=1), which creates a DataFrame with MultiIndex columns. In that case you may want to consider making gender part of the index, or living with its duplication.
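A minimal sketch of that keyed concat (the two frames and the keys here are made-up illustrations):

```python
import pandas as pd

# Hypothetical frames sharing an index and a "gender" column
df1 = pd.DataFrame({"gender": ["male"], "bob_id1": ["a"]}, index=["bob"])
df2 = pd.DataFrame({"gender": ["male"], "bob_id2": ["b"]}, index=["bob"])

# Keyed concat along columns: each input frame sits under its key in a
# (key, column) MultiIndex instead of being merged column-wise
wide = pd.concat([df1, df2], keys=["d1", "d2"], axis=1)
# Columns: ('d1', 'gender'), ('d1', 'bob_id1'),
#          ('d2', 'gender'), ('d2', 'bob_id2')
# Note that gender appears once per key, i.e. it is duplicated.
```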


from source code:

    max_groups = 1L
    for x in group_sizes:
        max_groups *= long(x)

    if max_groups > 2**63:  # pragma: no cover
        raise Exception('Combinatorial explosion! (boom)')

And in the same file

    # max groups = largest possible number of distinct groups
    left_key, right_key, max_groups = self._get_group_keys()

The line max_groups *= long(x) shows that the possible number of groups grows multiplicatively rather than additively, which is what triggers the error.
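To see why multiplicative growth explodes, consider a toy merge (made-up data) on a key that is duplicated on both sides:

```python
import pandas as pd

# Three rows with key "x" on each side: the merge pairs every left row
# with every right row within that key, so row counts multiply
# (3 * 3 = 9) rather than add (3 + 3 = 6).
left = pd.DataFrame({"k": ["x"] * 3, "l": range(3)})
right = pd.DataFrame({"k": ["x"] * 3, "r": range(3)})

merged = pd.merge(left, right, on="k")
print(len(merged))  # 9
```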

