I am trying to combine a series of data in pandas. I have a list of dfs, dfs and a list of their respective labels labels , and I want to combine all dfs into 1 df, so that the common labels from df will get the suffix from its label to the labels list. i.e:.
def mymerge(dfs, labels): labels_dict = dict([(d, l) for d, l in zip(dfs, labels)]) merged_df = reduce(lambda x, y: pandas.merge(x, y, suffixes=[labels_dict[x], labels_dict[y]]), dfs) return merged_df
When I try to do this, I get an error message:
pandas.tools.merge.MergeError: Combinatorial explosion! (boom)
I am trying to make a series of merges, which at each merge grow no more than the number of columns N, where N is the number of columns in the "next" df in the list. The last DF should have as many columns as all the df columns added together, so they will grow additively and not be combinatorial.
The behavior I'm looking for is: Attach dfs to the column names that are specified (for example, indicated by on= ), or index dfs indexes. Decompose non-common column names (as in the outer join). If a column appears in multiple dfs files, optionally overwrite it. If you look more at the docs, it seems that update might be the best way to do this. Although, when I try to join='outer' , it raises an exception message that it is not implemented.
EDIT :
Here is my attempt at implementing this, which does not handle suffixes, but illustrates the kind of merge I'm looking for:
def my_merge(dfs_list, on): """ list of dfs, columns to merge on. """ my_df = dfs_list[0] for right_df in dfs_list[1:]:
This suggests that the merge occurs at the indices of each of the dfs. New columns are added in the outer join style, but columns that are shared (and not part of the index) are used in the join using the on= keyword.
Example:
df1 = pandas.DataFrame([{"employee": "bob", "gender": "male", "bob_id1": "a"}, {"employee": "john", "gender": "male", "john_id1": "x"}]) df1 = df1.set_index("employee") df2 = pandas.DataFrame([{"employee": "mary", "gender": "female", "mary_id1": "c"}, {"employee": "bob", "gender": "male", "bob_id2": "b"}]) df2 = df2.set_index("employee") df3 = pandas.DataFrame([{"employee": "mary", "gender": "female", "mary_id2": "d"}]) df3 = df3.set_index("employee") merged = my_merge([df1, df2, df3], on=["gender"]) print "MERGED: " print merged
A twist on this will be one where you randomly mark the suffix for each df based on a set of labels for columns that are common, but this is less important. Is the merge operation described above more elegant in pandas or already existing as inline?