Pandas join without replacement

It's a little hard to explain, but I'll try my best. I have two tables that I need to join, but there is no unique identifier connecting them. I join on several columns, which is the best I can do, and I just want to know when the two sides of the join don't have equal numbers of rows. Right now, if the right table has 1 record that matches 2 records in the left table, that 1 record is joined to both of them. This leaves me unaware that the right table has only 1 entry versus 2 on the left.

I want to outer join the right table to the left, but I don't want any right-table record to be joined more than once. So if right-table index 3 could join to indexes 1 and 2 on the left, I want it attached to index 1 only. Likewise, if indexes 3 and 4 could both join to indexes 1 and 2, I want index 1 to match index 3 and index 2 to match index 4. If there is only 1 match (index 1 → 3) but index 2 in the left table could also match index 3, I want index 2 to stay unjoined.

Examples can best describe this:

    a_df = pd.DataFrame.from_dict({1: {'match_id': 2, 'uniq_id': 1}, 2: {'match_id': 2, 'uniq_id': 2}}, orient='index')

    In [99]: a_df
    Out[99]:
       match_id  uniq_id
    1         2        1
    2         2        2

    In [100]: b_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 2, 'uniq_id': 4}}, orient='index')

    In [101]: b_df
    Out[101]:
       match_id  uniq_id
    3         2        3
    4         2        4

In this example, I want a_df to join b_df. I want b_df uniq_id 3 to match a_df uniq_id 1 and b_df 4 to a_df 2.

The result will look like this:

    Out[106]:
       match_id_right  match_id  uniq_id  uniq_id_right
    1               2         2        1              3
    2               2         2        2              4

Now suppose we want to join a_df to c_df:

    In [104]: c_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 3, 'uniq_id': 4}}, orient='index')

    In [105]: c_df
    Out[105]:
       match_id  uniq_id
    3         2        3
    4         3        4

In this case, a_df has two rows with match_id 2, but c_df has only one.

So I just want uniq_id 1 to match uniq_id 3, leaving uniq_id 2 and uniq_id 4 unmatched:

       match_id_right  match_id  uniq_id  uniq_id_right
    1               2         2        1              3
    2             NaN         2        2            NaN
    4               3       NaN      NaN              4
1 answer

Ok guys, so the answer is actually pretty simple.

What you need to do is group each data frame (left and right) by the columns you are matching on, and then add a counter column that numbers the rows within each group.

Now do an outer join and include the counter column in the join keys. If both sides have counters 0 and 1, they pair up, but if the right side also has a 2, that record does not match. If the left side has only 0, it will match the right side's 0, but if the right side has 0 and 1, then the right record "1" does not match!

Edit: code was requested.

I don't have anything handy, but it is very simple. If you have, say, 2 columns that you match on, ['amount', 'date'], then you just do

    left_df['Helper'] = left_df.groupby(['amount','date']).cumcount()
    right_df['RHelper'] = right_df.groupby(['amount','date']).cumcount()

Then include the helper columns in the join keys (Helper on the left, RHelper on the right).
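For completeness, here is a minimal sketch of the whole approach applied to the a_df and c_df frames from the question. The `suffixes` and `indicator=True` arguments are my own additions for readability and are not part of the original answer; the join key here is the single `match_id` column rather than ['amount', 'date'].

```python
import pandas as pd

a_df = pd.DataFrame.from_dict(
    {1: {'match_id': 2, 'uniq_id': 1}, 2: {'match_id': 2, 'uniq_id': 2}},
    orient='index',
)
c_df = pd.DataFrame.from_dict(
    {3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 3, 'uniq_id': 4}},
    orient='index',
)

# Number the rows within each match_id group: 0, 1, 2, ...
left = a_df.assign(Helper=a_df.groupby('match_id').cumcount())
right = c_df.assign(RHelper=c_df.groupby('match_id').cumcount())

# Outer join on the match column plus the counter, so each right row
# can pair with at most one left row (and vice versa).
merged = left.merge(
    right,
    how='outer',
    left_on=['match_id', 'Helper'],
    right_on=['match_id', 'RHelper'],
    suffixes=('', '_right'),  # assumption: suffix only the right-hand columns
    indicator=True,           # assumption: adds a _merge column for inspection
)
print(merged)
```

Rows where `_merge` is `left_only` or `right_only` are the records that found no partner, which is exactly the imbalance the question wants to detect. Note that `merge` resets the index, so keep `uniq_id` (or copy the original index into a column first) if you need to trace rows back.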
