Pandas - merge nearly duplicate rows based on column value

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row without summing the numerical values.

Here is an example of what I'm working with:

    Name  Sid   Use_Case  Revenue
    A     xx01  Voice     $10.00
    A     xx01  SMS       $10.00
    B     xx02  Voice     $5.00
    C     xx03  Voice     $15.00
    C     xx03  SMS       $15.00
    C     xx03  Video     $15.00

And here is what I would like:

    Name  Sid   Use_Case           Revenue
    A     xx01  Voice, SMS         $10.00
    B     xx02  Voice              $5.00
    C     xx03  Voice, SMS, Video  $15.00

The reason I don't want to sum the "Revenue" column is that my table is the result of a pivot over several time periods, where "Revenue" simply ends up listed multiple times instead of having a different value per "Use_Case".

What would be the best way to solve this problem? I looked at the groupby() function, but I still don't understand it very well.

5 answers

I think you can use groupby with agg, taking 'first' for Sid and Revenue and the custom function ', '.join for Use_Case:

    df = df.groupby('Name').agg({'Sid': 'first',
                                 'Use_Case': ', '.join,
                                 'Revenue': 'first'}).reset_index()

    # change column order
    print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

      Name   Sid           Use_Case Revenue
    0    A  xx01         Voice, SMS  $10.00
    1    B  xx02              Voice   $5.00
    2    C  xx03  Voice, SMS, Video  $15.00

A nice idea from the comments, thanks to Goyo:

    df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()

    # change column order
    print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

      Name   Sid           Use_Case Revenue
    0    A  xx01         Voice, SMS  $10.00
    1    B  xx02              Voice   $5.00
    2    C  xx03  Voice, SMS, Video  $15.00
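For anyone who wants to run both variants end to end, here is a minimal self-contained sketch. It assumes the sample data from the question with Revenue stored as strings; the names `merged` and `merged_alt` are only illustrative.

```python
import pandas as pd

# Sample data from the question; Revenue is assumed to be stored as strings.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': ['$10.00', '$10.00', '$5.00', '$15.00', '$15.00', '$15.00'],
})

# Variant 1: group by Name only, keep the first Sid/Revenue, join the use cases.
merged = (df.groupby('Name')
            .agg({'Sid': 'first', 'Use_Case': ', '.join, 'Revenue': 'first'})
            .reset_index())

# Variant 2: group by every column that is identical within a group.
merged_alt = (df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case']
                .apply(', '.join)
                .reset_index())

print(merged[['Name', 'Sid', 'Use_Case', 'Revenue']])
print(merged_alt[['Name', 'Sid', 'Use_Case', 'Revenue']])
```

One difference worth keeping in mind: variant 1 keeps only the first Sid and Revenue per Name, while variant 2 puts rows with a different Sid or Revenue into separate groups instead of discarding those values.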

I was using some code that I didn't think was optimal, and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:

    cmnts = {}
    for i, row in df.iterrows():
        # retry the append after creating the list for a new Name
        while True:
            try:
                if row['Use_Case']:
                    cmnts[row['Name']].append(row['Use_Case'])
                else:
                    cmnts[row['Name']].append('n/a')
                break
            except KeyError:
                cmnts[row['Name']] = []

    # keep one row per Name, then overwrite Use_Case with the joined values
    df.drop_duplicates('Name', inplace=True)
    df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]

According to my 100-run timeit test, the iterate-and-replace method is roughly an order of magnitude faster than the groupby method.

    import pandas as pd
    from my_stuff import time_something

    df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                       'b': [i for i in range(1, 10001)]})

    runs = 100

    interim_dict = 'txt = {}\n' \
                   'for i, row in df.iterrows():\n' \
                   '    try:\n' \
                   "        txt[row['a']].append(row['b'])\n\n" \
                   '    except KeyError:\n' \
                   "        txt[row['a']] = []\n" \
                   "df.drop_duplicates('a', inplace=True)\n" \
                   "df['b'] = ['; '.join(v) for v in txt.values()]"

    grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

    print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
    print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

gives:

    Interim Dict
      Total: 59.1164s
      Avg: 591163748.5887ns

    Group By
      Total: 430.6203s
      Avg: 4306203366.1827ns

where time_something is a function that times a snippet with timeit and returns the result in the above format.
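Since time_something is not shown, a comparable measurement can be sketched with the standard timeit module alone. This is not the benchmark above: the test data, the function names `with_dict` and `with_groupby`, and the run count are all assumptions made for illustration. The sketch does not modify df between runs, and it checks that both approaches produce the same joined strings before timing them, so the two timings refer to the same work.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Hypothetical test data: 2,000 rows spread over 500 groups.
df = pd.DataFrame({
    'Name': [f'n{i % 500}' for i in range(2000)],
    'Use_Case': [f'u{i}' for i in range(2000)],
})


def with_dict(frame):
    """Collect values per Name in a plain dict, then build a new frame."""
    collected = defaultdict(list)
    for _, row in frame.iterrows():
        collected[row['Name']].append(row['Use_Case'])
    return pd.DataFrame({
        'Name': list(collected),
        'Use_Case': ['; '.join(v) for v in collected.values()],
    })


def with_groupby(frame):
    """Let pandas do the grouping and join each group's strings."""
    return (frame.groupby('Name', sort=False)['Use_Case']
                 .apply('; '.join)
                 .reset_index())


# Both approaches should agree before their timings are compared.
assert with_dict(df).values.tolist() == with_groupby(df).values.tolist()

runs = 5  # assumed run count, kept small so the sketch finishes quickly
for func in (with_dict, with_groupby):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f'{func.__name__}: {seconds / runs:.6f}s per run')
```

Which variant comes out ahead is likely to depend on the number of rows, the number of groups, and the pandas version, so it is worth measuring on data shaped like your own.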


You can use groupby and apply the list function:

    >>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
      Name   Sid Revenue                    0
    0    A  xx01  $10.00         [Voice, SMS]
    1    B  xx02   $5.00              [Voice]
    2    C  xx03  $15.00  [Voice, SMS, Video]

(If you are worried about duplicates, use set instead of list.)
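A small sketch of the difference, using made-up data in which A lists Voice twice. A set drops the duplicate but does not guarantee any order; if the first-seen order matters, `dict.fromkeys` is one alternative (my suggestion, not part of the answer above). The names `as_sets` and `as_ordered` are illustrative.

```python
import pandas as pd

# Hypothetical data with a repeated use case for A.
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B'],
    'Sid': ['xx01', 'xx01', 'xx01', 'xx02'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice'],
    'Revenue': ['$10.00', '$10.00', '$10.00', '$5.00'],
})

grouped = df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue])

# set removes duplicates but does not guarantee any ordering.
as_sets = grouped.apply(set).reset_index()

# dict.fromkeys removes duplicates and keeps first-seen order (Python 3.7+).
as_ordered = grouped.apply(lambda s: list(dict.fromkeys(s))).reset_index()

print(as_sets)
print(as_ordered)
```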


Follow-up question: I am new to pandas and I need to merge similar rows. However, in my case more than one column differs between the rows and needs to be joined.

I really liked Goyo's solution. Is there a similar solution for my case?
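One possible way to extend the same idea, offered as a sketch rather than a confirmed answer from this thread: group by the columns that are identical within a group and pass agg one ', '.join per column that differs. The Region column and the names `key_cols` and `join_cols` are hypothetical, and every joined column is assumed to hold strings.

```python
import pandas as pd

# Hypothetical data: both Use_Case and Region vary within a group.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'Sid': ['xx01', 'xx01', 'xx02'],
    'Use_Case': ['Voice', 'SMS', 'Voice'],
    'Region': ['EU', 'US', 'EU'],
    'Revenue': ['$10.00', '$10.00', '$5.00'],
})

key_cols = ['Name', 'Sid', 'Revenue']   # identical within each group
join_cols = ['Use_Case', 'Region']      # differ and should be joined

# Build one ', '.join aggregation per column that needs joining.
merged = (df.groupby(key_cols)
            .agg({col: ', '.join for col in join_cols})
            .reset_index())

print(merged)
```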


How can I join the Use_Case column into dictionary form? That is, I want the result in the form

    Name  Sid   Use_Case                   Revenue
    A     xx01  {Voice:1, SMS:1}           $10.00
    B     xx02  {Voice:1}                  $5.00
    C     xx03  {Voice:1, SMS:1, Video:1}  $15.00
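A possible sketch for this, assuming the numbers in the desired output are occurrence counts of each use case within a group. It iterates over the groups and counts with collections.Counter from the standard library; the names `rows` and `result` are illustrative, and this is not an answer taken from the thread.

```python
from collections import Counter

import pandas as pd

# Sample data from the question; Revenue is assumed to be stored as strings.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': ['$10.00', '$10.00', '$5.00', '$15.00', '$15.00', '$15.00'],
})

rows = []
# sort=False keeps the groups in the order they first appear.
for (name, sid, revenue), group in df.groupby(['Name', 'Sid', 'Revenue'], sort=False):
    rows.append({
        'Name': name,
        'Sid': sid,
        # Count how often each use case occurs within this group.
        'Use_Case': dict(Counter(group['Use_Case'])),
        'Revenue': revenue,
    })

result = pd.DataFrame(rows)
print(result)
```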