I had been using some code that I didn't consider optimal, and eventually found jezrael's answer. But after trying it and running a timeit comparison, I actually went back to what I was doing, namely:
cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            cmnts[row['Name']] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]
According to my timeit test (100 runs), the iterate-and-replace method is an order of magnitude faster than the groupby method.
import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
gives:
Interim Dict
Total: 59.1164s
Avg: 591163748.5887ns

Group By
Total: 430.6203s
Avg: 4306203366.1827ns
where time_something is a function that times a code fragment with timeit and returns the result in the format shown above.
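For reference, here is a minimal sketch of what such a helper could look like. The real time_something lives in my own my_stuff module and isn't shown in the post, so the signature and formatting below are assumptions based on the output above, not the actual implementation:

import timeit

def time_something(stmt, runs, beg_string='', glbls=None):
    """Hypothetical helper: run a code fragment `runs` times with timeit
    and format the total time and per-run average like the output above."""
    total_s = timeit.timeit(stmt, number=runs, globals=glbls)  # total seconds over all runs
    avg_ns = total_s / runs * 1e9                              # average per run, in nanoseconds
    return f'{beg_string}\nTotal: {total_s:.4f}s\nAvg: {avg_ns:.4f}ns'

Passing glbls=globals() lets the timed fragments see df, mirroring how the benchmark script above calls it.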