How to concatenate text in pandas groupby

I have a pandas DataFrame with a text column. I want to group this DataFrame and concatenate the text column within each group. Here is some code to generate an example DataFrame:

import numpy as np
import pandas as pd

import string
import random

def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))

items, clusters, texts = [], [], []
for item in range(200):
    for cluster in range(1000):
        for line in range(random.randint(1, 4)):
            items.append(item)
            clusters.append(cluster)
            texts.append(text_generator())
df = pd.DataFrame({'item_id': items, 'cluster_id': clusters, 'text': texts})

Now I group by the columns "item_id" and "cluster_id" and create a new data frame for the aggregated result:

# Pass the keys as a list; newer pandas versions treat a tuple as a single key.
grouped = df.groupby(['item_id', 'cluster_id'])
df_cluster = pd.DataFrame(grouped.size()).rename(columns={0: 'cluster_size'})

I may be wrong, but the obvious solution seems to be to fill in the text as follows:

df_cluster['texts'] = grouped.text.agg(lambda x: ' '.join(x))
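To make the intended result concrete, here is a minimal sketch on a hand-written four-row frame. The names `toy`, `toy_grouped` and `toy_cluster` are made up for this illustration, and `to_frame('cluster_size')` is used as a shorter stand-in for the `pd.DataFrame(...).rename(...)` step above.

```python
import pandas as pd

# Hypothetical toy data: three (item_id, cluster_id) groups, one with two rows.
toy = pd.DataFrame({
    'item_id': [0, 0, 0, 1],
    'cluster_id': [0, 0, 1, 0],
    'text': ['abc', 'def', 'ghi', 'jkl'],
})

toy_grouped = toy.groupby(['item_id', 'cluster_id'])

# One row per group, holding the number of rows in that group.
toy_cluster = toy_grouped.size().to_frame('cluster_size')

# Join the texts of each group with a space; the result aligns on the group index.
toy_cluster['texts'] = toy_grouped['text'].agg(' '.join)

print(toy_cluster)
```

The group (0, 0) ends up with a cluster size of 2 and the text 'abc def'; the other two groups keep their single text unchanged.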

But it takes about 10 seconds. For a few megabytes of data? Weird. So I tested a plain Python solution for this:

text_lookup = {}
for item_id, cluster_id, text in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    text_lookup.setdefault((item_id, cluster_id), []).append(text)
item_ids, cluster_ids, all_texts = [], [], []
for (item_id, cluster_id), texts in text_lookup.items():
    item_ids.append(item_id)
    cluster_ids.append(cluster_id)
    all_texts.append(' '.join([t for t in texts if t is not np.nan]))
df_tags = pd.DataFrame({'item_id': item_ids, 'cluster_id': cluster_ids, 'texts': all_texts}).set_index(['item_id', 'cluster_id'])
df_cluster = df_cluster.merge(df_tags, left_index=True, right_index=True)

This should be much slower because I am doing all these for loops in Python, but it only takes 3 seconds. I'm probably doing something wrong, but I don't know what :).
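For anyone who wants to reproduce the comparison, here is a rough sketch that times both approaches on the same data and checks that they produce the same strings. It assumes a much smaller frame than the one above (20 items and 50 clusters, chosen arbitrarily so it runs quickly), and names such as `small_df`, `joined_groupby` and `joined_dict` are invented for this example. Absolute timings will depend on the pandas version and the machine.

```python
import random
import string
import time

import pandas as pd


def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))


# Assumption: a scaled-down version of the example data (20 x 50 groups).
rows = [(item, cluster, text_generator())
        for item in range(20)
        for cluster in range(50)
        for _ in range(random.randint(1, 4))]
small_df = pd.DataFrame(rows, columns=['item_id', 'cluster_id', 'text'])

# Approach 1: groupby, then join the texts of each group.
start = time.perf_counter()
joined_groupby = small_df.groupby(['item_id', 'cluster_id'])['text'].agg(' '.join)
groupby_seconds = time.perf_counter() - start

# Approach 2: plain Python dict of lists, joined at the end.
start = time.perf_counter()
lookup = {}
for item_id, cluster_id, text in zip(small_df['item_id'].values,
                                     small_df['cluster_id'].values,
                                     small_df['text'].values):
    lookup.setdefault((item_id, cluster_id), []).append(text)
joined_dict = {key: ' '.join(texts) for key, texts in lookup.items()}
dict_seconds = time.perf_counter() - start

print(f'groupby: {groupby_seconds:.4f}s, dict: {dict_seconds:.4f}s')
```

A single `perf_counter` measurement is noisy, so repeating each approach several times (for example with the `timeit` module) would give a fairer comparison.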
