I have a pandas DataFrame with a text column. I want to group the frame and concatenate the text column per group - here is some code to generate an example DataFrame:
import numpy as np
import pandas as pd
import string
import random
def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))
items, clusters, texts = [], [], []
for item in range(200):
    for cluster in range(1000):
        for line in range(random.randint(1, 4)):
            items.append(item)
            clusters.append(cluster)
            texts.append(text_generator())
df = pd.DataFrame({'item_id': items, 'cluster_id': clusters, 'text': texts})
Now I group by the columns "item_id" and "cluster_id" and create a new DataFrame for the aggregated result:
grouped = df.groupby(['item_id', 'cluster_id'])
df_cluster = pd.DataFrame(grouped.size()).rename(columns={0: 'cluster_size'})
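For reference, the same construction on a toy frame (values are hypothetical); `Series.to_frame` names the size column in one step instead of renaming column 0:

```python
import pandas as pd

# Toy frame with the same columns as the generated example (hypothetical values).
df = pd.DataFrame({
    'item_id':    [0, 0, 0, 1],
    'cluster_id': [0, 0, 1, 0],
    'text':       ['aa', 'bb', 'cc', 'dd'],
})

grouped = df.groupby(['item_id', 'cluster_id'])
# grouped.size() is a Series; to_frame attaches the column name directly.
df_cluster = grouped.size().to_frame('cluster_size')
```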
I may be wrong, but the obvious solution seems to be to fill in the text as follows:
df_cluster['texts'] = grouped.text.agg(lambda x: ' '.join(x))
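The bound method `' '.join` can also be passed to `agg` directly, without the lambda wrapper; a minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy frame (hypothetical values) to illustrate the aggregation.
df = pd.DataFrame({
    'item_id':    [0, 0, 1],
    'cluster_id': [0, 0, 0],
    'text':       ['aa', 'bb', 'cc'],
})

grouped = df.groupby(['item_id', 'cluster_id'])
# ' '.join is a callable, so it can be handed to agg as-is.
texts = grouped.text.agg(' '.join)
```

The result is identical either way; the per-group joining still happens at Python level, so this mainly tidies the code rather than changing the cost profile.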
But it takes about 10 seconds. For a few megabytes of data? That seems odd, so I tested the plain-Python solution for this:
text_lookup = {}
for item_id, cluster_id, text in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    text_lookup.setdefault((item_id, cluster_id), []).append(text)
item_ids, cluster_ids, all_texts = [], [], []
for (item_id, cluster_id), texts in text_lookup.items():
    item_ids.append(item_id)
    cluster_ids.append(cluster_id)
    all_texts.append(' '.join([t for t in texts if t is not np.nan]))
df_tags = pd.DataFrame({'item_id': item_ids, 'cluster_id': cluster_ids, 'texts': all_texts}).set_index(['item_id', 'cluster_id'])
df_cluster = df_cluster.merge(df_tags, left_index=True, right_index=True)
This should be much slower, since it runs all these for loops in plain Python, yet it takes only about 3 seconds. I'm probably doing something wrong, but I can't see what :).
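As a sanity check on a toy frame (hypothetical values), both routes produce identical joined strings per (item_id, cluster_id) key, so the difference really is only speed:

```python
import pandas as pd

# Toy frame (hypothetical values).
df = pd.DataFrame({
    'item_id':    [0, 0, 1],
    'cluster_id': [0, 0, 0],
    'text':       ['aa', 'bb', 'cc'],
})

# pandas route: groupby + join per group
pandas_texts = df.groupby(['item_id', 'cluster_id']).text.agg(' '.join)

# dict-of-lists route: accumulate texts per key, then join
lookup = {}
for i, c, t in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    lookup.setdefault((i, c), []).append(t)
dict_texts = {key: ' '.join(parts) for key, parts in lookup.items()}

# both yield the same mapping from (item_id, cluster_id) to joined text
assert dict_texts == pandas_texts.to_dict()
```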