Starting from the following data frame df:
import pandas as pd
df = pd.DataFrame({'node': [1, 2, 3, 3, 3, 5, 5], 'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es']})
I am trying to build the following structure:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]  [1, 1]
That is, I want to group the lang values and their frequencies per node into a single row, as lists. What I have done so far:
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
and then I merge the two on node:
d = pd.merge(a,c,on='node')
After these operations, I got the following:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]     [1]
As you can see, the last row has only a single frequency, [1], for the two languages [uz, es], instead of the expected list [1, 1]. Is there a more concise way to do this that also gives the desired result?
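
For reference, the missing value appears to come from calling unique() on the lfreq column: node 5 has two counts that are both 1, and unique() collapses them into one. A minimal sketch of one possible approach is shown here, assuming the goal is one count per distinct lang, listed in order of first appearance. The names counts and result are only illustrative. It counts each (node, lang) pair with size() and then collects both columns with list, which keeps duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    'node': [1, 2, 3, 3, 3, 5, 5],
    'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es'],
})

# Count each (node, lang) pair once.
# sort=False keeps the pairs in order of first appearance.
counts = (
    df.groupby(['node', 'lang'], sort=False)
      .size()
      .reset_index(name='lfreq')
)

# Collect langs and counts into lists per node.
# list (unlike unique) keeps repeated values such as [1, 1].
result = (
    counts.groupby('node')
          .agg(langs=('lang', list), lfreq=('lfreq', list))
          .reset_index()
)

print(result)
```

Because both lists are built from the same intermediate frame, each count stays aligned with its language. In the original attempt, langs follows order of appearance while value_counts() orders by descending count within each node, so the two lists could end up misaligned on other data.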