First of all, I am very new to pandas and I try to bow so that thorough answers are appreciated.
I want to generate a pandas DataFrame representing a map witter tag subtoken -> poster, where tag subtoken means something in the set {hashtagA} U {i | i in split('_', hashtagA)}from the table correspondingposter -> tweet
For instance:
In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
In [2]: df
Out[2]:
0 1
0 jim i was like
1 jack You are so
2 neil Yo
And from this I want to get something like
0 1
0 jim yolo_omg
1 jim yolo
2 jim omg
3 jack yes_omg
4 jack yes
5 jack omg
6 jack best_place_ever
7 jack best
8 jack place
9 jack ever
10 neil rofl_so_funny
11 neil rofl
12 neil so
13 neil funny
I managed to construct this insight that actually does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
.apply(pd.Series).stack() \
.apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
.apply(pd.Series).stack().to_frame().reset_index(level=0) \
.join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]
Out[143]:
0 0_l
0 0 jim yolo_omg
1 jim yolo
2 jim omg
0 jack yes_omg
1 jack yes
2 jack omg
1 0 jack best_place_ever
1 jack best
2 jack place
3 jack ever
0 0 neil rofl_so_funny
1 neil rofl
2 neil so
3 neil funny
But I have a very strong feeling that there are much better ways to do this, especially considering that the set of real data sets is huge.
source
share