Pandas - External DataFrame Extension

First of all, I am very new to pandas and still learning, so thorough answers are appreciated.

I want to generate a pandas DataFrame representing a map Twitter tag subtoken -> poster, where a tag subtoken is any element of the set {hashtagA} U {i | i in split('_', hashtagA)}, from a table mapping poster -> tweet.

For instance:

In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])

In [2]: df
Out[2]: 
      0                                     1
0   jim           i was like #yolo_omg to her
1  jack  You are so #yes_omg #best_place_ever
2  neil                     Yo #rofl_so_funny

And from this I want to get something like

       0                1
0    jim         yolo_omg
1    jim             yolo
2    jim              omg
3   jack          yes_omg
4   jack              yes
5   jack              omg
6   jack  best_place_ever
7   jack             best
8   jack            place
9   jack             ever
10  neil    rofl_so_funny
11  neil             rofl
12  neil               so
13  neil            funny

I managed to construct this expression, which actually does the job:

In [143]: df[1].str.findall(r'#([^\s]+)') \
    .apply(pd.Series).stack() \
    .apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
    .apply(pd.Series).stack().to_frame().reset_index(level=0) \
    .join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]

Out[143]: 
        0              0_l
0 0   jim         yolo_omg
  1   jim             yolo
  2   jim              omg
  0  jack          yes_omg
  1  jack              yes
  2  jack              omg
1 0  jack  best_place_ever
  1  jack             best
  2  jack            place
  3  jack             ever
0 0  neil    rofl_so_funny
  1  neil             rofl
  2  neil               so
  3  neil            funny

But I have a very strong feeling that there are much better ways to do this, especially considering that the real data set is huge.

2 answers

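pandas has vectorized string methods for this. Series.str.findall() returns, for each row, the list of all matches of a regular expression.

Starting from your example data: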

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])

To make the code easier to read, you can name the columns:

df.columns = ['user', 'tweet']

Or do both in one step:

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]], columns=['user', 'tweet'])

Then extract the hashtags into a new column:

df['tag'] = df["tweet"].str.findall("(#[^ ]*)")

This gives, for each tweet, a list of its hashtags (including the leading #).
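
To get from the per-tweet tag lists to the one-row-per-subtoken table asked for in the question, one option is DataFrame.explode. This is a minimal sketch, assuming pandas 0.25 or later (where explode is available). The helper name subtokens and the `#(\S+)` pattern (which drops the leading #) are choices made for this example, not something from the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [["jim", "i was like #yolo_omg to her"],
     ["jack", "You are so #yes_omg #best_place_ever"],
     ["neil", "Yo #rofl_so_funny"]],
    columns=["user", "tweet"],
)

def subtokens(tag):
    # Hypothetical helper: the full tag, plus its '_'-separated parts if any.
    return [tag] + tag.split("_") if "_" in tag else [tag]

result = (
    df.assign(tag=df["tweet"].str.findall(r"#(\S+)"))
      .explode("tag")                 # one row per hashtag
      .dropna(subset=["tag"])         # tweets without hashtags become NaN
      .reset_index(drop=True)
      .assign(tag=lambda d: d["tag"].map(subtokens))
      .explode("tag")                 # one row per subtoken
      .reset_index(drop=True)[["user", "tag"]]
)
print(result)
```

Whether this is faster than the chained .apply(pd.Series).stack() version on a large data set would need to be measured; it mainly avoids building wide intermediate frames.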

How about doing the processing in plain Python and then passing the result to pandas?

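import re
import pandas as pd

# extract the hashtags from each tweet (one list per poster)
tags = [re.findall(r'#([^\s]+)', t) for t in df[1]]
# build the subtokens: the full tags first, then the parts of each tag
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]
# build the result, indexed by poster
df2 = pd.DataFrame(subtokens, index=df[0]).stack()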

In [250]: df2
Out[250]: 
jim   0           yolo_omg
      1               yolo
      2                omg
jack  0            yes_omg
      1    best_place_ever
      2                yes
      3                omg
      4               best
      5              place
      6               ever
neil  0      rofl_so_funny
      1               rofl
      2                 so
      3              funny
dtype: object
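
One caveat with the list-comprehension version above: a tag with no underscore (say #yolo) would be listed twice, once as the full tag and once as its only "part". The following variant is a sketch that guards against that and keeps each tag next to its own subtokens. The names subtokens, rows, poster and subtoken are illustrative choices, and the integer column labels 0 and 1 are assumed from the question's DataFrame.

```python
import re
import pandas as pd

df = pd.DataFrame([
    ["jim", "i was like #yolo_omg to her"],
    ["jack", "You are so #yes_omg #best_place_ever"],
    ["neil", "Yo #rofl_so_funny"],
])

def subtokens(tag):
    # Full tag plus its parts, but only when splitting yields more than one part.
    parts = tag.split("_")
    return [tag] + parts if len(parts) > 1 else [tag]

# One (poster, subtoken) pair per row, in tweet order.
rows = [
    (poster, token)
    for poster, tweet in zip(df[0], df[1])
    for tag in re.findall(r"#(\S+)", tweet)
    for token in subtokens(tag)
]
df2 = pd.DataFrame(rows, columns=["poster", "subtoken"])
print(df2)
```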