Count the 100 most frequent words across sentences in a Pandas DataFrame

Question


I have text reviews in one column of a Pandas DataFrame, and I want to count the N most common words along with their frequencies (across the whole column, NOT within a single cell). One approach is to count the words with a Counter by iterating over each row. Is there a better alternative?

Representative data.

  0 a heartening tale of small victories and endu
  1 no sophomore slump for director sam mendes w
  2 if you are an actor who can relate to the sea
  3 it this memory-as-identity obviation that g
  4 boyd screenplay ( co-written with guardian

+8

python pandas

swati saoji Apr 27 '15 at 18:11


2 answers

As an alternative to @Joran's solution, you can use Series.value_counts, which works well for large volumes of text:

  pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]

In the timings below, Series.value_counts is roughly 1.6 times faster than the Counter approach (27.1 ms versus 44.2 ms).

The test used a movie reviews dataset of 3,000 rows, about 400K characters and 70K words.

  In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
  10 loops, best of 3: 44.2 ms per loop

  In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
  10 loops, best of 3: 27.1 ms per loop
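A variant of the same idea, as a minimal sketch rather than the benchmarked code: instead of joining the column into one large string, it splits each row and flattens the result with Series.explode (this assumes pandas 0.25 or newer). The sample DataFrame and the names `words` and `top` are made up for illustration. One possible advantage is that missing values in the column do not break it, whereas ' '.join raises a TypeError on NaN.

```python
import pandas as pd

# Hypothetical sample data standing in for the reviews column
df = pd.DataFrame({"text": [
    "a heartening tale of small victories",
    "no sophomore slump for director sam mendes",
    "A tale of a director",
]})

# Lowercase, split each row on whitespace, then flatten to one word per row
words = df["text"].str.lower().str.split().explode()

# value_counts sorts by frequency (descending); keep the top 100
top = words.value_counts().head(100)
print(top)
```

Whether this is faster or slower than the join-based one-liner was not measured here, so time both on your own data before choosing.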

+14

Zero Apr 27 '15 at 19:21


Joran beasley · Accepted Answer · 2015-04-27T18:15:59+0000

 from collections import Counter
 Counter(" ".join(df["text"]).split()).most_common(100)

This should give you what you want (you may have to remove some non-words from the Counter result before calling most_common).
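One way to handle the non-words is to filter them out while tokenizing, before they ever reach the Counter. The sketch below is only an illustration: the helper name `top_words`, the sample `reviews` list, and the regular expression (lowercase letters and digits, with internal hyphens or apostrophes allowed) are all assumptions, and the right definition of a "word" depends on your data.

```python
import re
from collections import Counter

# Assumed definition of a word: letters/digits, optionally joined by
# internal hyphens or apostrophes (e.g. "co-written", "don't")
WORD_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def top_words(texts, n=100):
    """Count words across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        # Lowercase first so "Tale" and "tale" are counted together
        counts.update(WORD_RE.findall(text.lower()))
    return counts.most_common(n)

# Hypothetical sample data; any iterable of strings works
reviews = [
    "a heartening tale of small victories",
    "no sophomore slump for director sam mendes",
    "a tale of a director ( co-written )",
]
print(top_words(reviews, n=3))
```

With a DataFrame, the same function can be called as top_words(df["text"]), since a Series of strings is iterable. Stray tokens such as "(" never match the pattern, so they are dropped without a separate cleanup step.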
