Count the 100 most frequent words across all sentences in a pandas DataFrame column

I have text reviews in one column of a pandas DataFrame, and I want to find the N most common words along with their frequencies (across the whole column, NOT within a single cell). One approach is to loop over the rows and count the words with a Counter. Is there a better alternative?

Representative data:

 0    a heartening tale of small victories and endu
 1    no sophomore slump for director sam mendes w
 2    if you are an actor who can relate to the sea
 3    it this memory-as-identity obviation that g
 4    boyd screenplay ( co-written with guardian
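For reference, the row-by-row approach mentioned in the question might look roughly like the sketch below. The three-row `df` is a hypothetical stand-in for the review column (the real data is only shown truncated), and the names `word_counts` and `top_words` are illustrative.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the review column; not the real dataset.
df = pd.DataFrame({"text": [
    "a heartening tale of small victories",
    "no sophomore slump for director sam mendes",
    "a tale of a director",
]})

word_counts = Counter()
for review in df["text"]:
    # update() adds this row's word counts to the running totals
    word_counts.update(review.split())

# The question asks for the top N; N=3 keeps the toy output short.
top_words = word_counts.most_common(3)
print(top_words)
```

This works, but it runs a Python-level loop over every row, which is the part the question hopes to avoid.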
python pandas
2 answers
 from collections import Counter
 Counter(" ".join(df["text"]).split()).most_common(100)

I'm fairly sure this will give you what you want (you may have to remove some non-words from the Counter result before calling most_common).
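One way to drop non-words before counting is to filter the tokens with a regular expression. This is only a sketch: the sample `df` is hypothetical, and the pattern `WORD_RE` encodes an assumed definition of a "word" (lowercase letters, optionally joined by hyphens or apostrophes) that you would likely adjust for your own data.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample rows, loosely modelled on the data in the question.
df = pd.DataFrame({"text": [
    "boyd screenplay ( co-written with guardian",
    "it this memory-as-identity obviation that",
    "the screenplay , with the guardian",
]})

# Assumed definition of a word: letters, with internal hyphens or apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

tokens = " ".join(df["text"]).lower().split()
# Keep only tokens that match the pattern in full, so "(" and "," are dropped.
words = [t for t in tokens if WORD_RE.fullmatch(t)]

top_words = Counter(words).most_common(100)
print(top_words)
```

Filtering the token list first keeps the Counter itself free of punctuation, so most_common needs no post-processing.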


In addition to @Joran's solution, you can also use Series.value_counts, which works well for large volumes of text or many rows:

  pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100] 
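As a small self-contained illustration, the snippet below runs the value_counts approach on a hypothetical two-row `df`. It also shows an alternative built on `str.split` and `Series.explode` (this assumes pandas 0.25 or newer), which avoids building one large joined string. The variable names are illustrative, and the alternative was not part of the timings in this answer.

```python
import pandas as pd

# Hypothetical sample; the real review data is not reproduced here.
df = pd.DataFrame({"text": [
    "A heartening tale of small victories",
    "a tale of a director",
]})

# Approach from this answer: join all rows, lowercase, split, then count.
joined_counts = pd.Series(" ".join(df["text"]).lower().split()).value_counts()

# Alternative sketch: split each row into a list, then expand to one word per row.
exploded_counts = df["text"].str.lower().str.split().explode().value_counts()

# value_counts sorts by frequency in descending order, so head(100) is the top 100.
top_joined = joined_counts.head(100)
top_exploded = exploded_counts.head(100)
print(top_joined)
```

Both variants should produce the same counts; which one is faster on a given dataset is something to measure rather than assume.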

In the timings below, Series.value_counts is roughly 1.6 times faster than the Counter method (27.1 ms versus 44.2 ms).

The timings were measured on a movie reviews dataset of 3,000 lines, about 400K characters and 70K words:

 In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
 10 loops, best of 3: 44.2 ms per loop

 In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
 10 loops, best of 3: 27.1 ms per loop