Print the 10 most frequently occurring words in the text, including and excluding stop words

My question builds on code from another example, with my own changes. I have the following:

    import nltk

    def content_text(text):
        stopwords = nltk.corpus.stopwords.words('english')
        content = [w for w in text if w.lower() in stopwords]
        return content

How can I print the 10 most frequently occurring words in a text, 1) including and 2) excluding stop words?

3 answers

I'm not sure what `stopwords` is meant to be in your function — the local name shadows the import, and I presume the filter should be `not in` — but you can use a `Counter` with `most_common(10)` to get the 10 most frequent words:

    import nltk
    from collections import Counter
    from string import punctuation

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
        with_stp = Counter()
        without_stp = Counter()
        with open(text) as f:
            for line in f:
                spl = line.split()
                # update the count of all words in the line that are in stopwords
                with_stp.update(w.lower().rstrip(punctuation)
                                for w in spl if w.lower() in stopwords)
                # update the count of all words in the line that are not in stopwords
                without_stp.update(w.lower().rstrip(punctuation)
                                   for w in spl if w.lower() not in stopwords)
        # return the ten most common words from each as (word, count) pairs
        return with_stp.most_common(10), without_stp.most_common(10)

    wth_stop, wthout_stop = content_text(...)

If you pass in an nltk word list instead, just iterate over it:

    import nltk
    from collections import Counter

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))
        with_stp = Counter()
        without_stp = Counter()
        for word in text:
            word = word.lower()
            if word in stopwords:
                # update the count of words that are in stopwords
                with_stp.update([word])
            else:
                # update the count of words that are not in stopwords
                without_stp.update([word])
        # return the ten most common words from each
        return ([k for k, _ in with_stp.most_common(10)],
                [k for k, _ in without_stp.most_common(10)])

    print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk method includes punctuation characters, which may not be what you want.
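The `Counter` approach above can be condensed into a self-contained sketch that runs without the NLTK data files — here a small hard-coded stop-word set stands in for `nltk.corpus.stopwords.words('english')`, and `top_words` is a hypothetical helper name:

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is"}

def top_words(words, n=10):
    """Return the n most common stop words and non-stop words as (word, count) pairs."""
    with_stp = Counter()
    without_stp = Counter()
    for w in words:
        w = w.lower().rstrip(punctuation)  # normalize case, strip trailing punctuation
        if not w:
            continue
        if w in STOPWORDS:
            with_stp.update([w])
        else:
            without_stp.update([w])
    return with_stp.most_common(n), without_stp.most_common(n)

text = "The cat sat on the mat, and the dog sat in the hall.".split()
stop, nonstop = top_words(text, 3)
```

With real NLTK data you would replace `STOPWORDS` with `set(nltk.corpus.stopwords.words('english'))` and pass in the tokenized corpus.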


There is a FreqDist class in nltk for exactly this:

    import nltk

    allWords = nltk.tokenize.word_tokenize(text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    stopwords = nltk.corpus.stopwords.words('english')
    allWordExceptStopDist = nltk.FreqDist(
        w.lower() for w in allWords if w not in stopwords)

to extract the 10 most common:

    # most_common(10) returns a list of (word, count) tuples, which has no .keys()
    mostCommon = [word for word, count in allWordDist.most_common(10)]
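For reference, `most_common(10)` returns a list of `(word, count)` tuples rather than a dict view — and since `FreqDist` inherits from `collections.Counter`, a plain `Counter` demonstrates the same behavior without needing the NLTK corpora:

```python
from collections import Counter

# FreqDist subclasses collections.Counter, so the most_common API is identical
dist = Counter("the cat and the hat".split())
pairs = dist.most_common(2)                  # list of (word, count) tuples
most_common_words = [w for w, _ in pairs]    # extract just the words
```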

You can try the following:

    for word, frequency in allWordDist.most_common(10):
        print('%s;%d' % (word, frequency))
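A stdlib-only sketch of the same formatting loop, using a toy `Counter` in place of the tokenized distribution (note that `print(...).encode('utf-8')` from the original would fail, since `print` returns `None`):

```python
from collections import Counter

# Toy distribution standing in for the FreqDist built from a real text
dist = Counter(["apple", "apple", "pear"])

# Format each (word, count) pair as "word;count"
lines = ["%s;%d" % (word, freq) for word, freq in dist.most_common(2)]
print("\n".join(lines))
```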

Source: https://habr.com/ru/post/1212892/
