Print the 10 most frequently occurring words in the text, including and excluding stop words

My question builds on code from another example, with my own changes. I have the following:

    import nltk

    def content_text(text):
        stopwords = nltk.corpus.stopwords.words('english')
        content = [w for w in text if w.lower() in stopwords]
        return content

How can I print the 10 most frequently occurring words in a text, 1) including and 2) excluding stop words?

3 answers

I'm not sure what `stopwords` is meant to be in your function — the local name shadows the import, and I presume the filter should be `not in` — but you can use a `Counter` with `most_common(10)` to get the 10 most frequent words:

    import nltk
    from collections import Counter
    from string import punctuation

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
        with_stp = Counter()
        without_stp = Counter()
        with open(text) as f:
            for line in f:
                spl = line.split()
                # update the count of all words in the line that are in stopwords
                with_stp.update(w.lower().rstrip(punctuation)
                                for w in spl if w.lower() in stopwords)
                # update the count of all words in the line that are not in stopwords
                without_stp.update(w.lower().rstrip(punctuation)
                                   for w in spl if w.lower() not in stopwords)
        # return the ten most common words from each as (word, count) pairs
        return with_stp.most_common(10), without_stp.most_common(10)

    wth_stop, wthout_stop = content_text(...)

If you pass in an nltk word list instead, just iterate over it:

    import nltk
    from collections import Counter

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))
        with_stp = Counter()
        without_stp = Counter()
        for word in text:
            word = word.lower()
            if word in stopwords:
                # update the count of words that are in stopwords
                with_stp.update([word])
            else:
                # update the count of words that are not in stopwords
                without_stp.update([word])
        # return the ten most common words from each
        return ([k for k, _ in with_stp.most_common(10)],
                [k for k, _ in without_stp.most_common(10)])

    print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk method includes punctuation characters, which may not be what you want.
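The `Counter` approach above can be condensed into a self-contained sketch that runs without the NLTK data files — here a small hard-coded stop-word set stands in for `nltk.corpus.stopwords.words('english')`, and `top_words` is a hypothetical helper name:

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is"}

def top_words(words, n=10):
    """Return the n most common stop words and non-stop words as (word, count) pairs."""
    with_stp = Counter()
    without_stp = Counter()
    for w in words:
        w = w.lower().rstrip(punctuation)  # normalize case, strip trailing punctuation
        if not w:
            continue
        if w in STOPWORDS:
            with_stp.update([w])
        else:
            without_stp.update([w])
    return with_stp.most_common(n), without_stp.most_common(n)

text = "The cat sat on the mat, and the dog sat in the hall.".split()
stop, nonstop = top_words(text, 3)
```

With real NLTK data you would replace `STOPWORDS` with `set(nltk.corpus.stopwords.words('english'))` and pass in the tokenized corpus.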


There is a FreqDist class in nltk for exactly this:

    import nltk

    allWords = nltk.tokenize.word_tokenize(text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    stopwords = nltk.corpus.stopwords.words('english')
    allWordExceptStopDist = nltk.FreqDist(
        w.lower() for w in allWords if w not in stopwords)

to extract the 10 most common:

    # most_common(10) returns a list of (word, count) tuples, which has no .keys()
    mostCommon = [word for word, count in allWordDist.most_common(10)]
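For reference, `most_common(10)` returns a list of `(word, count)` tuples rather than a dict view — and since `FreqDist` inherits from `collections.Counter`, a plain `Counter` demonstrates the same behavior without needing the NLTK corpora:

```python
from collections import Counter

# FreqDist subclasses collections.Counter, so the most_common API is identical
dist = Counter("the cat and the hat".split())
pairs = dist.most_common(2)                  # list of (word, count) tuples
most_common_words = [w for w, _ in pairs]    # extract just the words
```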

You can try the following:

    for word, frequency in allWordDist.most_common(10):
        print('%s;%d' % (word, frequency))
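A stdlib-only sketch of the same formatting loop, using a toy `Counter` in place of the tokenized distribution (note that `print(...).encode('utf-8')` from the original would fail, since `print` returns `None`):

```python
from collections import Counter

# Toy distribution standing in for the FreqDist built from a real text
dist = Counter(["apple", "apple", "pear"])

# Format each (word, count) pair as "word;count"
lines = ["%s;%d" % (word, freq) for word, freq in dist.most_common(2)]
print("\n".join(lines))
```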

Source: https://habr.com/ru/post/1212892/
