Word Frequency in text using Python, but ignore stop words

This gives me the frequency of words in the text:

    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1
    finalFreq = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    self.response.out.write(finalFreq)

It also returns uninformative words like "the", "a", and "an".

My question is: does Python have a stop-word library that can remove all these common words? I want to run this on Google App Engine.

+4
python google-app-engine word-frequency
4 answers

You can download stop-word lists as files in various formats, e.g. from here. All Python needs to do is read the file (they are in CSV format, which is easy to read with the csv module), build a set, and use membership in that set (possibly with some normalization, such as lowercasing) to exclude words from the count.
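As a rough illustration of the steps described in this answer, here is a minimal Python 3 sketch. It assumes the stop words are simply the cells of a CSV file; the sample data, the filename in the comment, and the helper names `load_stop_words` and `word_frequencies` are made up for the example. It also uses `collections.Counter` in place of the question's `defaultdict`, which is a substitution rather than anything the answer prescribes.

```python
import csv
import io
import re
from collections import Counter


def load_stop_words(csv_file):
    """Read every cell of a CSV file-like object into a lowercase set."""
    stop_words = set()
    for row in csv.reader(csv_file):
        for cell in row:
            word = cell.strip().lower()
            if word:
                stop_words.add(word)
    return stop_words


def word_frequencies(text, stop_words):
    """Count the words in text, skipping stop words case-insensitively."""
    words = (w.lower() for w in re.findall(r'\w+', text))
    return Counter(w for w in words if w not in stop_words)


# In-memory stand-in for a downloaded list. With a real file you would
# pass open('stopwords.csv', newline='') instead (hypothetical filename).
sample_csv = io.StringIO("a,an,the\nof,and,to\n")
stop_words = load_stop_words(sample_csv)

freq = word_frequencies("The cat and the hat. A cat!", stop_words)
print(freq.most_common())  # [('cat', 2), ('hat', 1)]
```

Lowercasing both the list and the text is what lets "The" and "the" match the same entry; without it, capitalized stop words would slip through the filter.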

+5

There is an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):

    stopWords = set(['a', 'an', 'the', ...])
    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        if word not in stopWords:
            d[word] += 1
    finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)
    self.response.out.write(finalFreq)

This approach builds the sorted list in two steps: first it skips any word found in the stop-word list (converted to a set so that each membership test is fast), then it sorts the remaining entries by count.
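For trying the two steps outside a request handler, the following is a self-contained Python 3 sketch under a few assumptions: the three-word stop list is only a placeholder, the function name `sorted_frequencies` is invented for the example, and `items()` stands in for Python 2's `iteritems()`.

```python
import re
from collections import defaultdict

# Placeholder stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(['a', 'an', 'the'])


def sorted_frequencies(all_text, stop_words=STOP_WORDS):
    counts = defaultdict(int)
    # Step 1: count only the words that are not stop words.
    for word in re.findall(r'\w+', all_text):
        if word not in stop_words:
            counts[word] += 1
    # Step 2: sort the (word, count) pairs by count, highest first.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


result = sorted_frequencies("the dog saw the other dog chase a dog")
print(result[0])  # ('dog', 3)
```

Like the code in the answer, this comparison is case-sensitive, so "The" at the start of a sentence would still be counted unless the words are lowercased first.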

+3

NLTK has a package with corpora and stop-word lists for many languages, including English; see here for more information. NLTK also has a word frequency counter. It is a good natural language processing module and worth using for this.

+2

    stopwords = set(['an', 'a', 'the'])  # etc...
    finalFreq = sorted(((k, v) for k, v in d.iteritems() if k not in stopwords),
                       key=operator.itemgetter(1), reverse=True)

This filters out any keys that are in stopwords before sorting.

0
