Word Frequency in text using Python, but ignore stop words

Question

This gives me the frequency of words in the text:

    import re
    import operator
    from collections import defaultdict

    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1
    finalFreq = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    self.response.out.write(finalFreq)

It also counts useless words like "the", "an", and "a".

My question is: does Python have a stop word library that can remove all these common words? I want to run this on Google App Engine.

+4

python google-app-engine word-frequency

demos Jul 04 '10 at 3:06

4 answers

There is an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):

    stopWords = set(['a', 'an', 'the', ...])
    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        if word not in stopWords:
            d[word] += 1
    finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)
    self.response.out.write(finalFreq)

This approach builds the sorted list in two steps: first it skips any word found in the stop word list (stored as a set so that membership tests are fast), then it sorts the remaining entries by count.
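For readers on Python 3, where `iteritems()` no longer exists, the same idea can be sketched with `collections.Counter`. This is only an illustration: the function name `word_frequencies`, the tiny stop word set, and the lowercasing step are assumptions, not part of the answer above.

```python
import re
from collections import Counter

# Illustrative stop word set only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    # Lowercasing is an assumption so that "The" and "the" count as one word.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common()


if __name__ == "__main__":
    print(word_frequencies("The cat saw a cat and the dog"))
```

`Counter.most_common()` replaces the explicit `sorted(...)` call, and the generator expression does the filtering before anything is counted.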

+3

David z Jul 04 '10 at 3:19

I know that NLTK ships corpora and stop word lists for many languages, including English; see the NLTK documentation for more information. NLTK also has a word frequency counter. It is a good natural language processing module that you should consider using.
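A rough sketch of how NLTK's stop word corpus might be used. It assumes NLTK is installed and that its `stopwords` corpus has been downloaded (for example with `nltk.download('stopwords')`). Because that may not hold on every host, the hypothetical helper `load_stop_words` falls back to a tiny hand-written set. The helper names and the fallback are illustrative assumptions, not something this answer prescribes.

```python
import re
from collections import Counter


def load_stop_words(language="english"):
    """Return a set of stop words, preferring NLTK's corpus when available."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # NLTK or its stopwords corpus is missing: use a tiny illustrative set.
        return {"a", "an", "the", "and", "of", "to", "in"}


def content_word_counts(text):
    """Count the words in text that are not stop words."""
    stop_words = load_stop_words()
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


if __name__ == "__main__":
    print(content_word_counts("The cat and the hat").most_common())
```

NLTK's own `FreqDist` class could stand in for `Counter` here; `Counter` is used only to keep the sketch runnable when NLTK is absent.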

+2

Tarantula Jul 04 '10 at 3:45

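    stopwords = set(['an', 'a', 'the'])  # etc...
    # The generator expression needs its own parentheses because sorted()
    # also receives keyword arguments.
    finalFreq = sorted(
        ((k, v) for k, v in d.iteritems() if k not in stopwords),
        key=operator.itemgetter(1), reverse=True)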

This will filter out any keys that are in stopwords.

0

Amber Jul 04 '10 at 3:19

Alex Martelli · Accepted Answer · 2010-07-04T03:25:20+0000

You can download stop-word lists as files in various formats. All Python needs to do is read the file (the lists linked in the original answer are in CSV format, which is easy to read with the csv module), build a set, and use membership in that set (probably with some normalization, e.g. lowercasing) to exclude words from the count.
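The steps described here could look roughly like the following Python 3 sketch. The file layout (comma-separated words, possibly over several rows) and the function names `read_stop_words` and `count_words` are assumptions for illustration; a real downloaded list may be laid out differently.

```python
import csv
import io
import re
from collections import Counter


def read_stop_words(csv_file):
    """Collect every non-empty cell of a CSV file into a lowercase set."""
    stop_words = set()
    for row in csv.reader(csv_file):
        for cell in row:
            word = cell.strip().lower()
            if word:
                stop_words.add(word)
    return stop_words


def count_words(text, stop_words):
    """Count lowercased words in text, skipping members of stop_words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


if __name__ == "__main__":
    # Stand-in for a downloaded file; the real layout may differ.
    sample = io.StringIO("a,an,the\nand,of,to\n")
    stops = read_stop_words(sample)
    print(count_words("The name of the rose", stops).most_common())
```

With a file on disk, `read_stop_words(open(path, newline=""))` would take the place of the `io.StringIO` stand-in. Lowercasing both the list and the text keeps "The" and "the" from being treated as different words.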
