Python: best / most efficient way to find a list of words in a text?

I have a list of approximately 300 words and a huge amount of text that I want to scan to find out how many times each word appears.

I am using the re module from Python:

    for word in list_word:
        search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word)
        occurrences = search.subn("", text)[1]

but I want to know if there is a more efficient or more elegant way to do this?

+6
python regex
8 answers

If you have a huge amount of text, I would not use regular expressions in this case, but simply split the text:

    words = {"this": 0, "that": 0}
    for w in text.split():
        if w in words:
            words[w] += 1

The words dict will give you the frequency for each word.
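A variant of the same idea using collections.Counter (available since Python 2.7; the sample text and word list below are assumptions for illustration):

```python
from collections import Counter

text = "this and that and this"          # sample text (assumption)
list_word = ["this", "that"]             # sample word list (assumption)

# count every token once, then look up only the words we care about
counts = Counter(text.split())
frequencies = {w: counts[w] for w in list_word}
print(frequencies)  # {'this': 2, 'that': 1}
```

Counter returns 0 for missing keys, so words that never appear do not need special handling.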

+5

Try removing all punctuation from your text, then splitting it on whitespace. Then just do:

    for word in list_word:
        occurrences = strippedText.count(word)

Or, if you are using Python 3.0, I think you could do:

    occurrences = {word: strippedText.count(word) for word in list_word}
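A concrete sketch of this approach, assuming Python 3 and counting whole tokens rather than substrings (the sample text is an assumption; note that str.count on the raw stripped string would also match words embedded in longer words):

```python
import string

text = "The cat sat. The cat, the dog!"   # sample text (assumption)
list_word = ["cat", "dog"]

# remove punctuation, then split on whitespace
stripped = text.translate(str.maketrans("", "", string.punctuation))
tokens = stripped.split()

# counting tokens (not substrings) avoids "cat" matching "catalog"
occurrences = {word: tokens.count(word) for word in list_word}
print(occurrences)  # {'cat': 2, 'dog': 1}
```

Note that list.count rescans the whole token list once per word, so for 300 words a single-pass dict or Counter is cheaper.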
+1

Googling "python frequency" gives me this page as the first result: http://www.daniweb.com/code/snippet216747.html

It seems to be what you are looking for.

0

You can also split the text into words and search the resulting list.

0

Regular expressions may not be what you want. Python has a number of built-in string operations that are much faster, and I believe .count() has what you need.

http://docs.python.org/library/stdtypes.html#string-methods
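One caveat worth keeping in mind with this suggestion (the sample text below is an assumption): str.count matches substrings, not whole words, so splitting first gives whole-word counts.

```python
text = "the cat saw a catalog"  # sample text (assumption)

# str.count matches substrings: "cat" is found in both "cat" and "catalog"
print(text.count("cat"))          # 2

# counting tokens after split() matches whole words only
print(text.split().count("cat"))  # 1
```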

0

If Python is optional, you can use awk:

    $ cat file
    word1
    word2
    word3
    word4
    $ cat file1
    blah1 blah2 word1 word4 blah3 word2 junk1 junk2 word2 word1 junk3 blah4 blah5 word3 word6 end
    $ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}
           END{for(i in w){ if(i in a) print i,a[i] } }' file file1
    word1 2
    word2 2
    word3 1
    word4 1
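The same two-file logic (a word list file, then a text file to count) can be sketched in Python; the function name and file layout are assumptions, not part of the original answer:

```python
def count_listed_words(wordlist_path, text_path):
    """Count, in text_path, occurrences of words listed one per line
    in wordlist_path (mirrors the awk two-file pattern above)."""
    with open(wordlist_path) as f:
        wanted = set(f.read().split())
    counts = {}
    with open(text_path) as f:
        for token in f.read().split():
            if token in wanted:
                counts[token] = counts.get(token, 0) + 1
    return counts
```

Like the awk version, words from the list that never appear in the text are simply absent from the result.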
0

It sounds as though the Natural Language Toolkit might have what you need.

http://www.nltk.org/

0

Perhaps you could adapt this multi-scan generator function:

    from itertools import islice

    testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

    def multis(search_sequence, text, start=0):
        """Multisearch: scan text from position start for characters in
        search_sequence, yielding tuples of the text before each found
        sequence item and the item itself."""
        x = ''
        for ch in text[start:]:
            if ch in search_sequence:
                # always yield a tuple (even when x is empty) so callers
                # can reliably unpack (text, separator) pairs
                yield (x, ch)
                x = ''
            else:
                x += ch
        else:
            if x:
                yield x

    # split off the first two sentences by dot/question/exclamation;
    # must save the result of generation before reusing it
    two_sentences = list(islice(multis('.?!', testline), 2))
    print("result of split: ", two_sentences)
    print('\n'.join(sentence.strip() + sep for sentence, sep in two_sentences))
0
