How to speed up word counts with an NLTK PlaintextCorpusReader?

I have a set of documents, and I want to return a list of tuples, where each tuple has the date of a given document and the number of times a search term appears in that document. My code (below) works, but it is slow, and I'm a n00b. Are there any obvious ways to make it faster? Any help would be greatly appreciated, mainly so I can understand the code better, but also so I can finish this project sooner!

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    def searchText(searchword):
        counts = []
        corpus_root = 'some_dir'
        wordlists = PlaintextCorpusReader(corpus_root, '.*')
        for id in wordlists.fileids():
            date = id[4:12]        # the date is embedded in the filename as YYYYMMDD
            month = date[-4:-2]
            day = date[-2:]
            year = date[:4]
            raw = wordlists.raw(id)
            tokens = nltk.word_tokenize(raw)
            text = nltk.Text(tokens)
            count = text.count(searchword)
            counts.append((month, day, year, count))
        return counts
1 answer

If you just need word frequency counts, you don't need to create nltk.Text objects or even use nltk.PlaintextCorpusReader. Instead, go straight to nltk.FreqDist.

    import nltk

    files = list_of_files
    fd = nltk.FreqDist()
    for file in files:
        with open(file) as f:
            text = f.read().lower()   # read() the contents; the file object itself has no lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1         # FreqDist.inc() was removed in NLTK 3; FreqDist is a Counter
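For the original task, the count of a single search term is then a plain lookup (a minimal sketch; searchword stands for the term from the question):

    count = fd[searchword]        # in NLTK 3, a FreqDist returns 0 for unseen words
    print(fd.most_common(10))     # it also supports the collections.Counter API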

Or, if you don't plan to do any further analysis, just use a dict.

    files = list_of_files
    fd = {}
    for file in files:
        with open(file) as f:
            text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1
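As an aside, this try/except pattern is exactly what collections.Counter in the standard library implements for you; a sketch under the same assumptions (files and nltk imported as above):

    from collections import Counter

    fd = Counter()
    for file in files:
        with open(file) as f:
            text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            fd.update(nltk.word_tokenize(sent))   # Counter.update() adds one count per token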

This can be done much more efficiently with generator expressions, but I use for loops for readability.
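For illustration, a generator-expression version might look like the sketch below (assuming list_of_files as above; note the file handles opened inside the expression are only closed when garbage-collected):

    import nltk

    fd = nltk.FreqDist(
        word
        for fname in list_of_files
        for sent in nltk.sent_tokenize(open(fname).read().lower())
        for word in nltk.word_tokenize(sent)
    )

To recover the per-file (month, day, year, count) tuples the question asks for, you would instead build one FreqDist per file inside a loop and append (month, day, year, fd[searchword]) each time.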

