If you just need a word frequency count, you don't need to create nltk.Text objects or even use nltk.corpus.PlaintextCorpusReader. Instead, just go straight to nltk.FreqDist.
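    import nltk

    files = list_of_files  # paths of the text files to count
    fd = nltk.FreqDist()
    for file in files:
        with open(file) as f:
            # read the contents first; a file object has no .lower()
            for sent in nltk.sent_tokenize(f.read().lower()):
                for word in nltk.word_tokenize(sent):
                    fd[word] += 1  # FreqDist.inc() was removed in NLTK 3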
Or, if you don't need any of the analysis methods FreqDist provides, just use a plain dict.
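    import nltk

    files = list_of_files  # paths of the text files to count
    fd = {}
    for file in files:
        with open(file) as f:
            # read the contents first; a file object has no .lower()
            for sent in nltk.sent_tokenize(f.read().lower()):
                for word in nltk.word_tokenize(sent):
                    try:
                        fd[word] = fd[word] + 1
                    except KeyError:
                        fd[word] = 1  # first time this word is seen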
This can be done much more efficiently with generator expressions, but I use for loops for readability.
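For reference, here is a rough sketch of what the generator-expression version might look like. It uses collections.Counter from the standard library instead of a hand-rolled dict. The function name count_words, the tokenize parameter, and the UTF-8 encoding are my own assumptions for illustration, not anything from NLTK. The default tokenizer is plain whitespace splitting so the snippet runs without NLTK installed.

```python
from collections import Counter
from pathlib import Path


def count_words(paths, tokenize=str.split):
    """Count lowercased tokens across all files in paths.

    count_words and tokenize are hypothetical names for this sketch;
    files are assumed to be UTF-8 encoded.
    """
    # Generator expression: each file is read and lowercased only when
    # the consumer asks for it, one file at a time.
    texts = (Path(p).read_text(encoding="utf-8").lower() for p in paths)
    # Nested generator expression: feeds tokens to Counter one by one,
    # without building an intermediate list of all words.
    return Counter(word for text in texts for word in tokenize(text))
```

To get NLTK's tokenization, pass it in, e.g. count_words(list_of_files, tokenize=nltk.word_tokenize). In NLTK 3, FreqDist should accept the same kind of iterable of words, so the Counter can be swapped for nltk.FreqDist if you want its extra methods.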