Extracting a word set using Python / NLTK and then comparing it with a standard English dictionary

I have:

from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]

which is straight from the NLTK manual. What I want to do next is compare this vocabulary with an exhaustive set of English words, such as the OED, and take the difference: the set of Finnegans Wake words that are not, and probably never will be, in the OED. I am much more a verbal person than a mathematically oriented one, so I still have not figured out how to do this, and the manual goes into too much detail about things I don't actually want to do. I assume, however, that this is just one or two lines of code.

If what you're really asking is how to take the difference between one set (your vocabulary) and another (an English dictionary), then

set(vocab) - english_dictionary

gives exactly that: the set of words that are in vocab but not in english_dictionary. (It's a pity if you've already sorted vocab into a list along the way, since you'll need to make a set of it again!)
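A quick hypothetical illustration of that caveat (the tiny word lists are made up): sorted() returns a list, so it has to be wrapped in set() again before the subtraction works:

vocab = sorted(set(['riverrun', 'the', 'a']))   # sorted() hands back a list, not a set
english_dictionary = set(['the', 'a'])
print(set(vocab) - english_dictionary)          # re-wrap in set(); prints a set with just 'riverrun'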

If you don't like the minus-sign syntax for set difference, there's always the .difference method!-)

Edit: per the OP's edit, the actual names are words (not vocab) and englishwords (not english_dictionary), so either

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are ways to express "the set of words that are not englishwords". The former is more concise, the latter perhaps more readable (since it uses the word "difference" explicitly rather than a minus sign) and more flexible (the argument to difference can be any iterable, not necessarily a set, whereas both operands of - must be sets).
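For instance, a tiny made-up example (both word lists are hypothetical) showing that .difference happily accepts a plain list, while - insists on sets on both sides:

words = ['riverrun', 'past', 'eve', 'and', 'adam']
englishwords = ['past', 'eve', 'and', 'adam', 'the']
print(set(words) - set(englishwords))         # both operands must be sets
print(set(words).difference(englishwords))    # the argument may stay a plain list
# both lines print the same one-element set containing 'riverrun'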

If you want a list back, sorted(newwords) gives you one in alphabetical order (list(newwords) would be a little faster, but returns the words in arbitrary order, which might annoy you :-).
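Putting the whole pipeline together, a minimal sketch; the dictionary path (/usr/share/dict/words, a plain one-word-per-line list shipped with many Unix systems) is an assumption standing in for whatever exhaustive English word set you actually use:

import nltk

# tokenize the novel and lowercase every token
raw = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt').read()
words = [w.lower() for w in nltk.wordpunct_tokenize(raw)]

# assumed stand-in for an exhaustive English dictionary
english_dictionary = set(line.strip().lower() for line in open('/usr/share/dict/words'))

# words in Finnegans Wake that the dictionary doesn't know
newwords = set(words).difference(english_dictionary)
for word in sorted(newwords):
    print(word)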
