Change
features['contains(%s)' % word] = (word in document_words)
to
features[word] = (word in document)
Otherwise, the classifier only knows feature names of the form "contains(...)", so it does not recognize any of the words in "i love this city".
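The mismatch can be seen without training anything. The following is only a minimal sketch: the three-word `word_features` list, the `training_doc` set, and the helper `known_features` are hypothetical stand-ins, not part of the original script. It compares the feature names produced by each naming scheme with the names that `bag_of_words` produces for the query sentence.

```python
# Hypothetical vocabulary standing in for the real word_features list.
word_features = ["love", "hate", "city"]


def document_features_prefixed(document_words):
    # Old naming scheme: keys look like 'contains(love)'.
    return {"contains(%s)" % w: (w in document_words) for w in word_features}


def document_features_plain(document_words):
    # New naming scheme: keys are the bare words.
    return {w: (w in document_words) for w in word_features}


def bag_of_words(words):
    return dict([word, True] for word in words)


def known_features(training_features, query_features):
    """Return the query feature names that also occur among the training features."""
    return sorted(set(query_features) & set(training_features))


# Hypothetical training document, already split into a set of words.
training_doc = {"love", "city", "film"}
query = bag_of_words("i love this city".split())

prefixed = document_features_prefixed(training_doc)
plain = document_features_plain(training_doc)

print(known_features(prefixed, query))  # [] : no shared feature names
print(known_features(plain, query))     # ['city', 'love']
```

With the prefixed names there is no overlap at all, so nothing in the query sentence corresponds to a feature seen during training. With bare words as keys, "love" and "city" line up.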
With that change, the script (excerpt):

import nltk.tokenize as tokenize
import nltk
import random

random.seed(3)


def bag_of_words(words):
    return dict([word, True] for word in words)


def document_features(document):
    # word_features is assumed to be defined elsewhere in the script
    features = {}
    for word in word_features:
        features[word] = (word in document)
    return features
gives
Most Informative Features
       worst = True      neg : pos    =     15.5 : 1.0
  ridiculous = True      neg : pos    =     11.5 : 1.0
      batman = True      neg : pos    =      7.6 : 1.0
       drive = True      neg : pos    =      7.6 : 1.0
       blame = True      neg : pos    =      7.6 : 1.0
    terrible = True      neg : pos    =      6.9 : 1.0
      rarely = True      pos : neg    =      6.4 : 1.0
     cliches = True      neg : pos    =      6.0 : 1.0
           $ = True      pos : neg    =      5.9 : 1.0
   perfectly = True      pos : neg    =      5.5 : 1.0
probability 'love' is positive: 61.52%
probability 'hate' is positive: 36.71%
i love this city => pos
i hate this city => neg
unutbu