The NLTK Naive Bayes implementation doesn't do that, but you can combine the NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method gives you a conditional probability distribution over classes given the words in the document, i.e. P(cl|doc). What you want is P(cl|doc,len): the probability of a class given the words in the document and its length. If we make a few more independence assumptions, we get:
    P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
                  = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
                  = (P(doc|cl) * P(cl)) / P(doc) * P(len|cl) / P(len)
                  = P(cl|doc) * P(len|cl) / P(len)
You already have the first term from prob_classify, so all that remains is to estimate P(len|cl) and P(len).
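As a minimal sketch of that last line of the derivation, the function below multiplies P(cl|doc) by P(len|cl) / P(len) for each class. The function name and all the numbers are made up for illustration. With NLTK, the first dictionary could be built from the result of prob_classify, e.g. `{cl: dist.prob(cl) for cl in dist.samples()}`. The final renormalization is my own addition: the independence assumptions are only approximations, so the raw products need not sum to 1.

```python
def combine_with_length(p_cl_given_doc, p_len_given_cl, p_len):
    """Approximate P(cl|doc,len) as P(cl|doc) * P(len|cl) / P(len), renormalized."""
    scores = {
        cl: p_cl_given_doc[cl] * p_len_given_cl[cl] / p_len
        for cl in p_cl_given_doc
    }
    total = sum(scores.values())
    return {cl: score / total for cl, score in scores.items()}


# Hypothetical values, for illustration only
p_cl_given_doc = {"pos": 0.6, "neg": 0.4}      # from the classifier
p_len_given_cl = {"pos": 0.002, "neg": 0.001}  # from per-class length models
p_len = 0.0015                                 # from the overall length model

posterior = combine_with_length(p_cl_given_doc, p_len_given_cl, p_len)
```

Note that once you renormalize, P(len) cancels out, so it only matters if you want the unnormalized values.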
You can get as fancy as you want when it comes to modeling document lengths, but to get started you can simply assume that the logs of the document lengths are normally distributed. If you know the mean and standard deviation of the log document lengths in each class and overall, then it is easy to calculate P(len|cl) and P(len).
Here is one way to estimate P(len):
    from math import log
    import numpy as np
    from scipy.stats import norm
    from nltk.corpus import movie_reviews

    # Log of each document's length in words
    loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]
    mu = np.mean(loglens)
    sd = np.std(loglens)
    p = norm(mu, sd)  # normal distribution over log lengths
The only thing to remember is that this is a distribution over log lengths, not lengths, and that it is a continuous distribution. So the probability of a document of length L is:
    p.cdf(log(L+1)) - p.cdf(log(L))
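To check that this expression behaves like a proper probability over integer lengths, here is a small self-contained sketch. It uses the standard library's statistics.NormalDist in place of scipy.stats.norm so that it runs without extra dependencies. The helper name `length_prob` and the document lengths are hypothetical.

```python
from math import log
from statistics import NormalDist, mean, pstdev


def length_prob(dist, length):
    """Mass of the log-length normal between log(length) and log(length + 1)."""
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))


# Hypothetical document lengths in words, for illustration only
lengths = [120, 340, 560, 800, 1500]
loglens = [log(n) for n in lengths]
dist = NormalDist(mean(loglens), pstdev(loglens))

p_500 = length_prob(dist, 500)

# The per-length probabilities telescope, so over a wide range they sum to about 1
total = sum(length_prob(dist, n) for n in range(1, 100001))
```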
The conditional length distributions can be estimated in the same way, using the log lengths of the documents in each class. This should give you what you need for P(cl|doc,len).
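A rough sketch of the per-class step, again using statistics.NormalDist rather than scipy so that it is self-contained. The function names and the (class, length) pairs are made up. With the movie_reviews corpus you could build the pairs by looping over movie_reviews.categories() and movie_reviews.fileids(cat).

```python
from collections import defaultdict
from math import log
from statistics import NormalDist, mean, pstdev


def fit_length_models(labeled_lengths):
    """Fit a normal over log lengths for each class, plus one over all documents."""
    logs_by_class = defaultdict(list)
    for cl, n in labeled_lengths:
        logs_by_class[cl].append(log(n))
    per_class = {
        cl: NormalDist(mean(logs), pstdev(logs))
        for cl, logs in logs_by_class.items()
    }
    all_logs = [x for logs in logs_by_class.values() for x in logs]
    overall = NormalDist(mean(all_logs), pstdev(all_logs))
    return per_class, overall


def length_prob(dist, length):
    """Mass of the log-length normal between log(length) and log(length + 1)."""
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))


# Hypothetical (class, length in words) pairs, for illustration only
data = [("pos", 700), ("pos", 900), ("pos", 1200),
        ("neg", 300), ("neg", 450), ("neg", 600)]
per_class, overall = fit_length_models(data)

# P(len|cl) / P(len) for a 1000-word document: the factor that rescales P(cl|doc)
ratios = {
    cl: length_prob(dist, 1000) / length_prob(overall, 1000)
    for cl, dist in per_class.items()
}
```

In this toy data the "pos" documents are longer, so a 1000-word document gets a larger factor for "pos" than for "neg".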
rmalouf