Using document length in the Naive Bayes classifier for NLTK Python

I am building a spam filter with NLTK in Python. At the moment I check for the presence of words and use the NaiveBayesClassifier, which gives an accuracy of 0.98 and an F-measure of 0.92 for spam and 0.98 for non-spam. However, when I look at the documents my program gets wrong, I notice that many of the spam messages misclassified as non-spam are very short.

So I want to add the length of the document as a feature for the NaiveBayesClassifier. The problem is that it currently only handles binary values. Is there a way to do this other than, for example, length < 100 = true/false?

(P.S. I built the spam detector along the lines of http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)

2 answers

NLTK's Naive Bayes implementation does not support this directly, but you can combine the NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method gives you the conditional probability distribution over classes given the words in the document, i.e. P(cl|doc). What you want is P(cl|doc,len): the probability of the class given both the words in the document and its length. If we make a few more independence assumptions, we get:

P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
              = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
              = (P(doc|cl) * P(cl)) / P(doc) * P(len|cl) / P(len)
              = P(cl|doc) * P(len|cl) / P(len)

You already have the first term from prob_classify, so all that remains is to estimate P(len|cl) and P(len).
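As a minimal sketch of that last step, assume you already have the per-class word posterior (for example from prob_classify) and some estimate of P(len|cl). The function name and the numbers below are made up for illustration. Since P(len) is the same for every class, it can be dropped and the scores renormalized so that they sum to 1:

```python
def combine_with_length(word_posterior, length_likelihood):
    """Rescale P(cl|doc) by P(len|cl) and renormalize over classes.

    word_posterior: dict mapping class -> P(cl|doc)
    length_likelihood: dict mapping class -> P(len|cl)
    Both dicts are hypothetical inputs. P(len) is shared by all
    classes, so it cancels in the renormalization.
    """
    scores = {cl: word_posterior[cl] * length_likelihood[cl]
              for cl in word_posterior}
    total = sum(scores.values())
    if total == 0:
        # Degenerate case: every product is zero, so keep the word posterior.
        return dict(word_posterior)
    return {cl: s / total for cl, s in scores.items()}


# Illustrative numbers only: the words lean towards "ham", but the
# message length is far more typical of spam.
posterior = combine_with_length(
    {"spam": 0.4, "ham": 0.6},
    {"spam": 0.03, "ham": 0.005},
)
print(posterior)
```

With these invented inputs the length evidence outweighs the word evidence, which is the effect you are after for very short spam messages.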

You can get as fancy as you want when modeling document lengths, but for starters you can simply assume that the logs of the document lengths are normally distributed. If you know the mean and standard deviation of the log document length in each class and overall, then it is easy to compute P(len|cl) and P(len).

Here is one way to estimate P(len):

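    from math import log

    import numpy as np
    from scipy.stats import norm
    from nltk.corpus import movie_reviews

    # Log of the word count of every document in the corpus
    loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]
    mu = np.mean(loglens)   # mean of the log-lengths
    sd = np.std(loglens)    # standard deviation of the log-lengths
    p = norm(mu, sd)        # normal distribution over log-lengths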

The only things to remember are that this is a distribution over log-lengths, not lengths, and that it is continuous. The probability of a document of length L is therefore:

 p.cdf(log(L+1)) - p.cdf(log(L)) 

The class-conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. This should give you what you need for P(cl|doc,len).
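A rough sketch of the per-class estimate could look like the following. It swaps scipy.stats.norm for statistics.NormalDist from the standard library (Python 3.8+) so that it runs without extra packages. The helper names and the word counts are invented for illustration, not taken from any real corpus:

```python
from math import log
from statistics import NormalDist, mean, pstdev


def fit_log_length(lengths):
    """Fit a normal distribution to the logs of the given lengths."""
    loglens = [log(n) for n in lengths]
    return NormalDist(mean(loglens), pstdev(loglens))


def length_prob(dist, length):
    """Probability mass the fitted distribution assigns to one length."""
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))


# Made-up word counts per class, for illustration only.
lengths_by_class = {
    "spam": [5, 8, 12, 7, 20, 9],
    "ham": [40, 65, 120, 80, 55, 200],
}
length_dists = {cl: fit_log_length(ls) for cl, ls in lengths_by_class.items()}

# P(len = 10 | cl) for each class
p_len_given_cl = {cl: length_prob(d, 10) for cl, d in length_dists.items()}
print(p_len_given_cl)
```

Fitting the same function to the lengths of all documents, regardless of class, gives the overall P(len).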


There are multinomial Naive Bayes algorithms that can handle range values, but they are not implemented in NLTK. With NLTK's NaiveBayesClassifier, you could try using several length thresholds as binary features. I would also suggest trying the Maxent classifier to see how it handles shorter texts.
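The threshold idea might be sketched as below. The function name, the feature names, and the thresholds (10, 50, 100) are arbitrary placeholders. The returned dict has the featureset shape that NaiveBayesClassifier.train takes in its list of (featureset, label) pairs:

```python
def document_features(words, thresholds=(10, 50, 100)):
    """Build a feature dict with word-presence and length-threshold features.

    The thresholds are arbitrary placeholders; tune them on your own data.
    """
    features = {"contains({})".format(w.lower()): True for w in words}
    for t in thresholds:
        # One binary feature per threshold: is the document shorter than t words?
        features["length<{}".format(t)] = len(words) < t
    return features


short_msg = "win cash now".split()
print(document_features(short_msg))
```

Using several thresholds rather than one lets the classifier distinguish very short, short, and medium-length messages while still seeing only binary features.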

