Length normalization in a naive Bayes classifier for documents

I am trying to implement a naive Bayes classifier to classify documents that are essentially sets (as opposed to bags) of features, i.e. each document contains a set of unique features, each of which can appear at most once in the document. For example, you can think of the features as unique keywords for documents.

I have followed Rennie et al. closely (http://www.aaai.org/Papers/ICML/2003/ICML03-081.pdf), but I ran into a problem that does not seem to be addressed there. Namely, classifying short documents results in much higher posterior probabilities because those documents have fewer features; the opposite holds for long documents.

This is because the posterior probabilities are defined as (ignoring the denominator):

 P(class|document) = P(class) * P(document|class) 

which expands to

 P(class|document) = P(class) * P(feature1|class) * ... * P(featureK|class) 

It is clear from this that shorter documents with fewer features will have higher posterior probabilities simply because fewer terms are multiplied together.

For example, suppose the features "foo", "bar", and "baz" all appear in positive training observations. Then a document with the single feature "foo" will have a higher posterior probability of being classified as positive than a document with the features {"foo", "bar", "baz"}. This seems counter-intuitive, but I am not sure how to solve it.
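For concreteness, here is a minimal Python sketch of the effect I mean (all probability values are made up purely for illustration):

    # Minimal sketch of the effect: with every P(feature|class) < 1, each additional
    # feature multiplies the score down, so the one-feature document gets a larger
    # raw "posterior". All probability values are made up purely for illustration.

    p_positive = 0.5
    p_feature_given_positive = {"foo": 0.4, "bar": 0.3, "baz": 0.2}

    def raw_score(features):
        score = p_positive
        for f in features:
            score *= p_feature_given_positive[f]
        return score

    print(raw_score({"foo"}))                # 0.2
    print(raw_score({"foo", "bar", "baz"}))  # 0.012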

Is there some kind of length normalization that can be done? One idea is to add the size of the document as a feature, but that does not seem quite right, since the results would then be skewed by the sizes of the documents in the training data.

1 answer

That's a good question; off-hand, I am not sure there is actually a problem here. The posterior probability simply gives you the probability of each class given a document (i.e. the per-class probabilities for that document). When classifying a document, you only compare the posteriors for that same document, so the number of features does not change (since you are not comparing across documents), i.e.:

 P(class1|document) = P(class1) * P(feature1|class1) * ... * P(featureK|class1)
 ...
 P(classN|document) = P(classN) * P(feature1|classN) * ... * P(featureK|classN)

The class with the highest posterior is then assigned as the label for the document. Since the number of features depends on the document rather than on the class, there is no need to normalize for length.
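As a minimal sketch of this (reusing the made-up, purely illustrative values from the question's example), classification only ever compares classes for one fixed set of features:

    import math

    # Illustrative sketch: for a fixed document, every class's score is a product
    # over the SAME K features, so document length affects all classes equally and
    # drops out of the argmax. All probability values are made up for illustration.

    priors = {"pos": 0.5, "neg": 0.5}
    likelihoods = {
        "pos": {"foo": 0.4, "bar": 0.3, "baz": 0.2},
        "neg": {"foo": 0.1, "bar": 0.2, "baz": 0.3},
    }

    def log_score(features, cls):
        # log P(class) + sum of log P(feature|class); log space avoids underflow
        return math.log(priors[cls]) + sum(math.log(likelihoods[cls][f]) for f in features)

    def classify(features):
        # compare the posteriors of all classes for the SAME document
        return max(priors, key=lambda cls: log_score(features, cls))

    print(classify({"foo"}))
    print(classify({"foo", "bar", "baz"}))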

Am I missing something? If you want to do more than classify, for example if you want to compare the most likely documents for a particular class, then you would need to use the actual definition of the posterior probability:

 P(class1|document) = P(class1) * P(feature1|class1) * ... * P(featureK|class1) / Sum_over_all_numerators

And this is correctly normalized across documents with different numbers of features.
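A minimal sketch of that normalization (again with the same made-up illustrative values, computed in log space for numerical stability):

    import math

    # Sketch of the normalized posterior: each class's numerator is divided by the
    # sum of the numerators over all classes (done in log space for stability).
    # Priors and likelihoods are the same made-up illustrative values as above.

    priors = {"pos": 0.5, "neg": 0.5}
    likelihoods = {
        "pos": {"foo": 0.4, "bar": 0.3, "baz": 0.2},
        "neg": {"foo": 0.1, "bar": 0.2, "baz": 0.3},
    }

    def posteriors(features):
        log_scores = {cls: math.log(priors[cls])
                           + sum(math.log(likelihoods[cls][f]) for f in features)
                      for cls in priors}
        m = max(log_scores.values())          # log-sum-exp trick
        total = sum(math.exp(s - m) for s in log_scores.values())
        return {cls: math.exp(s - m) / total for cls, s in log_scores.items()}

    print(posteriors({"foo"}))                # sums to 1 for the one-feature document
    print(posteriors({"foo", "bar", "baz"}))  # sums to 1 for the three-feature document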

