I am trying to implement a naive Bayes classifier to classify documents, which are essentially sets (as opposed to bags) of features, i.e. each document contains a set of unique features, each of which can appear at most once in a document. For example, you can think of the features as unique keywords for the documents.
I have followed Rennie et al. (http://www.aaai.org/Papers/ICML/2003/ICML03-081.pdf) fairly closely, but I have run into a problem that does not seem to be addressed there. Namely, classifying short documents yields much higher posterior probabilities simply because they have fewer features; the opposite holds for long documents.
This is because the posterior probabilities are defined as (ignoring the denominator):
P(class|document) = P(class) * P(document|class)
which expands to
P(class|document) = P(class) * P(feature1|class) * ... * P(featureK|class)
It is clear from this that shorter documents, having fewer features, will end up with higher posterior probabilities simply because fewer terms are multiplied together.
For example, suppose the features "foo", "bar", and "baz" all appear in the positive training observations. Then a document with the single feature "foo" will have a higher posterior probability of belonging to the positive class than a document with the features {"foo", "bar", "baz"}. This seems counterintuitive, but I am not sure how to resolve it.
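Here is a minimal numeric sketch of what I mean (the probabilities are made up purely for illustration):

    # Toy illustration: same prior, same per-feature likelihood, different lengths.
    p_class = 0.5                 # hypothetical P(positive class)
    p_feature_given_class = 0.8   # hypothetical P(feature|positive) for every feature

    def unnormalized_posterior(num_features):
        # P(class) * product of P(feature_i|class) over the document's features
        return p_class * (p_feature_given_class ** num_features)

    print(unnormalized_posterior(1))  # {"foo"}               -> 0.4
    print(unnormalized_posterior(3))  # {"foo","bar","baz"}   -> 0.256

So the one-feature document scores higher even though the three-feature document matches more of the positive evidence.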
Is there some kind of length normalization that can be done? One idea is to add the document size as a feature, but that does not seem quite right, since the results would then be skewed by the document sizes in the training data.
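To make concrete what I mean by length normalization, here is a rough sketch of one possibility I have considered (the function names and likelihood values are just for illustration, and I do not know whether this is statistically sound): replace the product of per-feature likelihoods with their geometric mean, i.e. divide the log-likelihood part by the number of features.

    import math

    def log_posterior(doc_features, class_prior, feature_likelihoods):
        # Unnormalized log-posterior: log P(class) + sum of log P(feature|class)
        return math.log(class_prior) + sum(
            math.log(feature_likelihoods[f]) for f in doc_features
        )

    def length_normalized_score(doc_features, class_prior, feature_likelihoods):
        # Divide the likelihood part by the number of features, so documents
        # of different lengths are scored on a comparable per-feature scale.
        likelihood = sum(math.log(feature_likelihoods[f]) for f in doc_features)
        return math.log(class_prior) + likelihood / max(len(doc_features), 1)

    # Made-up likelihoods for the positive class, just to show the call:
    likelihoods_pos = {"foo": 0.8, "bar": 0.7, "baz": 0.6}
    print(length_normalized_score({"foo"}, 0.5, likelihoods_pos))
    print(length_normalized_score({"foo", "bar", "baz"}, 0.5, likelihoods_pos))

Is something along these lines a reasonable way to handle the length effect, or is there a better-established approach?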