Handling Missing Attributes in the Naive Bayes Classifier

I am writing a Naive Bayes classifier to perform indoor localization from Wi-Fi signal strength. So far it works well, but I have some questions about missing features. This happens often because I use Wi-Fi signals, and Wi-Fi access points are simply not available everywhere.

Question 1. Suppose I have two classes, Apple and Banana, and I want to classify a test instance T1, as shown below.

[Image: table of the features observed for the Apple and Banana classes, and the features present in test instance T1]

I understand how the Naive Bayes classifier works. Below is the formula I use, taken from the Wikipedia article on the classifier. I use uniform prior probabilities P(C = c), so I omit that term in my implementation.

classify(f_1, ..., f_n) = argmax_c P(C = c) * prod_{i=1..n} p(F_i = f_i | C = c)

Now, when I calculate the right-hand side of the equation and compare the class probabilities, which set of features do I use? Test instance T1 has features 1, 3, and 4, but the two classes do not both have all of these features. So when I run my loop to compute the product of probabilities, I see several options for what to loop over:

  • Loop over the union of all features seen in training, namely features 1, 2, 3, 4. Since test instance T1 does not have feature 2, use an artificial tiny probability for it.
  • Loop over only the features of the test instance, namely 1, 3, and 4.
  • Loop over the features available for each class. To compute the conditional probability for Apple I would use features 1, 2, and 3, and for Banana I would use 2, 3, and 4.

Which of the above should I use?
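For concreteness, here is roughly what option 1 would look like in Java-like code. The Instance type, the getProbability helper, and the tiny fallback constant are hypothetical placeholders, not my actual implementation:

static final double TINY_PROBABILITY = 1e-9;  // artificial stand-in for a feature the test instance lacks

double scoreClass(ClassLabel classLabel, Instance testInstance, Set<Feature> allTrainingFeatures) {
    double logProbabilitySum = 0.0;
    // Option 1: loop over the union of all features seen in training (features 1, 2, 3, 4)
    for (Feature feature : allTrainingFeatures) {
        double p = testInstance.hasFeature(feature)
                ? getProbability(classLabel, feature, testInstance.getValue(feature))
                : TINY_PROBABILITY;  // e.g. feature 2 is missing from T1
        logProbabilitySum += Math.log(p);
    }
    return logProbabilitySum;
}

Options 2 and 3 would differ only in which feature set the inner loop runs over.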

Question 2. Let's say I want to classify a test instance T2, where T2 has a feature not found in any of the classes. I use log probabilities to avoid underflow, but I'm not sure about the details of the loop. I am doing something like this (in Java-like pseudocode):

double bestLogProbability = Double.NEGATIVE_INFINITY;
ClassLabel bestClassLabel = null;
for (ClassLabel classLabel : allClassLabels) {
    double logProbabilitySum = 0.0;
    for (Feature feature : allFeatures) {
        // null means no probability estimate exists for this feature under this class
        Double logProbability = getLogProbability(classLabel, feature);
        if (logProbability != null) {
            logProbabilitySum += logProbability;
        }
    }
    if (logProbabilitySum > bestLogProbability) {
        bestLogProbability = logProbabilitySum;
        bestClassLabel = classLabel;
    }
}

The problem is that if none of the classes has any of the test instance's features (feature 5 in the example), then logProbabilitySum remains 0.0, which leads to a best log probability of 0.0, i.e. a linear probability of 1.0, which is clearly incorrect. What is the best way to handle this?

+6
2 answers

For the Naive Bayes classifier, the right-hand side of your equation should iterate over all the attributes. If you have attributes that are sparsely populated, the usual way to handle this is to use an m-estimate of the probability, which uses an equivalent sample size to calculate your probabilities. This prevents the conditional probabilities from going to zero when an attribute value is missing from your training data. Do a web search on those two terms (m-estimate, equivalent sample size) and you will find numerous descriptions of the formula. A good reference text describing it is Machine Learning by Tom Mitchell. The basic formula is

P_i = (n_i + m * p_i) / (n + m)

n_i is the number of training instances where the attribute has value f_i, n is the number of training instances (with the current classification), m is the equivalent sample size, and p_i is the prior probability for f_i. If you set m = 0, this just reverts to the standard probability values (which may be zero for missing attribute values). As m becomes very large, P_i approaches p_i (i.e., the probability is dominated by the prior). If you do not have a prior probability to use, just make it 1/k, where k is the number of attribute values.
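As a concrete illustration (a minimal sketch, not code from this answer), the m-estimate translates directly into Java:

/**
 * m-estimate of probability: P_i = (n_i + m * p_i) / (n + m).
 *
 * nI     - number of training instances of this class where the attribute has value f_i
 * n      - total number of training instances of this class
 * m      - equivalent sample size (m = 0 gives the plain relative frequency)
 * priorP - prior probability p_i, e.g. 1.0 / k for k possible attribute values
 */
static double mEstimate(int nI, int n, double m, double priorP) {
    return (nI + m * priorP) / (n + m);
}

For example, mEstimate(0, 20, 5, 0.25) returns 0.05: a value never observed for that class still gets a small nonzero probability instead of zero.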

If you use this approach, then for your T2 instance, whose attributes are not present in the training data, the result will simply be whichever class occurs most often in the training data. This makes sense, because there is no relevant information in the training data from which you could make a better decision.

+6

I would be tempted to simply ignore any features that were not observed for all classes during training. If you do otherwise, you are essentially hallucinating data and then treating it on a par with data that really existed at the classification stage. So my simple answer to question 1 would be to make the decision based on feature 3 alone (you just don't have enough information to do anything else). This is part of what the m-estimate mentioned by @bogatron does.
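To make that concrete, here is a small sketch (my own illustration, using the Gaussian per-class feature model implied by the mu_i, sigma_i below) of scoring the two classes on the single shared feature:

// Log-density of a Gaussian N(mu, sigma^2) at x, used as the per-feature log-likelihood.
static double logGaussian(double x, double mu, double sigma) {
    double z = (x - mu) / sigma;
    return -0.5 * z * z - Math.log(sigma) - 0.5 * Math.log(2 * Math.PI);
}

// With uniform class priors, classify T1 on feature 3 alone:
// choose Apple if logGaussian(x3, appleMu3, appleSigma3) > logGaussian(x3, bananaMu3, bananaSigma3),
// otherwise choose Banana.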

There is a more sophisticated answer to this for training classes with missing features, but it takes a lot more work. The m-estimate is really a point estimate of the posterior distribution over p_i (which in your case is mu_i, sigma_i) given your training data, combining a prior over p_i with the likelihood function p(data | p_i), whose maximum-likelihood estimate is the fraction n_i / n. In the case where you observe no data at all, you can essentially fall back on the prior for a sensible distribution of that feature.

Now, how do you estimate that prior? Well, if the number of classes missing the feature value is small compared with the number that have it, you can infer the prior's parameters from the classes that do have data, and treat the predictive distribution for the classes without data as simply that prior (for the classes with data, your predictive distribution is the posterior). A useful pointer: since you seem to be assuming your data are normally distributed (or at least characterized by their mean and standard deviation), the prior on the mean should probably also be normal, for conjugacy. I would probably avoid trying to infer a prior over your standard deviations as well, since that gets fiddly if you are new to this.
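As an illustrative sketch only (this is the standard conjugate normal update, not code from this answer): with a known observation noise sigma and a normal prior N(mu0, tau0^2) on a feature's mean,

/**
 * Posterior mean of mu under the prior N(mu0, tau0^2), given n observations with
 * sample mean sampleMean and known observation standard deviation sigma.
 * With no observations (n = 0) this falls back to the prior mean mu0.
 */
static double posteriorMean(double mu0, double tau0, double sigma, int n, double sampleMean) {
    if (n == 0) {
        return mu0;  // no data: the predictive distribution is just the prior
    }
    double priorPrecision = 1.0 / (tau0 * tau0);
    double dataPrecision = n / (sigma * sigma);
    return (priorPrecision * mu0 + dataPrecision * sampleMean) / (priorPrecision + dataPrecision);
}

In your example, centring that prior on the Apple class's mu_1 is exactly the fallback described in the next paragraph.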

Note that this only makes sense if you have enough classes with observations for that feature that the fraction missing it is small. In particular, in your example you have only one class with observations, so the best you could do for feature 1 in the Banana class would be to assume that the uncertainty about mu_1 is represented by a distribution centred on the Apple class's mu_1, with some arbitrary variance. Or you could assume the means are equal, in which case it would have no effect on the decision and you might as well have ignored it!

So, unfortunately, the answer to your question 2 is that your code is doing the right thing. If your new test instance has only features that were never observed during training, how could you hope to choose a class for it? You can do no better than choosing according to the prior.

+1
