Semi-supervised Naive Bayes with NLTK

Question


I built a semi-supervised version of NLTK's Naive Bayes classifier in Python, based on the EM (expectation-maximization) algorithm. However, in some iterations of EM I get negative log-likelihoods (the log-likelihood of EM should be positive at every iteration), so I believe there must be an error in my code. After carefully going over my code, I have no idea why this happens. I would greatly appreciate it if anyone could spot a mistake in the code below:

(Reference material on semi-supervised Naive Bayes)

The main loop of the EM algorithm

    #initial assumptions:
    #Bernoulli NB: only feature presence (value 1) or absence (value None) is computed

    #initial data:
    #C: classifier trained with labeled data
    #labeled_data: an array of tuples (feature dic, label)
    #features: dictionary that outputs feature dictionary for a given document id
    for iteration in range(1, self.maxiter):

        #Expectation: compute probabilities for each class for each unlabeled document
        #An array of tuples (feature dictionary, probability dist) is built
        unlabeled_data = [(features[id],C.prob_classify(features[id])) for id in U]

        #Maximization: given the probability distributions of previous step,
        #update label, feature-label counts and update classifier C
        #gen_freqdists is a custom function, see below
        #gen_probdists is the original NLTK function
        l_freqdist_act,ft_freqdist_act, ft_values_act = self.gen_freqdists(labeled_data,unlabeled_data)
        l_probdist_act, ft_probdist_act = self.gen_probdists(l_freqdist_act, ft_freqdist_act, ft_values_act, ELEProbDist)
        C = nltk.NaiveBayesClassifier(l_probdist_act, ft_probdist_act)

        #Compute log-likelihood
        #NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
        #for labeled data, sum logprobs output by the classifier for the label
        #for unlabeled data, sum logprobs output by the classifier for each label
        log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
        log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])

        #Continue until convergence
        if log_lh_old == "first":
            if self.debug: print "\tM: #iteration 1",log_lh,"(FIRST)"
            log_lh_old = log_lh
        else:
            log_lh_diff = log_lh - log_lh_old
            if self.debug: print "\tM: #iteration",iteration,log_lh_old,"->",log_lh,"(",log_lh_diff,")"
            if log_lh_diff < self.log_lh_diff_min: break
            log_lh_old = log_lh

The custom gen_freqdists function, used to create the necessary frequency distributions

    def gen_freqdists(self, instances_l, instances_ul):
        l_freqdist = FreqDist() #frequency distrib. of labels
        ft_freqdist= defaultdict(FreqDist) #dictionary of freq. distrib. for ft-label pairs
        ft_values = defaultdict(set) #dictionary of possible values for each ft (only 1/None)
        fts = set() #set of all fts

        #counts for labeled data
        for (ftdic,label) in instances_l:
            l_freqdist.inc(label,1)
            for f in ftdic.keys():
                fts.add(f)
                ft_freqdist[label,f].inc(1,1)
                ft_values[f].add(1)

        #counts for unlabeled data
        #we must compute maximum a posteriori label estimate
        #and update label/ft occurrences accordingly
        for (ftdic,probs) in instances_ul:
            map_l = probs.max() #label with highest probability
            map_p = probs.prob(map_l) #probability of map_l
            l_freqdist.inc(map_l,count=map_p)
            for f in ftdic.keys():
                fts.add(f)
                ft_freqdist[map_l,f].inc(1,count=map_p)
                ft_values[f].add(1)

        #features not appearing in documents get implicit None values
        for l in l_freqdist.samples():
            num_samples = l_freqdist[l]
            for f in fts:
                count = ft_freqdist[l,f].N()
                ft_freqdist[l,f].inc(None, num_samples-count)
                ft_values[f].add(None)

        #return computed frequency distributions
        return l_freqdist, ft_freqdist, ft_values


python unsupervised-learning machine-learning nltk bayesian

SUP Oct 23 '12 at 13:55


1 answer

seggy · Accepted Answer · 2012-10-23T14:42:50+0000

I think you are summing the wrong values.

This is your code that is supposed to calculate the sum of the log probabilities:

    #Compute log-likelihood
    #NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
    #for labeled data, sum logprobs output by the classifier for the label
    #for unlabeled data, sum logprobs output by the classifier for each label
    log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
    log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])

According to the NLTK documentation, prob_classify (on NaiveBayesClassifier) returns a ProbDistI object (not logprob(class) + logprob(doc|class)). Once you have this object, you call its prob method for a given label, which gives you a plain probability rather than a log probability. You probably want to call logprob instead, and negate the value it returns.
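To make the difference concrete, here is a minimal sketch that does not use NLTK at all. Plain dictionaries stand in for the ProbDistI objects, and the names `posteriors`, `sum_of_probs` and `sum_of_logprobs` are made up for the illustration. It assumes logprob behaves like a base-2 logarithm of prob, and it only shows the prob versus logprob distinction, not a full log-likelihood computation.

```python
import math

# Hypothetical stand-in for what prob_classify returns: a plain dict
# mapping label -> P(label | document), paired with the label to score.
posteriors = [
    ({"pos": 0.9, "neg": 0.1}, "pos"),
    ({"pos": 0.2, "neg": 0.8}, "neg"),
    ({"pos": 0.6, "neg": 0.4}, "pos"),
]

def sum_of_probs(data):
    """Mirrors the question's code: adds up raw probabilities."""
    return sum(dist[label] for dist, label in data)

def sum_of_logprobs(data):
    """Adds up base-2 log probabilities, analogous to using logprob."""
    return sum(math.log2(dist[label]) for dist, label in data)

prob_total = sum_of_probs(posteriors)        # positive, but not a log value
logprob_total = sum_of_logprobs(posteriors)  # never greater than zero
neg_logprob_total = -logprob_total           # negated, as suggested above

print(prob_total, logprob_total, neg_logprob_total)
```

Each log probability is at most zero, so their sum is non-positive and only the negated sum is non-negative. A sum of raw probabilities, by contrast, has no particular reason to move in one direction from one EM iteration to the next.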
