A quick search turns up http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec06/node6.html: "In this case, the problem of finding the optimal set of parameters $\Theta^{\ast}$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM technique (Expectation-Maximization), can be used heuristically to find a solution to the problem." So an EM variant guaranteed to find a global optimum in polynomial time would prove P = NP, which is unknown and quite possibly false; such a variant therefore almost certainly does not exist.
This problem is almost certainly not convex, because there will typically be multiple globally optimal solutions with the same score. Given any proposed solution, which in general specifies a probability distribution over observations conditioned on the hidden state, you can, for example, relabel hidden state 0 as hidden state 1 and vice versa, adjusting the probability distributions accordingly, so that the observable behaviour generated by the two different solutions is identical. Thus, if there are N hidden states, there are at least N! optima created by permuting the hidden states.
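The permutation symmetry is easy to check numerically: relabelling the hidden states of an HMM (permuting the initial, transition, and emission parameters consistently) leaves the likelihood of any observation sequence unchanged. A minimal sketch with numpy, using a made-up 2-state toy model (all numbers are illustrative, not from the original post):

```python
import numpy as np

def hmm_likelihood(pi, A, B, obs):
    """P(obs) via the forward algorithm.
    pi: (N,) initial state probs, A: (N, N) transitions, B: (N, M) emissions."""
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij b_j(o)
    return alpha.sum()

# Toy 2-state, 2-symbol HMM (hypothetical parameters).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
obs = [0, 1, 1, 0, 1]

# Swap the labels of the two hidden states everywhere.
perm = [1, 0]
pi_p = pi[perm]
A_p = A[np.ix_(perm, perm)]
B_p = B[perm]

lik = hmm_likelihood(pi, A, B, obs)
lik_perm = hmm_likelihood(pi_p, A_p, B_p, obs)
print(np.isclose(lik, lik_perm))  # True: the relabelled model is observationally identical
```

With N states, every one of the N! consistent relabellings scores identically, so the likelihood surface cannot be convex with a unique maximizer.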
Another EM application: https://www.math.ias.edu/csdm/files/11-12/amoitra_disentangling_gaussians.pdf presents an algorithm for finding a globally optimal Gaussian mixture model. It observes that the EM algorithm is often used for this problem, but points out that EM is not guaranteed to find a global optimum, and it does not cite as related work any version of EM that is (it also notes that EM can be slow).
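The guarantee EM does give for Gaussian mixtures is weaker than global optimality: each iteration can only increase the log-likelihood, so it converges monotonically to some stationary point that depends on the initialization. A minimal 1-D, 2-component sketch in plain numpy (synthetic data and starting means are assumptions, chosen only to make the run reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussians (hypothetical parameters).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def em_gmm(x, mu, n_iter=50):
    """Plain EM for a 1-D, 2-component Gaussian mixture.
    Returns the per-iteration log-likelihoods, which are non-decreasing."""
    w = np.array([0.5, 0.5])        # mixing weights
    sigma = np.array([1.0, 1.0])    # component std deviations
    lls = []
    for _ in range(n_iter):
        # E-step: weighted component densities and responsibilities.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        lls.append(np.log(dens.sum(axis=1)).sum())
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return np.array(lls)

lls = em_gmm(x, mu=np.array([-1.0, 1.0]))
print(np.all(np.diff(lls) >= -1e-8))  # True: monotone ascent, to a local optimum only
```

Monotone ascent is exactly why EM is popular and exactly why it gets stuck: nothing forces the stationary point it reaches to be the global one, which is the gap the paper's algorithm closes.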
If you are trying to do some kind of likelihood ratio test between, say, a 4-state model and a 5-state model, it would clearly be embarrassing if, because of local optima, the 5-state model ended up with a lower likelihood than the 4-state model. One way to avoid this, or to recover from it, is to run EM for the 5-state model from a starting point very close to the best 4-state model found. For example, you can create a fifth state with epsilon probability, whose output distribution is the average of the output distributions of the 4 existing states, keep those 4 distributions as the other 4 distributions of the new 5-state model, and multiply by a factor of (1 - epsilon) where necessary so that everything still sums to one.
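The warm-start construction above can be sketched concretely. The function below is a hypothetical helper (its name, the routing of the new state's transitions, and the placement of the (1 - epsilon) factors are assumptions; the original answer only requires that rows still sum to one and the new state has epsilon probability):

```python
import numpy as np

def grow_hmm(pi4, A4, B4, eps=1e-3):
    """Build a 5-state HMM initialisation from a fitted 4-state HMM by adding
    a low-probability (eps) fifth state whose emission distribution is the
    average of the existing four. All probability rows still sum to 1."""
    n = A4.shape[0]
    # Fifth state's emissions: the mean of the existing emission rows.
    B5 = np.vstack([B4, B4.mean(axis=0)])
    # Shrink old probabilities by (1 - eps) and route the freed eps mass
    # to the new state.
    pi5 = np.append(pi4 * (1 - eps), eps)
    A5 = np.zeros((n + 1, n + 1))
    A5[:n, :n] = A4 * (1 - eps)
    A5[:n, n] = eps                               # each old state can enter the new one
    A5[n, :] = np.append(pi4 * (1 - eps), eps)    # new state leaves like the prior (a choice)
    return pi5, A5, B5

# Toy fitted 4-state model (hypothetical numbers).
pi4 = np.array([0.25, 0.25, 0.25, 0.25])
A4 = np.full((4, 4), 0.25)
B4 = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.7, 0.3]])
pi5, A5, B5 = grow_hmm(pi4, A4, B4)
print(np.allclose(pi5.sum(), 1.0),
      np.allclose(A5.sum(axis=1), 1.0),
      np.allclose(B5.sum(axis=1), 1.0))  # True True True
```

Starting EM from (pi5, A5, B5) gives the 5-state model essentially the same likelihood as the 4-state optimum on the first iteration, so the fit can only improve from there, which rules out the embarrassing reversal in the likelihood ratio test.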
mcdowella