Exact learning algorithm for the hidden Markov model

In most cases, the Baum-Welch algorithm is used to train hidden Markov models.

However, many references state that the BW algorithm only optimizes until it gets stuck in a local optimum.

Is there an exact algorithm that actually finds the global optimum (other than essentially enumerating all possible models and evaluating them)?

Of course, for most applications BW works fine. However, we are interested in finding lower bounds on the information loss incurred when reducing the number of states, so we always need to find the best possible model.

Thus, we are looking for an algorithm that may well be NP-hard, but is as efficient as possible: one that enumerates only the (potentially exponential) number of extreme points, rather than searching over a discretized grid of floating-point values for each probability in the model.

2 answers

A quick search turns up http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec06/node6.html, which says: "In this case, the problem of finding the optimal set of parameters $\Theta^{\ast}$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM technique (Expectation Maximization), can be used to heuristically find a solution to the problem." Therefore I assume that a variant of EM guaranteed to find the global optimum in polynomial time would prove P = NP, so it is unknown and in fact probably does not exist.
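Since exact optimization appears intractable, the usual practical mitigation is to run Baum-Welch from several random initializations and keep the best local optimum. Below is a minimal sketch of that restart loop, assuming the hmmlearn package (CategoricalHMM is the discrete-emission class in recent releases; older versions used MultinomialHMM for this) and a toy integer-coded observation sequence:

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency, not mentioned in the post

# Toy observation sequence of integer symbols, shaped (n_samples, 1) as hmmlearn expects.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 1))

best_model, best_ll = None, -np.inf
for seed in range(20):
    # Different random_state seeds give different EM initializations,
    # so each restart may land in a different local optimum of the likelihood.
    model = hmm.CategoricalHMM(n_components=4, n_iter=200, random_state=seed)
    model.fit(X)
    ll = model.score(X)
    if ll > best_ll:
        best_model, best_ll = model, ll

print("best log-likelihood over 20 restarts:", best_ll)
```

This only raises the odds of finding a good local optimum; it gives no guarantee of reaching the global one.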

This problem is almost certainly not convex, because in general there will be multiple globally optimal solutions with the same score: given any proposed solution, which typically assigns a probability distribution to the observations given each hidden state, you can, for example, relabel hidden state 0 as hidden state 1 and vice versa, adjusting the probability distributions so that the observed behavior generated by the two different solutions is identical. Thus, if there are N hidden states, there are at least N! optima produced by permuting the hidden states.
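A small sketch of this symmetry, using a hand-written forward algorithm on a toy two-state model (all numbers are made up for illustration): permuting the hidden-state labels gives a different parameter vector with exactly the same likelihood.

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of an observation sequence under a discrete-emission HMM
    via the scaled forward algorithm.
    pi: (N,) initial distribution, A: (N, N) transition matrix,
    B: (N, M) emission matrix, obs: sequence of integer symbols."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# A toy 2-state, 2-symbol model.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])
obs = [0, 1, 1, 0, 1]

# Relabel the hidden states: the permuted model is a distinct parameter
# vector, yet it produces exactly the same likelihood.
perm = [1, 0]
print(forward_loglik(pi, A, B, obs))
print(forward_loglik(pi[perm], A[np.ix_(perm, perm)], B[perm], obs))
```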

Another application of EM, https://www.math.ias.edu/csdm/files/11-12/amoitra_disentangling_gaussians.pdf, presents an algorithm for finding a globally optimal Gaussian mixture model. It observes that the EM algorithm is often used for this problem, but points out that EM is not guaranteed to find the global optimum, and it does not cite as related work any version of EM that is (it also says that the EM algorithm is slow).

If you are trying to do some kind of likelihood test between, for example, a 4-state model and a 5-state model, it would clearly be embarrassing if, because of local optima, the 5-state model came out with a lower likelihood than the 4-state model. One way to avoid this, or to recover from it, is to start EM for the 5-state model from a point very close to the best 4-state model found. For example, you could create a fifth state with probability epsilon and an emission distribution equal to the average of the emission distributions of the 4 existing states, keep those 4 distributions as the other 4 in the new 5-state model, and multiply by a factor of (1 - epsilon) where needed so that everything still sums to one.
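A minimal sketch of that initialization trick, assuming a discrete-emission HMM whose fitted 4-state parameters are given as numpy arrays (the helper name add_epsilon_state and the details of where the epsilon mass goes are illustrative assumptions):

```python
import numpy as np

def add_epsilon_state(pi4, A4, B4, eps=1e-3):
    """Build a 5-state initialization from a fitted 4-state model by adding a
    low-probability fifth state whose emission distribution is the average of
    the existing ones.
    pi4: (4,), A4: (4, 4), B4: (4, M) row-stochastic parameters."""
    n, m = B4.shape
    # Start probabilities: scale the old ones by (1 - eps), give eps to the new state.
    pi5 = np.append(pi4 * (1 - eps), eps)
    # Transitions: each old state sends eps of its mass to the new state;
    # the new state itself transitions uniformly.
    A5 = np.zeros((n + 1, n + 1))
    A5[:n, :n] = A4 * (1 - eps)
    A5[:n, n] = eps
    A5[n, :] = 1.0 / (n + 1)
    # Emissions: the new state emits according to the average of the old rows.
    B5 = np.vstack([B4, B4.mean(axis=0)])
    return pi5, A5, B5
```

The returned (pi5, A5, B5) can then be used as the starting point for EM on the 5-state model, so its initial log-likelihood is already close to that of the best 4-state model found.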


I think that if you really want this, you could determine the region of convergence associated with a local optimum. Under fairly weak conditions, you might be able to show quickly that either the entire domain lies within that region of convergence, or that there is a second local minimum.

For example, suppose I have two independent variables (x, y) and one dependent variable (z). Given a local minimum z_1 at (x_1, y_1) and a pair of starting points that converge to it, p_1 = (x_2, y_1) and p_2 = (x_1, y_3), I might then be able to prove that the whole triangle z_1, p_1, p_2 lies within the region of convergence.

Of course, this is not an approach that works in general, but it can efficiently handle a subclass of problems. Some problems have no region of convergence in this sense at all, e.g. a point can converge to a different solution than every point in its neighborhood, but many problems have some reasonable smoothness in how starting points converge to a solution, so the approach can work well for them.
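As a toy illustration of probing a convergence region, the sketch below samples starting points inside the triangle spanned by three points known to reach the same minimum and checks that gradient descent from each sample lands there too; the objective, the points, and the step size are all illustrative assumptions, not taken from the answer.

```python
import numpy as np

def f_grad(x, y):
    # Gradient of f(x, y) = (x - 1)**2 + (y + 2)**2, a toy single-minimum objective.
    return np.array([2 * (x - 1), 2 * (y + 2)])

def descend(p, lr=0.1, steps=500):
    # Plain gradient descent from starting point p.
    p = np.array(p, dtype=float)
    for _ in range(steps):
        p -= lr * f_grad(*p)
    return p

# Three starting points assumed to converge to the same local minimum.
p0, p1, p2 = np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 3.0])
target = descend(p0)

rng = np.random.default_rng(0)
for _ in range(100):
    # Random barycentric weights give a sample inside the triangle p0, p1, p2.
    w = rng.dirichlet([1.0, 1.0, 1.0])
    sample = w[0] * p0 + w[1] * p1 + w[2] * p2
    assert np.allclose(descend(sample), target, atol=1e-6)
print("all sampled starting points in the triangle reached the same minimum")
```

Such sampling is evidence rather than proof, of course; a real argument would need the smoothness conditions mentioned above.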

