Character sequence prediction?

I am new to mechanical training, so please go ahead if the problem is trivial.

I was given a sequence of observable characters, say ABABBABBB ..... (n characters). My goal is to predict the following symbols with some "learning" mechanisms. My limitations are that the observed characters (training data?) Are not too many. I will say that a 6000-length sequence for exploring the underlying pattern

I'm pretty confused about which strategy I need to solve to solve this problem, my initial bets are: 1) some kind of ngram model? 2) Neural networks (e.g. LSTM), etc.? 3) HMMs

Can you give directions on the right approaches to solve this problem?

+8
text machine-learning neural-network n-gram lstm
source share
2 answers

Your problem is time series analysis. For this, you should also consider statistics and intelligence analysis (EDA), in addition to machine learning algorithms.

  • I would start by assigning numbers to characters (A-> 1, B-> 2, etc.). It is usually not recommended to convert nominal variables (values โ€‹โ€‹without order) into ordinal ones, (2 is greater than 1, but โ€œCโ€ is greater than โ€œAโ€ or โ€œRedโ€ is greater than โ€œGreenโ€ ?!), but in this case, it will change your problem is the absolute analysis of the time series.

  • Then I would use the usual EDA approaches like 4-plot or autocorrelation . does this tell you a lot about the statistical behavior of the data, for example, "is the average data shift"? or "how random can the data set be?" Subsequently, you probably would have a better decision about which machine learning algorithm to use.

  • Depending on what you can find in the EDA analysis, you can continue to implement ML algorithms. If you have highly correlated data (perceived from the autocorrelation graph), then you probably choose a moving window approach in your choice of function, i.e. Assuming that each value depends on previous values โ€‹โ€‹of k ( x_k = f(x_(k-1),x_(k-2),...,x_(km)) ). m value can be selected by analyzing the autocorrelation graph. If you have a moving average, it would be nice to first study the average curve, and then continue to study the offset of each instance from the average. If you notice a degree of randomness in the average curve or instance offsets, then you can choose a stochastic approach using your prediction problem.

In general, the EDA philosophy is โ€œthe analysis must come before choosing a model,โ€ and I think it's true. If you know more about what you are dealing with, then you will definitely have the best challenge to choose a model.

+3
source share

If you are dealing with a rather trivial pattern, where the letters are based only on the previous one, then you will notice that the Hidden Markov Model (HMM) will solve it - in fact, something simple how the Markov chain will work.

If you want to have some fun, here's an HMM-based solution that you can play with.


Go to the sample data and create a linked list of each item in the order in which they were inserted. Now create a different list for each other character and place the index of each list item where it belongs. Here's a (very poorly drawn) visual representation of the linked list and the bucket below it:

Linked List Over Arrays

Now that you are presented with a sequence and asked to predict the next character, you only need to look at the last X-characters and see how subsequences similar to it work.

To use my example above, look at the last (last) 3 characters to get a BAC . You want to see if the BAC sequence has ever been before, and what happened after it when it happened. If you check the bucket for the first letter of BAC ( B ), you will see that the letter B appeared earlier. Fortunately, he follows the sequence - and after him appeared A , so that would be a prediction.


You can not only check the sequences of the past X, but also each number below X, giving each less weight if the sequence matches to create a better heuristic.

The hard part is deciding how far to look - if you look too far, it will take too much time and you will not be able to get any matches. If you look too short, you can skip the pattern and guess.

Good luck - I hope itโ€™s nice and easy to implement and works for you.

+2
source share

All Articles