Markov text generation

We were assigned a new project in my data structures class: generating text with Markov chains.

Overview

Given an input text file, we create an initial seed of n characters. We add this to our output string and select the next character based on frequency analysis.

This is a cat, and there are two dogs.

Initial seed: "Th". Possible next letters: i, e, e. Therefore the probability of choosing i is 1/3 and of choosing e is 2/3. Now, say we choose i. We add "i" to the output string, our seed becomes "hi", and the process continues.
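Roughly, the generation loop looks like this (a simplified sketch, not the assignment's required interface; the Model type here just stands in for whatever structure supplies the frequency-based choice):

    import java.util.Random;

    // Hypothetical placeholder for whatever structure (trie, matrix, ...)
    // answers "given this seed, pick the next letter by observed frequency".
    interface Model {
        char nextChar(String seed, Random rng);
    }

    class Generator {
        static String generate(Model model, String seed, int length, Random rng) {
            StringBuilder out = new StringBuilder(seed);
            String window = seed;                        // current n-character seed, e.g. "Th"
            while (out.length() < length) {
                char next = model.nextChar(window, rng); // frequency-weighted choice
                out.append(next);
                window = window.substring(1) + next;     // slide the window: "Th" + 'i' -> "hi"
            }
            return out.toString();
        }
    }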

My approach

I have three classes: Node, ConcreteTrie, and Driver.

Of course, the ConcreteTrie class is not a trie in the traditional sense. Here's how it works:

Given a sentence with k = 2:

This is a cat, and there are two dogs.

I generate nodes Th, hi, is, ..., gs, s. Each of these nodes has children, one for each letter that follows it. For example, node Th will have children i and e. I maintain counts in each of these nodes so that I can later compute the probabilities for choosing the next letter.
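Roughly, each of those nodes keeps something like the following (a simplified sketch of the idea, not my full Node/ConcreteTrie code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // Simplified sketch: each k-gram node counts the letters seen after it
    // and samples the next letter in proportion to those counts.
    class KGramNode {
        private final Map<Character, Integer> nextCounts = new HashMap<>();
        private int total = 0;

        void record(char next) {                  // called once per occurrence in the text
            nextCounts.merge(next, 1, Integer::sum);
            total++;
        }

        char sample(Random rng) {                 // e.g. for "Th": i with 1/3, e with 2/3
            if (total == 0) throw new IllegalStateException("no observations for this node");
            int r = rng.nextInt(total);
            for (Map.Entry<Character, Integer> e : nextCounts.entrySet()) {
                r -= e.getValue();
                if (r < 0) return e.getKey();
            }
            throw new IllegalStateException("unreachable");
        }
    }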

My questions are:

First of all, what is the most effective way to complete this project? My approach seems fairly quick to implement, but I really want to impress my professor. (In my last project, the Edit Distance problem, I implemented A*, a genetic algorithm, BFS, and simulated annealing, and I know the problem is NP-hard.)

Secondly, what is the point of this assignment? It doesn't really resemble anything we covered in class. What are we supposed to learn?

java algorithm

2 answers

On the relevance of this assignment to what you covered in class (your second question): the idea of a "data structures" class is to expose students to the many structures frequently encountered in CS (lists, stacks, queues, hashes, trees of various kinds, graphs in general, matrices of various shapes and sparseness, etc.), to give some insight into their common implementations, their strengths and weaknesses, and, generally, their various fields of application.
Since most games, puzzles, and problems can be mapped to some set of these structures, there is no shortage of subjects for lectures and assignments to draw on. Your class seems interesting, because by focusing on these structures you are also given the opportunity to discover real applications.
For example, in disguised form, "a cat and two dogs" is an introduction to the statistical models used in linguistics. Your curiosity and motivation led you to make the connection with Markov models, and that's good, because you will most likely meet Markov a few more times before graduation ;-) and, most certainly, in your professional life in CS or an adjacent domain. So yes! It may seem like you are only skimming over many application areas, but as long as you get a feel for which structures and algorithms are chosen in particular situations, you are not wasting your time.

Now some hints on possible implementation approaches.
The trie seems like a natural fit for this type of problem. You might ask yourself how this approach would scale if you had to index a whole book rather than this short sentence. The answer is roughly linearly, although it depends on how each selection is made along the three hops down the trie (for this second-order Markov chain): as the number of options grows, the path selection may become less efficient.
A possible alternative structure for building the index is a stochastic matrix (actually a "plain" sparse matrix during indexing, turned stochastic at the end when you normalize each row, or column, depending on how you set it up, to sum to one (100%)). Such a matrix would be roughly 729 x 28 and would allow indexing a two-letter tuple and its associated next letter in a single operation. (I got 28 by including the "start" and "stop" signals; details ...)
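For concreteness, here is a rough sketch of the matrix idea, simplified to a plain 26-letter alphabet (the 729 x 28 layout mentioned above would just use a slightly larger alphabet that includes the start/stop symbols; treat this as an illustration, not a finished implementation):

    // Rows are indexed by the two-letter tuple, columns by the following letter;
    // normalizing each row at the end turns the count matrix into a stochastic one.
    class BigramMatrix {
        private static final int A = 26;                    // simplified alphabet size
        private final double[][] m = new double[A * A][A];  // counts, then probabilities

        private static int idx(char c) { return Character.toLowerCase(c) - 'a'; }

        void index(String text) {
            for (int i = 0; i + 2 < text.length(); i++) {
                String tri = text.substring(i, i + 3);
                // this simplified sketch skips tuples containing spaces or punctuation
                if (!tri.chars().allMatch(Character::isLetter)) continue;
                int row = idx(tri.charAt(0)) * A + idx(tri.charAt(1)); // two-letter tuple
                m[row][idx(tri.charAt(2))]++;                          // following letter
            }
            for (double[] row : m) {                        // normalize each row to sum to 1
                double sum = 0;
                for (double v : row) sum += v;
                if (sum > 0) for (int j = 0; j < A; j++) row[j] /= sum;
            }
        }

        double probability(String pair, char next) {        // single-operation lookup
            return m[idx(pair.charAt(0)) * A + idx(pair.charAt(1))][idx(next)];
        }
    }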
The cost of this more efficient indexing is the use of extra space. Space-wise, the trie is very efficient, storing only the letter-triplet combinations that actually occur; the matrix, by contrast, wastes some space (you can bet it will remain very sparsely populated, even after indexing much more text than the "dog/cat" sentence).
This space-versus-CPU trade-off is very common, although some algorithms/structures do better than others on both counts... Also, the matrix approach would not scale nicely, size-wise, if the problem were changed to base the choice of letters on the preceding three characters, say.
Still, do consider looking at the matrix as an alternative implementation. In the spirit of this class, it is valuable to try different structures and to understand why/where one is better than another (in the context of a specific task).
A small side trip you could take is to create a tag cloud based on the probabilities of the letter pairs (or triplets): both the trie and the matrix contain all the data needed for that; the matrix, with all its interesting properties, may be better suited for it. Enjoy!


You are using the bigram approach with characters, but usually it is applied to words, since the output will be more meaningful when only a simple generator is used, as in your case.

1) From my point of view, you are doing fine. But maybe you should add a little randomization to the choice of the next node? For example, select a random node from the 5 most probable (a small sketch follows below). I mean, if you always choose the node with the highest probability, your output string will be too homogeneous.

2) I did the same homework at my university. I think the point is to show students that Markov chains are powerful, but that without extensive study of the application domain the generator's output would be ridiculous.
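For point 1), here is a small sketch of what "pick a random node from the top 5" could look like (a hypothetical helper, just to illustrate the idea; the real choice of data structure is up to you):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Hypothetical helper: instead of always taking the single most frequent
    // successor, keep the top k candidates by count and pick one of them at random.
    class TopKChooser {
        static char choose(Map<Character, Integer> successorCounts, int k, Random rng) {
            List<Map.Entry<Character, Integer>> candidates =
                    new ArrayList<>(successorCounts.entrySet());
            candidates.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));
            int limit = Math.min(k, candidates.size());   // assumes the map is non-empty
            return candidates.get(rng.nextInt(limit)).getKey();
        }
    }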

