Decoding Permuted English Strings

A colleague of mine was recently asked this while interviewing for (some) research work:

Given 10 strings of 128 characters each, all permuted in exactly the same way, decode the strings. The source lines are English text with spaces, numbers, punctuation, and other non-alphabetic characters.

He was given several days to think it over before an answer was expected. How would you do it? You can use any computing resources, including character- or word-level language models.

+7
4 answers

This is a basic transposition cipher. My comment above was meant to determine whether it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward, and others have already mentioned the basic methods. Good approaches try to place the rarest letters first, since these tend to uniquely identify the letters around them, which greatly prunes the subsequent search space. Finding a place to put "a" (no pun intended) is not hard, but finding a place for "q", "z", or "x" is a bit more work.
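A minimal sketch of that rarest-first ordering, assuming plain Python and that the ciphertext lines are given as strings:

```python
from collections import Counter

def rarest_first(lines):
    """Order the letters that occur in the ciphertext, rarest first,
    so the most constraining placements are attempted early."""
    counts = Counter(c for line in lines for c in line.lower() if c.isalpha())
    return sorted(counts, key=counts.get)
```

Letters such as "q", "z", and "x" come out at the front of this list whenever they occur at all.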

The main measure of the algorithm's quality is not whether it can decrypt the text (brute-force methods will manage that eventually) but how quickly it does so: it should eliminate possibilities as fast as possible.

Since the same permutation was applied to every line, trying to build words from the rarest characters lets you test candidate placements against all the lines in parallel. Finding the correct placement of the rarest letters in one line decrypts that ciphertext PLUS all of the others at the same time, which is as fast as it gets.
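To illustrate that parallelism with a sketch: a guess that ciphertext position i is immediately followed by position j can be scored against all ten lines at once. The `bigram_logp` table here is an assumption, a dictionary of English letter-pair log-probabilities you would precompute from a corpus:

```python
import math

def adjacency_score(lines, i, j, bigram_logp):
    """Evidence that position i immediately precedes position j,
    summed over every line simultaneously."""
    floor = math.log(1e-9)  # penalty for pairs never seen in the corpus
    return sum(bigram_logp.get(line[i] + line[j], floor) for line in lines)
```

A placement that looks plausible in one line but absurd in the other nine is rejected immediately.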

If you search for cryptanalysis of transposition ciphers, you will find references to genetic algorithms. These mostly serve to advance the research agenda of people working on GAs, because they are not actually optimal in practice. Instead, look at some basic optimization methods, such as branch and bound, A*, and a number of statistical methods. (How deep you should go depends on your level of knowledge in algorithms and statistics. :) I have switched back and forth several times between deterministic methods and statistical optimization methods.)

In any case, the calculations should be cheap and fast, because the space of initial guesses can be quite large. It is best to have a cheap way to filter out many possible placements first, and then spend more CPU time sifting through the best candidates. To that end, it helps to describe the processing stages and the computational effort of each. (That, at least, is what I would expect if I gave this as an interview question.)
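One way to make the two-stage idea concrete, as a sketch (both scoring callables are assumptions, e.g. a cheap bigram filter and an expensive trigram or dictionary scorer):

```python
def sift(candidates, cheap_score, expensive_score, keep=100):
    """Rank all candidates with the cheap score, then spend CPU time
    on the expensive score for the survivors only."""
    survivors = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return max(survivors, key=expensive_score)
```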

You can even buy a fairly solid guide to deciphering double transposition ciphers.


Update 1: Browse these slides for more ideas on iterative improvement. It is not a great slide set, but it is easily accessible. And although the slides focus on GAs and simulated annealing (methods that come up a lot in search results for cryptanalysis of transposition ciphers), the real takeaway is that such methods fall short when you can use A* or other approaches instead. :)

+5

First you will need a test for correct ordering. Something fairly simple, like being able to break most texts into words using a dictionary sorted by frequency of use, without backtracking.
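A minimal sketch of such a test, using longest-match-first instead of frequency order for simplicity (`words` is an assumed set of lowercase dictionary words, e.g. loaded from /usr/share/dict/words):

```python
def splits_into_words(text, words, max_len=20):
    """Greedily break `text` into dictionary words with no backtracking.
    Returns True only if the whole string can be consumed."""
    text = text.lower()
    i, n = 0, len(text)
    while i < n:
        if not text[i].isalpha():        # skip spaces, digits, punctuation
            i += 1
            continue
        for j in range(min(n, i + max_len), i, -1):
            if text[i:j] in words:       # longest match wins
                i = j
                break
        else:
            return False                 # nothing fits: reject this ordering
    return True
```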

Once you have that, you can play with different approaches. Two I would try:

  • Using a genetic algorithm, with the score based on 2- and 3-letter tuples (which you can either find precomputed or build yourself). The tricky part of genetic algorithms is finding a good description of the process that can be fragmented and recombined. I would suggest that something like "move fragment x to after fragment y" would work well, where the indices are positions in the source text (and therefore shift as the "dna" is read). You may also need to extend the score with something that pushes you toward "real" text near the end, such as how far the verification algorithm gets, or how many full words are found. A minimal sketch of the move encoding appears after this list.

  • Using a graph approach. You would need to find a consistent path through a graph of letter positions, perhaps with a beam search of fixed width, using weights derived from letter-pair frequencies. I am not sure how you would handle reaching the end of a line and restarting. Perhaps 10 sentences are enough to identify good starting candidates (from letter frequencies) with decent probability; it would not surprise me.
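A minimal sketch of the move encoding from the first bullet, assuming 128-character lines; the fitness function (not shown) would be the tuple-frequency score described above:

```python
import random

def apply_genome(genome, n=128):
    """Decode a genome into a permutation of positions 0..n-1. Each gene
    relocates a fragment, so indices are interpreted against the current
    state of the permutation as the "dna" is read."""
    perm = list(range(n))
    for start, length, dest in genome:
        start, dest = start % len(perm), dest % len(perm)
        frag = perm[start:start + length]
        del perm[start:start + length]
        perm[dest:dest] = frag
    return perm

def random_gene(n=128, max_frag=8):
    return (random.randrange(n), random.randrange(1, max_frag), random.randrange(n))
```

Crossover then becomes simple: splice two genomes together at a random point.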

This is a nice problem :o) I suspect that 10 sentences is a strong constraint (at each step you have good evidence from common letter pairs across several lines; you probably want to combine the probabilities, rejecting the most unlikely placements, provided you include word-start / word-end pairs), so I think the graph approach will be the most effective. A beam-search sketch follows.
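A sketch of the graph approach, under stated assumptions: `bigram_logp` is a precomputed table of English letter-pair log-probabilities, and the end-of-line/restart issue mentioned above is simply ignored:

```python
import math

def beam_search(lines, bigram_logp, width=50, n=128):
    """Grow partial orderings of ciphertext positions, keeping only the
    `width` highest-scoring at each step. Pairwise evidence is combined
    across all lines, as suggested above."""
    floor = math.log(1e-9)

    def extend(score, i, j):
        return score + sum(bigram_logp.get(line[i] + line[j], floor)
                           for line in lines)

    beam = [([i], 0.0) for i in range(n)]      # any position may come first
    for _ in range(n - 1):
        grown = []
        for path, score in beam:
            used = set(path)
            grown += [(path + [j], extend(score, path[-1], j))
                      for j in range(n) if j not in used]
        beam = sorted(grown, key=lambda t: t[1], reverse=True)[:width]
    return beam[0][0]                          # best ordering of positions found
```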

+1

First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precompute the trigram frequencies of standard English (get some data from Project Gutenberg) and sum the frequencies of all trigrams in all ten lines. You may find that quadgrams give better results than trigrams.
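A minimal sketch of that scoring function, assuming `corpus_text` is a long plain-text string pulled from Project Gutenberg:

```python
import math
from collections import Counter

def trigram_logprobs(corpus_text):
    """Precompute trigram log-frequencies from standard English text."""
    counts = Counter(corpus_text[i:i + 3] for i in range(len(corpus_text) - 2))
    total = sum(counts.values())
    return {t: math.log(c / total) for t, c in counts.items()}

def score(lines, logp, floor=math.log(1e-9)):
    """Higher is better: sum trigram log-probabilities over all ten lines."""
    return sum(logp.get(line[i:i + 3], floor)
               for line in lines for i in range(len(line) - 2))
```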

Second, you need a way to create permutations. One approach, known as hill climbing, takes the ten strings and enters a loop: choose two random numbers from 1 to 128 and swap the letters at those positions in all ten lines. Compute the score of the new permutation and compare it with the score of the old one. If the new permutation is an improvement, keep it and loop; otherwise keep the old permutation and loop. Stop when the rate of improvement drops below some threshold. Present the result to a user, who can accept it as given, accept it and fix it by hand, or reject it, in which case you start again from the original set of lines with a different seed for the random-number generator.
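A minimal sketch of that loop (`score` is the trigram scorer above; a fixed step budget stands in for the improvement-rate threshold):

```python
import random

def swap_columns(lines, i, j):
    """Swap characters i and j in every line at once, since the same
    permutation applies to all ten lines."""
    out = []
    for line in lines:
        chars = list(line)
        chars[i], chars[j] = chars[j], chars[i]
        out.append("".join(chars))
    return out

def hill_climb(lines, score, steps=200_000):
    best, best_score = lines, score(lines)
    for _ in range(steps):
        i, j = random.sample(range(128), 2)
        candidate = swap_columns(best, i, j)
        s = score(candidate)
        if s > best_score:               # keep improvements only
            best, best_score = candidate, s
    return best
```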

Instead of hill climbing, you could try simulated annealing. I will leave the details to Google, but the idea is that instead of always keeping the better of the two permutations, you sometimes keep the worse one, in the hope that it leads to a better overall result. This is done to defeat hill climbing's tendency to get stuck at a local maximum in the search space.
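The same loop with an annealing acceptance rule, as a sketch (`swap_columns` is from the previous snippet; the schedule constants are arbitrary assumptions):

```python
import math
import random

def anneal(lines, score, t0=10.0, cooling=0.9999, steps=500_000):
    """Sometimes keep the worse of two permutations, with probability
    exp(delta / T), to escape local maxima; T shrinks over time."""
    current, cur_score, t = lines, score(lines), t0
    for _ in range(steps):
        i, j = random.sample(range(128), 2)
        candidate = swap_columns(current, i, j)
        delta = score(candidate) - cur_score
        if delta > 0 or random.random() < math.exp(delta / t):
            current, cur_score = candidate, cur_score + delta
        t *= cooling
    return current
```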

By the way, it is "permuted", not "permutated".

0

Frequency analysis will dramatically reduce the search space. The most common letters in English prose are well known.

Count the letters in the enciphered input and sort them into most-common-first order. Matching them against the most common letters of English, translate the ciphertext back into plaintext. The result will be close to correct, but most likely not exact. By hand, iteratively tune your permutation until readable plaintext appears (this usually takes only a few iterations).
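A sketch of the mapping step as this answer describes it (rank-matching letter frequencies, the classic opening move against simple ciphers); the frequency-order constant is approximate:

```python
from collections import Counter

ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"  # most common first (approximate)

def frequency_guess(lines):
    """Map the input's letters to English letters by frequency rank:
    its most common letter to "e", the next to "t", and so on."""
    counts = Counter(c for line in lines for c in line.lower() if c.isalpha())
    ranked = "".join(c for c, _ in counts.most_common())
    table = str.maketrans(ranked, ENGLISH_ORDER[:len(ranked)])
    return [line.lower().translate(table) for line in lines]
```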

If you find the manual checking odious, automate it: run candidate plaintexts through a spell checker and minimize the number of violations.
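A minimal stand-in for the spell checker, assuming `words` is a set of lowercase dictionary words: count out-of-dictionary tokens and prefer permutations that minimize the count.

```python
def violations(lines, words):
    """Number of tokens that are not dictionary words; lower is better."""
    bad = 0
    for line in lines:
        for token in line.lower().split():
            token = token.strip(".,;:!?'\"()")
            if token and token.isalpha() and token not in words:
                bad += 1
    return bad
```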

0
