What printable ASCII characters usually appear in the English text?

Question

What printable ASCII characters usually appear in the English text?

I am trying to solve Project Euler problem # 59 , and I am having problems because some of them seem somewhat more ambiguous than the previous problems.

As a background, the problem suggests that this text file is encrypted text with ASCII codes stored as numbers. The encryption method is XOR 3 lowercase letters cyclically with clear text (therefore, it is reversible). The problem is the key that decrypts the file into English text. How can I limit the character set of my output to get an answer without trying to sift through all possible plaintexts (26 ^ 3)?

I tried to limit the letters, spaces, and punctuation marks, and that didn't work.

To clarify: I want to determine from all printable ASCII characters which ones I can probably remove and which ones I can expect in a plain text string.

+4

cryptography encryption

murgatroid99 Aug 05 '10 at 12:56

source share

5 answers

A possible solution is to simply assume the presence of a given three-character sequence in the ciphertext. You can use a three-letter word or three-letter sequence, which is likely to appear in the English text (for example, " a " : the letter “a” enclosed between two spaces). Then just try all the possible positions of this sequence in the ciphertext. Each position allows you to simply recalculate the key, and then decrypt all the text into a file.

Since the source text is 1201 long, you can skip 1199 files. This is just patience at this point, but you can do it much faster using a simple text search utility in another frequent English sequence (for example, "are" ), for example, using the Unix grep tool.

I did just that and got the decrypted text in less than five minutes.

+2

Thomas pornin Aug 05 '10 at 20:53

source share

I agree in advance that I am not familiar with the XOR cipher.

However, it seems very similar to the vigenere encryption concept. Especially in the line where they are mentioned for unbreakable encryption, the key length is equal to the length of the message. It yells to Vernam Slate.

As mentioned in another answer, the strategic approach to breaking vigenere encryption involves a probabilistic approach. I will not go into details because most of the theory I studied was relatively complicated, but it can be found here, bearing in mind that vignere are a series of Caesar ciphers.

The problem makes your job easier because you already know the key length. Because of this, as you already mentioned, you can simply pounce by trying each combination of three letters.

Here's what I would like to do: take a reasonable size of the ciphertext, say maybe 10-20 characters, and try using brute force. Keep track of all the keys that appear to create understandable letter sequences and then use them throughout the ciphertext. Thus, we can use the obvious method of forced forcing, but without a rough check of the whole problem, so I don’t think you have to worry about limiting the output.

However, I agree that when creating output, if you ever get a non-printable character, you can probably break your loop and move on to the next key. I would not try anything more specific than this, because who knows what the original message may have, never make assumptions about the data you are dealing with. Such a short circuit logic is always a good idea, especially when implementing brute force solutions.

+1

Nickhalden Aug 05 '10 at 13:16

source share

Divide the ciphertext by 3.

Ciphertext1 contains the numbers 1, 4, 7, 10 ... Ciphertext2 contains the 2nd, 5th, 8th, 11th ... numbers Ciphertext3 contains the 3rd, 6th, 9th, 12- th ... numbers

Now you know that each cyphertext is encrypted with one letter. Now do a standard frequency analysis on it. This should give you enough information about what a letter is.

0

PaulJWilliams Aug 05 '10 at 16:06

source share

I just solved this problem a few days ago. Without ruining it for you, I want to describe my approach to this problem. Some of what I say may be redundant for what you already knew, but were part of my approach.

At first I suggested that the key is exactly as described, three lowercase ASCII letters. So I started to brutally force "aaa" and went to "zzz". When decrypting, if any byte received was below 32 (the ASCII value of the space, the lowest "printed" ASCII value) or above 126 (the ASCII tilde value "~", which is the highest printable character in ASCII) than I assumed that the key was invalid because any value outside 32 and 126 would be an invalid character for plain text stretching of the English language. As soon as one byte goes beyond this range, I stopped decrypting and moved on to the next possible key.

As soon as I decrypted the entire message using a specific key (after passing the first test of all bytes that are print characters), I needed a way to check it as a valid decryption. I expected the result to be a simple list of words with no particular order or meaning. Thanks to another experience in the field of cryptography, I again remembered the frequency of writing, and, most importantly, your average English word in the text has a length of 5 characters. The file contains 1201 input bytes. This means that on average there will be 240 words. After decryption, I calculated how many spaces were in the resulting output line. Since Project Euler is nothing more than an average, I compared the number of spaces to 200, taking into account longer and more obscure words. When the output was more than 200 spaces, I printed out the key, which was decrypted, and the output text. The answer and the only way out that has over 200 spaces. Let me tell you that it is more than obvious that you have an answer when you see it.

Something to indicate that the answer to the question is NOT the key. This is the sum of all the ASCII values of the output string. This approach also solves the equation under the one minute mark, in fact, it is about 3 or 4 seconds.

0

Corey ogburn May 20, '11 at 20:42

source share

Daniel Wedlund · Accepted Answer · 2010-08-05T13:02:57+0000

Have you tried two of the most “basic” and general tools when analyzing the algorithm used?

Analyze the frequency of characters and try to match it with the frequency of letters in English
Bruteforce, using keys from a list of words, most often ordinary words are used as keys from "dumb" users.

To analyze the frequency for this particular problem, you will have to split the row into every third element, since the key is 3 in length, now you can create three columns:

79 59 12 2 79 35 8 28 20 2 3 68 ...

you need to analyze the frequency for each column, since now they are independent of the key.

Well, it actually took my time, built 3 full columns and counted the frequency for each of the columns and got the two most commonly used elements or each column:

 Col1 Col2 Col3 71 79 68 2 1 1

Now, if you check, for example: http://en.wikipedia.org/wiki/Letter_frequency You have the most frequent letters, and do not forget that you have spaces and other characters that are not on this page, but I think you can assume that space is the most frequent character.

So now it’s just a question: the most frequent characters in the table, which I supplied with the most frequent characters in English, and to see if you have lowercase characters, I found a three-letter word, which I think is the answer only with this data.

Good luck and by the way, it was a good problem!

What printable ASCII characters usually appear in the English text?

More articles: