What is this compression algorithm?

I have some compressed files along with their uncompressed versions, but not the software that originally created them. I am trying to work out what lies at the heart of the algorithm. Can you figure it out? At first I thought it might be some LZW variant, but I'm not sure. The data seems to make more sense when it is broken into 6-bit words: I see many repeating patterns that way.

The two files are very similar. The uncompressed versions differ in only a few bytes (the counts 2287 and 3439), which helps locate the corresponding bytes in the compressed files.

Compressed file # 1:

  02 02 01 17 0E 11 92 14 C0 55 52 44 FF BC AE 47 DB E1 05 42 F8 70 DE 57 23 FF
 54 1A 55 3D BF 54 10 E3 38 0C B2 FB C4 92 1C 20 DE 57 23 FF 54 1A 55 3D BE 5E
 4C 96 B2 0E 32 80 CB 2F BC 48 70 83 79 5C 8F FD 50 69 54 F6 F9 96 48 A9 07 19
 C2 30 F0 E1 BC AE 47 FE A8 34 AA 7B 7E 32 BF E5 1F EE A8 48 CA 11 87 87 0D E5
 72 3F F5 41 A5 53 DB E5 24 5D F8 CA FF 4C B1 13 8C 71 18 7B C3 86 F2 B9 1F FA
 A0 D2 A9 ED FD 55 97 BA 22 32 C0 CB 2F BC

Compressed file # 2:

  02 02 01 17 0E 11 92 14 C0 55 52 44 FF BC AE 47 DB E1 05 42 F8 70 DE 57 23 FF
 54 1A 55 3D BF 54 10 E3 38 0D 36 D4 04 92 1C 20 DE 57 23 FF 54 1A 55 3D BE 5E
 4C 96 B2 0E 32 80 D3 6D 40 48 70 83 79 5C 8F FD 50 69 54 F6 F9 96 48 A9 07 19
 C2 30 F0 E1 BC AE 47 FE A8 34 AA 7B 7E 32 BF E5 1F EE A8 48 CA 11 87 87 0D E5
 72 3F F5 41 A5 53 DB E5 24 5D F8 CA FF 4C B1 13 8C 71 18 7B C3 86 F2 B9 1F FA
 A0 D2 A9 ED FD 55 97 BA 22 32 C0 D3 6D 40

Uncompressed file # 1:

  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 2A 2A 2A 2A 2A 20 47 52 41 4E 44 20 54 4F 54 41 4C 53 20
 2A 2A 2A 2A 2A 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 54 4F 54 41 4C 20 52 45 43 4F 52 44
 53 20 52 45 41 44 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 32 32 38 37 0D 0A
 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 42 59 50
 41 53 53 45 44 20 20 20 20 20 20 20 20 20 20 32 32 38 37 0D 0A 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 43 48 41 4E 47 45 44 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 30 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C
 20 52 45 43 4F 52 44 53 20 4E 4F 54 20 4F 4E 20 58 52 45 46 20 20 20 20 20 20
 20 20 20 20 30 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 54 4F 54 41 4C 20 52 45 43 4F 52 44
 53 20 42 41 4E 4B 20 4E 4F 54 20 46 4F 55 4E 44 20 20 20 20 20 20 20 20 30 0D 0A
 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 57 52 49
 54 54 45 4E 20 20 20 20 20 20 20 20 20 20 20 20 32 32 38 37

Uncompressed file # 2:

  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 2A 2A 2A 2A 2A 20 47 52 41 4E 44 20 54 4F 54 41 4C 53 20
 2A 2A 2A 2A 2A 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 54 4F 54 41 4C 20 52 45 43 4F 52 44
 53 20 52 45 41 44 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 33 34 33 39 0D 0A
 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 42 59 50
 41 53 53 45 44 20 20 20 20 20 20 20 20 20 20 33 34 33 39 0D 0A 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 43 48 41 4E 47 45 44 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 30 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C
 20 52 45 43 4F 52 44 53 20 4E 4F 54 20 4F 4E 20 58 52 45 46 20 20 20 20 20 20
 20 20 20 20 30 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 54 54 4F 54 41 4C 20 52 45 43 4F 52 44
 53 20 42 41 4E 4B 20 4E 4F 54 20 46 4F 55 4E 44 20 20 20 20 20 20 20 20 30 0D 0A
 0D 0A 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 20 20 20 20 20 20 20 20 20 54 4F 54 41 4C 20 52 45 43 4F 52 44 53 20 57 52 49
 54 54 45 4E 20 20 20 20 20 20 20 20 20 20 20 20 33 34 33 39

As you can see, the output files are just ASCII text files. Any ideas?

1 answer

This appears to be some kind of proprietary encoding format, designed to shave bits off certain types of messages.

It takes 8-bit (ASCII) input and produces a bit stream made of a mixture of 5-bit and 6-bit tokens, some of which are control codes.

The following markers can be identified:

// 5 bit tokens:
  00000   switch to 6 bit mode
  00011   take the following 6 bits as N, and output N spaces
  00100   A
  00101   B
  .....
  11101   Z
  11110   <crlf>
  11111   space

// 6 bit tokens:
  000001  switch to 5 bit mode
  000011  take the following 6 bits as N, and output N spaces
  001001  <crlf>
  011000  1
  011001  2
  ......
  100000  9

// pure speculation:
  010111  0
  010010  *
  000110  repeat the next 6 bit char N times
  001100  space
  00001   skip 3 bits, take the next 8 bits as ascii, and output N times
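To check the letter and digit mappings against the dumps, here is a minimal Python sketch. It is not a full decompressor: it reads fixed-width tokens only and ignores the mode switches and repeat codes. The function names are made up for illustration, MSB-first bit order is an assumption, and the bit offsets were found by aligning the fragments by hand.

```python
def read_tokens(data, bit_offset, width):
    """Split data into fixed-width tokens, MSB first (assumed), from bit_offset.

    Trailing bits that do not fill a whole token are dropped.
    """
    bits = "".join(f"{byte:08b}" for byte in data)[bit_offset:]
    usable = len(bits) - len(bits) % width
    return [int(bits[i:i + width], 2) for i in range(0, usable, width)]


def decode_5bit(token):
    """Map a 5-bit token to text; control codes are shown as <bits>."""
    if 4 <= token <= 29:          # 00100..11101 -> A..Z
        return chr(ord("A") + token - 4)
    if token == 30:               # 11110 -> <crlf>
        return "\r\n"
    if token == 31:               # 11111 -> space
        return " "
    return f"<{token:05b}>"


def decode_6bit(token):
    """Map a 6-bit token to text; control codes are shown as <bits>."""
    if 24 <= token <= 32:         # 011000..100000 -> 1..9
        return str(token - 23)
    if token == 23:               # 010111 -> 0 (speculative)
        return "0"
    if token == 9:                # 001001 -> <crlf>
        return "\r\n"
    return f"<{token:06b}>"


def decode_fixed(hex_bytes, bit_offset, width):
    """Decode a hex fragment as a run of tokens of a single width."""
    decode = decode_5bit if width == 5 else decode_6bit
    tokens = read_tokens(bytes.fromhex(hex_bytes), bit_offset, width)
    return "".join(decode(t) for t in tokens)


# Fragment from the fourth row of either compressed dump, 5-bit mode.
print(decode_fixed("BC AE 47 FE A8 34 AA 7B", 0, 5))  # TOTAL RECORD
# Last four bytes of each compressed file, 6-bit mode, starting at bit 7.
print(decode_fixed("C0 CB 2F BC", 7, 6))              # 2287
print(decode_fixed("C0 D3 6D 40", 7, 6))              # 3439
```

The last two fragments are exactly where the two compressed files differ, and they decode to the two counts that differ in the uncompressed files.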

Without additional examples, it is difficult to determine what happens at the beginning of the stream. The leading bytes may be some kind of magic value, or they may carry control values.
