What is the name of this algorithm / subroutine?

I am writing a utility class that converts strings from one alphabet to another. This is useful when you have a fixed target alphabet and a limit on the number of characters available. For example, if you can use lowercase letters and numbers, but only 12 characters, this allows you to compress a timestamp from the alphabet 0123456789 -: to abcdefghijklmnopqrstuvwxyz0123456789, so 2010-10-29 13:14:00 could become 5hhyo9v8mk6avy (19 characters reduced to 14).

The class is designed to convert back and forth between the two alphabets, and also to calculate the longest source string that can be safely stored in a given number of characters of the target alphabet.

I thought of publishing this on Google Code, and I would obviously like other people to find and use it, hence the question of what this is called. I have needed this approach in two separate projects (one at Bloomberg and one in my own system) where I had to generate unique file names of a certain length while keeping some plain text, so GUIDs were not suitable.
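A minimal sketch of the behaviour described above (class and method names are mine, not from the actual utility): the source string is read as a number in base len(source), then written out in base len(target).

```python
import math

class AlphabetConverter:
    """Hypothetical sketch of the converter described in the question."""

    def __init__(self, source: str, target: str):
        self.source = source
        self.target = target

    def convert(self, text: str) -> str:
        # Interpret `text` as a number in base len(source)...
        n = 0
        for ch in text:
            n = n * len(self.source) + self.source.index(ch)
        # ...then rewrite that number in base len(target).
        # Note: leading "zero" characters of the source alphabet are
        # lost here; the leading-zeros issue is discussed in the answers.
        digits = []
        while True:
            n, r = divmod(n, len(self.target))
            digits.append(self.target[r])
            if n == 0:
                break
        return "".join(reversed(digits))

    def max_source_length(self, target_chars: int) -> int:
        # Longest source string guaranteed to fit into `target_chars`
        # characters of the target alphabet: largest L such that
        # len(source)**L <= len(target)**target_chars.
        return int(target_chars * math.log(len(self.target))
                   / math.log(len(self.source)))
```

For example, converting from the three-symbol alphabet ABC to decimal digits turns BAC (the base-3 number 102) into 11.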

+7
string algorithm compression
3 answers

Your example has some similarities with a dictionary coder using fixed source and target dictionaries. It is also worth taking a look at Fibonacci coding, which has a fixed target dictionary of variable-length bit codes.

I think it also depends on whether fixed-width entries in your target alphabet really matter to you: if you allow a fixed alphabet with variable-length codes, your compression ratio will approach the source entropy, which is much closer to optimal! If the distribution of the source symbols is known in advance, a good old Huffman tree is easy to generate.
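To illustrate the Huffman suggestion, here is a sketch that builds variable-length codes from known symbol frequencies using Python's standard heapq (the sample string is made-up, not from the question):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a prefix code from a {symbol: frequency} mapping."""
    # Each heap entry: (frequency, unique tiebreaker, {symbol: code}).
    # The tiebreaker keeps tuple comparison away from the dicts.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prepending one bit
        # to every code inside each of them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes(Counter("ABRACADABRA"))
```

The most frequent symbol (A, 5 of 11 occurrences) ends up with the shortest code, which is exactly where the compression over fixed-width codes comes from.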

+2

Here is a simple algorithm:

Note that you do not need to transmit the alphabets used for coding. Also, unlike standard compression schemes, you do not use (or transmit) the probabilities of the input symbols, so this is just a re-encoding of the data, not compression in the information-theoretic sense.

In this case, we can treat the input data as a number written in a base equal to the size of the input alphabet. All we need to do is rewrite that number in another base, which is a straightforward task.

EDITED example:

input alphabet: ABC , output alphabet: 0123456789

the message ABAC maps to 0102 in base 3, i.e. 11 (9 + 2) in base 10.

So the encoded output in the base-10 alphabet is: 11

We may have a problem decoding this, because we do not know how many leading zeros (A characters here) the decoded result should have, so we need one of the following modifications:

1) somehow encode the length of the original message in the stream;

2) use a dummy 1 at the beginning of the stream; with this modification our example becomes:

10102 (base 3) = 81 + 9 + 2 = 92 (base 10).

Now, after decoding, we simply drop the leading 1 (this also provides basic error detection: if the first decoded digit is not 1, the data is corrupt).
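The worked example above can be sketched as follows (the function names are mine); the dummy leading digit preserves leading "zeros" of the source alphabet and doubles as the sanity check just described:

```python
def encode(msg, src, dst):
    # Prepend a dummy leading digit with value 1, then read the
    # message as a number in base len(src).
    n = 1
    for ch in msg:
        n = n * len(src) + src.index(ch)
    # Write the number out in base len(dst).
    out = []
    while n:
        n, r = divmod(n, len(dst))
        out.append(dst[r])
    return "".join(reversed(out))

def decode(msg, src, dst):
    # Read the encoded string as a number in base len(dst).
    n = 0
    for ch in msg:
        n = n * len(dst) + dst.index(ch)
    # Write it out in base len(src).
    out = []
    while n:
        n, r = divmod(n, len(src))
        out.append(src[r])
    digits = "".join(reversed(out))
    # The first digit must be the dummy 1 (i.e. src[1]); anything
    # else means the data is corrupt.
    if not digits or digits[0] != src[1]:
        raise ValueError("corrupt data: dummy marker digit missing")
    return digits[1:]
```

With the alphabets from the example, ABAC becomes 10102 in base 3, which is 92 in base 10, and decoding 92 restores ABAC, including any leading A characters.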

The main problem with this approach is that in most cases (whenever the alphabet sizes are coprime, i.e. GCD == 1), each newly encoded character can change the entire result, so the whole message has to be handled as one big number, which makes the conversion slow and awkward to implement. Following this idea to its conclusion leads to arithmetic coding as the best solution (the scheme above is in fact a simplified version of it).

+1

You probably know about Base64, which does much the same thing, usually in the opposite direction (it expands rather than compresses). Too bad there are so many irrelevant Google results for BaseX or BaseN...
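For comparison, Base64 is the same base-change idea with a 256-value source alphabet (raw bytes) and a 64-character target alphabet, available in Python's standard library:

```python
import base64

# Every 3 input bytes become 4 output characters, so Base64 expands
# the data; the question wants the opposite direction (a larger
# target alphabet than the source, shrinking the string).
encoded = base64.b64encode(b"2010-10-29 13:14:00").decode("ascii")
decoded = base64.b64decode(encoded)
```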

0
