Can I encode a string into an integer in a way that preserves lexicographic closeness?

I would like to encode variable-length strings (usually 1-100 characters) as integers so that strings that are lexicographically close (that is, near each other in a dictionary) map to integers that are close to each other, while keeping those integers evenly distributed over the range of possible integer values.

I understand that, to ensure an even distribution, some sampling of the possible strings may be required before they are encoded.

Does anyone have any ideas on how to do this?

+7
4 answers

Key compression may be useful here. The idea is to compare a set of strings and remove all the bits that are the same across them, which yields a set of almost unique keys small enough to fit into an integer. See Section 6 of "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs."

The algorithm described there does not always preserve lexicographic order, but it can be extended to do so.
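To make the idea concrete, here is a minimal sketch of dropping the bit positions that are identical in every key. It is a simplified illustration, not the algorithm from the paper: the helper names `to_fixed_width_int` and `compress_keys` are made up for this example, and it assumes ASCII strings truncated or zero-padded to a fixed width. In this simplified form the relative order of the keys is kept, because a bit that is the same in every key never decides a comparison.

```python
def to_fixed_width_int(text, n_bytes=8):
    """Hypothetical helper: first n_bytes of an ASCII string as a big-endian int."""
    raw = text.encode("ascii")[:n_bytes].ljust(n_bytes, b"\0")
    return int.from_bytes(raw, "big")


def compress_keys(keys, width):
    """Drop every bit position whose value is identical in all keys.

    Returns the compressed keys and the number of bits that remain.
    """
    all_ones = (1 << width) - 1
    first = keys[0]
    same_everywhere = all_ones
    for k in keys:
        # Clear the positions where this key differs from the first one.
        same_everywhere &= ~(k ^ first) & all_ones

    # Remaining (varying) positions, most significant first.
    varying = [b for b in range(width - 1, -1, -1)
               if not (same_everywhere >> b) & 1]

    def squeeze(k):
        out = 0
        for b in varying:
            out = (out << 1) | ((k >> b) & 1)
        return out

    return [squeeze(k) for k in keys], len(varying)


keys = [to_fixed_width_int(s) for s in ["http://a", "http://b", "http://c"]]
packed, n_bits = compress_keys(keys, 64)
print(packed, n_bits)
```

The three sample strings differ only in the low bits of their last byte, so almost all of the 64 bits are discarded.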

Edit

A more general approach is to split the string characters into independent parts (if possible), then determine the probabilities of these parts and apply arithmetic coding.
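As a rough sketch of how arithmetic coding can give an order-preserving integer: treat each character as an independent part, estimate its probability from sample strings, and narrow an interval in [0, 1) symbol by symbol, with the symbols laid out in alphabetical order. Everything here is an assumption made for illustration: the function names `build_model` and `encode`, the add-one smoothing, and the use of an empty string as an end-of-string symbol that sorts before every character. Characters outside the given alphabet are not handled.

```python
from fractions import Fraction


def build_model(samples, alphabet):
    """Estimate symbol probabilities from samples (add-one smoothing).

    Returns {symbol: (cumulative_start, probability)} with symbols in sorted
    order; '' is the end-of-string symbol and comes first.
    """
    symbols = [""] + sorted(alphabet)
    counts = {s: 1 for s in symbols}
    for text in samples:
        for ch in text:
            counts[ch] += 1
        counts[""] += 1
    total = sum(counts.values())

    model, cumulative = {}, Fraction(0)
    for s in symbols:
        p = Fraction(counts[s], total)
        model[s] = (cumulative, p)
        cumulative += p
    return model


def encode(text, model, bits=32):
    """Map text to an integer in [0, 2**bits), non-decreasing in lexicographic order."""
    low, width = Fraction(0), Fraction(1)
    for sym in list(text) + [""]:
        start, p = model[sym]
        low += width * start   # move to the sub-interval of this symbol
        width *= p             # shrink the interval by its probability
    return int(low * (1 << bits))


samples = ["apple", "apply", "banana", "band", "cat"]
model = build_model(samples, "abcdefghijklmnopqrstuvwxyz")
for word in sorted(samples):
    print(word, encode(word, model))
```

Frequent symbols get wide sub-intervals, which is what spreads the integers more evenly than a fixed number of bits per character. Strings that are very close together can still end up with the same integer once the result is truncated to `bits`.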

Edit2

To fit more of the string into a compressed key, it may be preferable to use some entropy coding in which the code for a character depends on the values of the preceding characters, but on no more than one or two of them (too high a compression order will degrade performance). Or, if the integer key has to be short enough (for example, 16 bits), it is better to use entropy methods to pre-calculate all the keys and put them in a collection sorted by string; in this case, the encoding prefix can be much longer.

+2

A common approach would be to use the first n characters of your string, zero-padded if necessary, as an integer. Reduce your alphabet accordingly, and you should achieve pretty tight packing. Example: suppose your input alphabet is Base64, with / marking the end of the string. You can encode the string "word/" by setting the six highest bits of your integer to 48, the next six to 40, and so on. Pad with two zero bits, and you get an exact representation in a 32-bit integer.

Lexicographically close words will have similar beginnings and thus similar most significant bits.

Naturally, words longer than 5 characters will collide with other words that share the same first 5 characters, but this cannot be avoided.
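The packing described above can be sketched as follows. The function name `pack_prefix` and its parameters are assumptions for this example; the alphabet is the standard Base64 one in index order, and the caller is expected to append the "/" terminator, as in the "word/" example.

```python
import string

# Standard Base64 alphabet in index order: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pack_prefix(text, n_chars=5, bits_per_char=6, total_bits=32):
    """Pack the first n_chars symbols into the high bits of an integer."""
    symbols = [INDEX[ch] for ch in text[:n_chars]]
    symbols += [0] * (n_chars - len(symbols))  # zero-pad short input
    value = 0
    for s in symbols:
        value = (value << bits_per_char) | s
    # Shift left so the unused low bits are zero (two bits for 5 x 6 in 32).
    return value << (total_bits - n_chars * bits_per_char)


key = pack_prefix("word/")
print(key >> 26, (key >> 20) & 0x3F)    # indices of 'w' and 'o'
print(pack_prefix("words") == pack_prefix("wordsmith"))  # same 5-char prefix
```

Note that the resulting integers sort by Base64 index, which is not the same as ASCII order (uppercase letters come before lowercase, and digits after both).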

+2

Your requirements are pretty tough. How about using an order-preserving (monotone) minimal perfect hash function? It guarantees that if you supply the strings in lexicographic order:

s1 < s2 < s3 < s4 < ... < sN 

they will be mapped to consecutive integers in the range [0..N-1]. See the following papers:

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2010/01_appoggiomg-minordhash.pdf

http://vigna.dsi.unimi.it/ftp/papers/MonotoneMinimalPerfectHashing.pdf
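For a feel of the mapping such a function produces, here is a small stand-in that computes the same ranks with a plain sorted list and binary search. It is not a hash function and does not implement the constructions from the papers above (it stores every key, which is exactly what they avoid); the class name `SortedRank` and the `spread` helper, which scales ranks over a 32-bit range, are assumptions for this sketch.

```python
from bisect import bisect_left


class SortedRank:
    """Map each known string to its rank in lexicographic order."""

    def __init__(self, strings):
        self._keys = sorted(set(strings))

    def rank(self, text):
        i = bisect_left(self._keys, text)
        if i == len(self._keys) or self._keys[i] != text:
            raise KeyError(text)  # only strings seen at build time are valid
        return i

    def spread(self, text, bits=32):
        """Scale the rank so the keys are spaced evenly over [0, 2**bits)."""
        return self.rank(text) * (1 << bits) // len(self._keys)


table = SortedRank(["pear", "apple", "fig", "apple"])
print([table.rank(s) for s in ["apple", "fig", "pear"]])
print([table.spread(s) for s in ["apple", "fig", "pear"]])
```

Like the hash-based version, this only works for a set of strings that is known in advance.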

0

It's impossible. Suppose you come up with some function that maps strings to integers, and suppose it maps the first input string s1 to the integer i1 and the second input string s2 to i2. The problem lies with the input strings that come later. You only have room for |i2 - i1| more input strings that fall between s1 and s2, but there is no way to guarantee that you will not get more than |i2 - i1| such strings, at least not practically: you would have to use integers on the order of 26^100 for single-case strings of up to 100 characters.
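A quick check of the size involved, assuming a 26-letter single-case alphabet and lengths from 1 to 100 (the variable names are only for this example):

```python
ALPHABET_SIZE = 26
MAX_LENGTH = 100

# Number of distinct strings of length 1..MAX_LENGTH over the alphabet.
total_strings = sum(ALPHABET_SIZE ** k for k in range(1, MAX_LENGTH + 1))

# Bits needed to give every such string its own integer.
bits_needed = total_strings.bit_length()
print(bits_needed)
print(total_strings > 2 ** 64)
```

The count needs several hundred bits, far more than a 32-bit or 64-bit integer offers, so a collision-free order-preserving mapping for arbitrary input cannot fit in a machine word.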

0
