A common approach is to take the first n characters of your string, padded with zero bytes if necessary, and interpret them as an integer. Reduce your alphabet accordingly, and you get a fairly tight packing. Example: suppose your input alphabet is Base64, with / marking the end of the string. You can encode the string "word/" by setting the six highest bits of your integer to 48 (the Base64 index of w), the next six to 40 (o), and so on. Five characters take 30 bits; pad with two zero bits, and you get an exact representation in a 32-bit integer.
Lexicographically close words will have similar beginnings and thus similar most significant bits.
Naturally, words longer than 5 characters collide whenever they share the same first 5 characters, but this cannot be avoided when packing into 32 bits.
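A minimal Python sketch of this packing, assuming the standard Base64 alphabet order (A-Z, a-z, 0-9, +, /) and that input words contain only letters, digits and +. The names `pack_prefix`, `CODE`, `TERMINATOR` and `MAX_CHARS` are made up for illustration:

```python
# Standard Base64 alphabet: index of each character is its 6-bit code.
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
CODE = {ch: i for i, ch in enumerate(BASE64_ALPHABET)}

TERMINATOR = "/"   # assumption from the example: "/" marks the end of the word
MAX_CHARS = 5      # 5 symbols * 6 bits = 30 bits, plus 2 padding bits = 32


def pack_prefix(word: str) -> int:
    """Pack the first five symbols of word + terminator into a 32-bit integer."""
    symbols = (word + TERMINATOR)[:MAX_CHARS]
    value = 0
    for ch in symbols:
        # Shift previous symbols up and append the next 6-bit code.
        value = (value << 6) | CODE[ch]
    # Words shorter than four characters: fill the unused symbol slots with zeros.
    value <<= 6 * (MAX_CHARS - len(symbols))
    # Two zero bits of padding at the bottom complete the 32-bit value.
    return value << 2


key = pack_prefix("word")
print(key >> 26, (key >> 20) & 0x3F)  # six highest bits, then the next six
```

For "word" the two printed values are 48 and 40, matching the example above. Two words that agree on their first five characters, such as "wording" and "wordiness", produce the same integer, which is the collision case described above.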
thiton