Finding a good 64-bit hash for file paths in UTF16

Question

Finding a good 64-bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoding . path separators are U + 005C '\'. Paths are paths to root relative window paths with zero completion, for example. "\ Windows \ system32 \ Drivers \ myDriver32.sys"

I want to pass this path to a 64-bit unsigned integer. For no cryptographic sounds required . Hashes must be case insensitive , but capable of handling letters other than ascii. Obviously, the hash should also change well.

There are some ideas that I had:

A) Using the Windows file identifier as a "hash". In my case, I want the hash to change if the file moves, so this is not an option.

B) Just use a regular hash of the hash: hash + = prime * hash + codepoint for the entire line.

I get the feeling that the fact that the path consists of "segements" (folder names and the final file name) can be used.

To summarize:

1) 64-bit hash
2) good distribution / multiple collisions for file system paths.
3) effective 4) no need to be safe
5) case insensitive

+6

path collision hash-collision hash utf-16

Dominik weber 15 Sep '10 at 20:12

source share

4 answers

Even if you don’t need a cryptographic hash, you can still use it, and since your problem is not safe, then a “broken” cryptographic hash will be fine. I suggest MD4 , which is pretty fast. On my PC (2.4 GHz Core2, using a single core) MD4 hashes over 700 MB / s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but this requires a rather specific situation to make a measurable difference.

For the specific properties you need, you will need:

To "normalize" the characters so that the uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case insensitivity in the Unicode world is not an easy task. From what you explain, I understand that you are only after the same case insensitivity that Windows uses to access files (I think it is ASCII-only, so converting to uppercase → lowercase is simple).
/ li>
Truncate MD4 output. MD4 produces 128 bits; just use the first 64 bits. It will be as scattered as you might wish.

There are MD4 versions available in features, including directly in RFC 1320 I, link to above. You can also find the open source implementation of MD4 in C and Java in sphlib .

+2

Thomas pornin Sep 16 '10 at 13:56

source share

I would just use something simple. I do not know which language you are using, so the following pseudo code:

ui64 res = 10000019; for(i = 0; i < len; i += 2) { ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]); res = res * 8191 + merge; // unchecked arithmetic } return res;

I assume that path[i + 1] is safe on the grounds that if len is odd, then in the latter case it will be safe to read U + 0000.

I would not use the fact that there are spaces caused by spaces in UTF-16, lowercase letters and character headers, as well as characters that are not allowed in paths, because they do not spread in such a way as to use this fact, which could be quickly used . A decrease of 32 (all characters below U + 0032 are not allowed in path names) will not be too expensive, but it will also not improve hashing.

+2

Jon hanna Sep 22 '10 at 16:19

source share

You can simply create a shared library in C # and use the FileInfo class to get the full path to the directory or file. Then use the .GetHashCode () path in the path, for example:

 Hash = fullPath.GetHashCode();

or

 int getHashCode(string uri) { if (uri == null) throw new ArgumentNullException(nameof(uri)); FileInfo fileInfo = new FileInfo(uri); return fileInfo.FullName.GetHashCode(); }

Although this is just 32-bit code, you duplicate it or add another HashCode based on some other characteristics of the file.

+1

reinier Nov 01 '16 at 14:24

source share

Archimedix · Accepted Answer · 2010-09-15T20:34:47+0000

Cryptographically secure hashes may not be very efficient in terms of speed, but there are implementations available for almost any programming language.
Is it possible to use them for your application depends on how much you depend on speed - a benchmark test will give you the appropriate answer.

You can use a substring of such a hash, for example. MD5 is in your path, previously converted to lowercase, so the hash is not case sensitive (it is required that you use a method for the lower case that knows how to convert all non-standard UTF-16 characters that may occur in the file system).

Cryptographically secure hashes have the advantage of being fairly evenly distributed regardless of how much of the substring you take, because they are designed for unpredictability, i.e. each part of the hash ideally depends on all the hashed data, just like any other part of it.

Finding a good 64-bit hash for file paths in UTF16

More articles: