Why such a high collision rate with my ELF hash implementation

I am currently evaluating several general-purpose hash functions for use in Object.GetHashCode() overrides. Initially, on the recommendation of this site, I started with ELF. My C# implementation is below:

 public int Generate(byte[] key)
 {
     const uint c = 0xf0000000;
     uint h = 0, g = 0;

     unchecked
     {
         for (int i = 0, len = key.Length; i < len; i++)
         {
             h = (h << 4) + key[i];
             if ((g = h & c) != 0)
                 h ^= g >> 24;
             h &= ~g;
         }
     }

     return (int)h;
 }

My test case consists of 524,288 unique values, divided into short (1-64) and long (256-2048) strings (from a limited set of ASCII characters) and arbitrary binary data (131,072 of each), to exercise each algorithm under different conditions.

I also understand the limitations of this test case. A hash algorithm can work fine when hashing, say, URLs, but be awful at hashing JPEGs or the like. Random strings and binary data seemed like the best starting point for choosing a general-purpose function, though I'm happy to hear reasons why that isn't so.

I performed 3 separate test runs (generating a new set of random strings/bytes each time) and averaged the results.

The ELF algorithm produced an awful number of collisions compared to the other algorithms I tested:

  • Short strings: 817 collisions (~0.5% failure rate).
  • Short binary: 550 collisions (~0.4% failure rate).
  • Long strings/binary: 34 collisions (~0.025% failure rate).

To put this in context, the other 3 algorithms I tested produced between 3 and 10 collisions on average for the same tests. ELF is also among the slowest of the 4, so at the moment it seems completely useless.

Full results:

               Strings              Binary
  Algorithm    short : long         short : long
  ELF          817   : 40           550   : 28
  FNV          1.6   : 2            0.6   : 2.6
  OAT          9     : 9.6          14    : 5
  Jenkins*     2     : 1.3          12    : 3.6

 * A close approximation of the lookup3 hash function.

So, for the same random samples that ELF struggles with (I created 3 separate sets), all of the other tested algorithms produce far fewer collisions.

I looked for variations of the ELF algorithm, but the few examples I found seem consistent with what I implemented. The only variation I saw was in this SO question: Using ELF to create a modified hash file. That variation includes h &= g >> 24 in the if-block and caps the result to 31 bits. I tested it and it produced the same terrible results.

Have I done something subtle but terribly wrong? I can't understand how it performs so poorly when it is claimed to be widely used on Unix.

3 answers

The expected number of collisions in 524,000 random samples over a 32-bit hash is 34.

You get 34 collisions with long strings, so for long strings this algorithm works more or less as expected.
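As a rough sanity check on that figure (a sketch, not part of the original answer), the standard birthday approximation n^2 / (2m), for n samples over a hash space of size m, lands in the same ballpark:

 using System;

 class CollisionEstimate
 {
     static void Main()
     {
         double n = 524288;          // number of hashed samples
         double m = Math.Pow(2, 32); // size of the 32-bit hash space

         // Birthday-problem approximation: expected collisions ≈ n^2 / (2m).
         double expected = n * n / (2 * m);

         Console.WriteLine("Expected collisions: {0:F1}", expected); // prints 32.0, the same order as the ~34 quoted above
     }
 }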

Hash collisions are much more common with short strings because the data has far less entropy, so I'm not at all surprised that you get an order of magnitude worse performance on short strings.

What is surprising is that you only get about ten collisions with the other hashing algorithms; I would have expected many more.

Regarding raw speed: you could stop trying to be so clever. The jitter can recognize and optimize the extremely common pattern:

 for(int i = 0; i < array.Length; ++i) do something with array[i] 

so that it avoids recomputing the length and avoids the range check on the array access. By trying to be clever and hoisting the length yourself, you may be fooling the jitter into no longer optimizing away the range check.

If you really want to avoid the range check, you can always drop into unsafe code: fix the array in place, obtain a pointer to it, and then increment the pointer as you would in a C program. You take on responsibility for ensuring memory safety at that point, but the odds are good that it will be faster.
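As an illustration of that suggestion (a sketch of my own, not code from the answer, and it assumes the project is compiled with unsafe code enabled), the loop from the question could be rewritten with a pinned pointer roughly like this:

 public unsafe int GenerateUnsafe(byte[] key)
 {
     const uint c = 0xf0000000;
     uint h = 0, g;

     unchecked
     {
         // Pinning the array lets us walk it with a raw pointer, C-style,
         // with no per-element bounds check.
         fixed (byte* start = key)
         {
             byte* p = start;
             byte* end = start + key.Length;

             while (p < end)
             {
                 h = (h << 4) + *p++;   // same ELF mixing step as the safe version
                 if ((g = h & c) != 0)
                     h ^= g >> 24;
                 h &= ~g;
             }
         }

         return (int)h;
     }
 }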

Of course, armchair performance analysis like this is worth exactly what you pay for it; to get a real analysis, try it both ways and see what happens.


This is not a cryptographic hash; it is a hash-table hash.

That is quite reasonable performance for a hash function intended for use in a hash table, where you will typically be storing from hundreds to hundreds of thousands of objects and want to store and retrieve them quickly.

You do this by dividing the table into buckets, each of which holds a linked list (or perhaps an array). You compute the hash and take the remainder after dividing by the number of buckets to determine the bucket; then you walk that bucket's linked list, comparing each object, to find the one you want.

If the bucket is empty, the object is not present; you can then create it, or take whatever other action is appropriate for your application.

The hash table should be sized with roughly the same number of buckets as the expected number of items to store (or slightly more), so that most lookups land on a bucket with zero, one, or two entries.
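A minimal sketch of that bucket scheme (the types and names here are illustrative, not code from this answer) might look like:

 using System;
 using System.Collections.Generic;
 using System.Linq;

 static class BucketLookup
 {
     // Hypothetical lookup: hash the key, reduce the hash to a bucket index,
     // then scan that bucket's (short) list for an exact match.
     public static string Find(List<KeyValuePair<byte[], string>>[] buckets,
                               byte[] key, Func<byte[], int> hash)
     {
         // Mask off the sign bit so the remainder is a valid, non-negative index.
         int index = (hash(key) & 0x7fffffff) % buckets.Length;
         var bucket = buckets[index];

         if (bucket != null)
             foreach (var entry in bucket)
                 if (entry.Key.SequenceEqual(key)) // full comparison on each candidate
                     return entry.Value;

         return null; // empty bucket or no match: the key is not present
     }
 }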

For performance, you want to balance the cost of computing the hash against the cost of walking a very short linked list when you do get a collision. That is the trade-off that ELF and similarly purposed hash functions are designed for.

In short:

  • In a hash table, the occasional collision is the price you pay for a fast hash.
  • In a cryptographic hash, a slow hash is the price you pay for collision resistance.

If collisions are a problem in your application, use SHA1 or SHA256 or something of that kind.

Note: for use as an object.GetHashCode() implementation, the hash code is intended only to speed up comparisons ("fail fast") and for use in hash tables. You do not need it to be fully collision resistant, since you will fall back to a full equality comparison when it collides; you need balanced performance. I suggest simply hashing the most important fields (using their own GetHashCode()) and XORing the values.
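For example, here is a sketch of that suggestion on a hypothetical type (the class and its fields are illustrative, not from the question):

 // Hypothetical type: combine the hash codes of the most significant fields,
 // and rely on Equals for the full comparison when hash codes collide.
 public sealed class Customer
 {
     public string Name { get; set; }
     public int Id { get; set; }

     public override int GetHashCode()
     {
         // XOR the per-field hash codes; cheap, and good enough for a
         // "fail fast" check and for hash-table bucketing.
         int nameHash = Name == null ? 0 : Name.GetHashCode();
         return nameHash ^ Id.GetHashCode();
     }

     public override bool Equals(object obj)
     {
         var other = obj as Customer;
         return other != null && Name == other.Name && Id == other.Id;
     }
 }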

Edit: See also these hashes here:


The actual ELF implementation returns an unsigned long, and the source uses an unsigned long internally. I can't say for sure, but my intuition is that your implementation simply discards too many interesting bits by working in an int.
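If you want to experiment with that idea, here is a speculative sketch (my reading of this answer, not a canonical ELF port) that keeps the intermediate state in a 64-bit ulong and only narrows at the end:

 // Speculative variant: same ELF mixing steps, but with 64-bit intermediate
 // state so fewer high bits are thrown away along the way.
 public int GenerateWide(byte[] key)
 {
     const ulong c = 0xf0000000;
     ulong h = 0, g;

     unchecked
     {
         for (int i = 0; i < key.Length; i++)
         {
             h = (h << 4) + key[i];
             if ((g = h & c) != 0)
                 h ^= g >> 24;
             h &= ~g;
         }

         // GetHashCode needs an int; folding the two halves together is one
         // way to keep some of the extra bits rather than simply truncating.
         return (int)(h ^ (h >> 32));
     }
 }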

