I am currently evaluating several general-purpose hash functions for use in Object.GetHashCode() overrides. I started with ELF, on the recommendation of this site. My C# implementation is below:
    public int Generate(byte[] key)
    {
        const uint c = 0xf0000000;
        uint h = 0, g = 0;
        unchecked
        {
            for (int i = 0, len = key.Length; i < len; i++)
            {
                h = (h << 4) + key[i];
                if ((g = h & c) != 0)
                    h ^= g >> 24;
                h &= ~g;
            }
        }
        return (int)h;
    }
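For context, the intended usage is roughly the following. This is only a sketch: Person and its fields are made-up names, and the key encoding is just one possible choice.

    // Sketch of the intended usage; Person, Name and Age are made-up names.
    using System.Text;

    public sealed class Person
    {
        public string Name { get; set; }
        public int Age { get; set; }

        // (Equals would be overridden to match in the real type.)
        public override int GetHashCode()
        {
            // Serialize the identity-relevant fields into a byte[] key and
            // hand it to the general-purpose hash function (ELF, from above).
            byte[] key = Encoding.UTF8.GetBytes(Name + "|" + Age);
            return Generate(key);
        }

        // The ELF implementation from above, repeated so the sketch compiles.
        private static int Generate(byte[] key)
        {
            const uint c = 0xf0000000;
            uint h = 0, g = 0;
            unchecked
            {
                for (int i = 0, len = key.Length; i < len; i++)
                {
                    h = (h << 4) + key[i];
                    if ((g = h & c) != 0)
                        h ^= g >> 24;
                    h &= ~g;
                }
            }
            return (int)h;
        }
    }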
My test case consists of 524,288 unique values, divided into short (1-64) and long (256-2048) strings (drawn from a limited set of ASCII characters) and arbitrary binary data (131,072 values in each category), to exercise each algorithm under different circumstances.
I also understand the limitations of this test case. A hash algorithm can work fine for hashing, say, URLs, but be awful at hashing JPEGs or something like that. Still, random strings/binary data seems like the best starting point for choosing a general-purpose function; I'd be glad to hear reasons why that is not the case.
I performed 3 separate test runs (each time generating a new set of random strings/bytes) and averaged the results.
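The collision counting itself is straightforward; a simplified sketch of the harness is below (HashTester and CountCollisions are just illustrative names, and the key generation is omitted):

    // Simplified sketch of the collision-counting harness. HashTester and
    // CountCollisions are illustrative names; the keys passed in are assumed
    // to be unique, so any repeated hash value counts as one collision.
    using System;
    using System.Collections.Generic;

    public static class HashTester
    {
        public static int CountCollisions(Func<byte[], int> hash, IEnumerable<byte[]> uniqueKeys)
        {
            var seen = new HashSet<int>();
            int collisions = 0;
            foreach (byte[] key in uniqueKeys)
            {
                if (!seen.Add(hash(key)))
                    collisions++;
            }
            return collisions;
        }
    }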
The ELF algorithm produced an awful number of collisions compared to the other algorithms I tested:
- Short strings: 817 collisions (~0.5% failure).
- Short binary: 550 collisions (~0.4% failure).
- Long strings/binary: 34 collisions (~0.025% failure).
To put this in context, the other 3 algorithms I tested averaged between 3 and 10 collisions for the same tests. ELF is also among the slowest of the 4, so at the moment it looks completely useless.
Full results:
                  Strings           Binary
    Algorithm     short    long     short    long
    ELF           817      40       550      28
    FNV           1.6      2        0.6      2.6
    OAT           9        9.6      14       5
    Jenkins*      2        1.3      12       3.6
* A close approximation of the lookup3 hash function.
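For reference, 32-bit FNV-1a (one of the algorithms in the table above) looks like this; this is a sketch of the published algorithm rather than necessarily my exact code, using the standard FNV offset basis and prime:

    // Standard 32-bit FNV-1a, shown for comparison with the ELF code above.
    // The offset basis (2166136261) and prime (16777619) are the published
    // FNV constants for 32-bit hashes.
    public int GenerateFnv1a(byte[] key)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;
        uint h = offsetBasis;
        unchecked
        {
            for (int i = 0, len = key.Length; i < len; i++)
            {
                h ^= key[i];   // FNV-1a: XOR the byte in first...
                h *= prime;    // ...then multiply by the FNV prime.
            }
        }
        return (int)h;
    }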
So, for the same random samples that ELF struggles with (again, across 3 separately generated sets), all of the other tried-and-tested algorithms produce far fewer collisions.
I looked around for variants of the ELF algorithm, but the few examples I found seem consistent with what I implemented. The only variant I saw came from this SO question: Using ELF to create a modified hash file. That variation includes h &= g >> 24 inside the if-block and caps the result at 31 bits. I tested this variation and it produced the same horrible results.
Have I done something subtle but terribly wrong? I can't understand why ELF performs so poorly here when it is claimed to be so widely used in Unix.