I am working on a system where hash collisions will be a problem. Essentially, there is a system that references items in a hash-table-plus-tree structure. However, the system first compiles text files containing paths through the structure into a binary file containing the hashed values, which is done for performance reasons. Because of this, collisions are very bad: the structure cannot store two items with the same hash value, and the part requesting an item will not have enough information to know which one it needs.
My initial thought is that two hashes, either using two different algorithms or the same algorithm twice with two different salts, would be more collision-resistant. Two items having the same hash under two different hashing algorithms would be very unlikely.
I was hoping to keep the hash value at 32 bits for space reasons, so I thought I'd switch to two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values ...
I know that switching to two 32-bit hashes would be more collision-resistant, but I'm wondering whether switching to two 16-bit hashes offers at least some gain over a single 32-bit hash. I am not the most mathematically inclined person, so I don't even know how to begin checking the answer, other than to brute-force it ...
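Here is as far as I got with the brute-force idea: a small sketch (my own test script, not part of the system) that just computes the birthday-problem collision probability for roughly 100 uniformly distributed keys, once for a single 32-bit hash and once for two independent 16-bit hashes treated as one combined value. Since the combined space of two 16-bit hashes is at most 2^16 * 2^16 = 2^32, I would expect the two numbers to come out the same, but I am not sure I am modeling this correctly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Birthday problem: probability that at least two of $n uniformly
# hashed keys collide in a space of $d possible hash values.
sub collision_probability {
    my ($n, $d) = @_;
    my $p_no_collision = 1;
    for my $i (0 .. $n - 1) {
        $p_no_collision *= (1 - $i / $d);
    }
    return 1 - $p_no_collision;
}

my $n = 100;    # roughly how many names I expect on any one level (see below)
printf "one 32-bit hash:      %.3g\n", collision_probability($n, 2**32);
printf "two 16-bit hashes:    %.3g\n", collision_probability($n, 2**16 * 2**16);
printf "a single 16-bit hash: %.3g\n", collision_probability($n, 2**16);
```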
Some background on the system:
Items are given names by people; they are not random strings and typically consist of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you have something like {a => {b => {c => 'blah'}}}, you get the value 'blah' by requesting the value of a/b/c. The compiled request would be three hash values in immediate sequence: the hash values of a, b, and then c.
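To make that concrete, here is a stripped-down sketch of what I mean (simplified; the real system compiles the requests into a binary file, and the 32-bit hash below is only a placeholder since I don't remember the actual algorithm):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Placeholder 32-bit hash; the real system uses a different algorithm.
sub hash32 {
    my ($name) = @_;
    return unpack('N', md5($name));   # first 32 bits of an MD5 digest
}

# The structure is keyed by hash values rather than by the original names.
my %tree = (
    hash32('a') => {
        hash32('b') => {
            hash32('c') => 'blah',
        },
    },
);

# A compiled request for "a/b/c" is just the three hash values in sequence.
my @request = map { hash32($_) } qw(a b c);

# Walk the structure level by level; a collision on any single level would
# make two different names map to the same slot here.
my $node = \%tree;
$node = $node->{$_} for @request;
print "$node\n";   # prints "blah"
```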
The problem only occurs when there is a collision on a given level. A collision between an item at the top level and one at a lower level is fine. You can have {a => {a => {...}}}, which almost guarantees collisions, but they are on different levels (not a problem).
In practice, any given level is likely to have fewer than 100 values to hash, and none of them will be duplicates on the same level.
To test the hashing algorithm I had adopted (I forgot which one it was, but I didn't invent it), I downloaded the entire list of CPAN Perl modules, split all of the namespaces/modules into unique words, and finally hashed each one looking for collisions; I encountered 0 collisions. That means the algorithm has a different hash value for every unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
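For what it's worth, the test looked roughly like the script below (reconstructed from memory: the module list file name, splitting the names on '::', and the stand-in 32-bit hash are all assumptions here, since I no longer remember exactly which algorithm I used):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Stand-in 32-bit hash; the real test used whatever algorithm the system uses.
sub hash32 {
    my ($word) = @_;
    return unpack('N', md5($word));
}

# module_list.txt: one CPAN module name per line, e.g. "Net::FTP::Common"
open my $fh, '<', 'module_list.txt' or die "can't open module list: $!";

my %seen_words;
while (my $line = <$fh>) {
    chomp $line;
    # Split namespaces like Foo::Bar::Baz into their component words.
    $seen_words{$_} = 1 for split /::/, $line;
}
close $fh;

# Hash every unique word and look for two words sharing a hash value.
my %hash_to_word;
my $collisions = 0;
for my $word (keys %seen_words) {
    my $h = hash32($word);
    if (exists $hash_to_word{$h} && $hash_to_word{$h} ne $word) {
        print "collision: '$word' vs '$hash_to_word{$h}' ($h)\n";
        $collisions++;
    }
    $hash_to_word{$h} = $word;
}
printf "%d unique words, %d collisions\n", scalar(keys %seen_words), $collisions;
```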