Is there a difference in collision rate between one 32-bit hash and two 16-bit hashes?

I am working on a system where hash collisions are going to be a problem. Essentially, there is a system that references items in a hash-table-plus-tree structure. However, the system in question first compiles text files containing paths in the structure into a binary file containing the hashed values. This is done for performance reasons. Because of that, collisions are very bad: the structure cannot store two elements with the same hash value, since the part requesting an item will not have enough information to tell which one it needs.

My initial thought is that two hashes, either from two different algorithms or from the same algorithm with two different salts, would be more collision resistant. It seems very unlikely that two elements would hash to the same value under two different hashing algorithms.

I was hoping to keep the hash value at 32 bits for space reasons, so I thought of switching to two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...

I know that switching to two 32-bit hashes would be more collision resistant, but I wonder whether switching to two 16-bit hashes gives at least some gain over a single 32-bit hash. I am not the most mathematically inclined person, so I don't even know how to start checking the answer other than brute-forcing it...

Some background on the system:

Elements are named by people; they are not random strings, and they usually consist of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you have something like {a => {b => {c => 'blah'}}} and you want the value "blah", you request the value of a/b/c. The compiled request will be three hash values in immediate sequence: the hashes of a, b, and then c.
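
A minimal sketch of what that compiled lookup could look like, in Perl. The hash32 helper here is a stand-in (the first four bytes of an MD5 digest); the question does not name the actual algorithm used:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Stand-in 32-bit hash: the first 4 bytes of an MD5 digest. The system's
    # real algorithm is unspecified; this is just for illustration.
    sub hash32 {
        my ($str) = @_;
        return unpack 'N', md5($str);    # big-endian 32-bit unsigned
    }

    # Compiling the path "a/b/c" yields one hash value per level; the runtime
    # lookup then walks the nested structure level by level, comparing hashes
    # only against keys at the same depth.
    my @compiled = map { hash32($_) } split m{/}, 'a/b/c';
    printf "compiled request: %08x %08x %08x\n", @compiled;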

The problem only occurs when there is a collision at a given level. A collision between an element at an upper level and one at a lower level is fine. You can have {a => {a => {...}}}, which practically guarantees collisions, but they are at different levels (no problem).

In practice, any given level is likely to have fewer than 100 values to hash, and none of them will be duplicates within the same level.

To test the hashing algorithm I adopted (I forget which one it was, but I didn't invent it), I downloaded the full list of CPAN Perl modules, split all of the namespaces/modules into unique words, and finally hashed each one while looking for collisions. I encountered 0 collisions. That means the algorithm produces a different hash value for every unique word in the CPAN namespace list (or that I did it wrong). That sounds good enough to me, but it still nags at my brain.
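
A rough sketch of that kind of test, again with a hypothetical hash32 standing in for whichever algorithm was actually used:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Stand-in 32-bit hash (first 4 bytes of MD5); substitute the system's
    # real algorithm here to reproduce the original test.
    sub hash32 { unpack 'N', md5($_[0]) }

    # Read one word per line (e.g. CPAN namespaces already split into words),
    # hash each one, and report any value produced by two different words.
    my %seen;
    my $collisions = 0;
    while (my $word = <STDIN>) {
        chomp $word;
        my $h = hash32($word);
        if (exists $seen{$h} && $seen{$h} ne $word) {
            warn "collision: '$seen{$h}' vs '$word' -> $h\n";
            $collisions++;
        }
        $seen{$h} //= $word;
    }
    print "collisions found: $collisions\n";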

1 answer

If you have two 16-bit hashes that produce uncorrelated values, then you have just described a 32-bit hash algorithm. It will not be better or worse than any other 32-bit hash algorithm.
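
A sketch of why, assuming two placeholder 16-bit hashes built by salting MD5 differently (the salts and helper names are made up for illustration): concatenating the two outputs is itself just a 32-bit hash of the key.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Two "independent" 16-bit hashes, faked here by salting MD5 with the
    # arbitrary prefixes 'A' and 'B' and keeping 16 bits of each digest.
    sub hash16_a { unpack 'n', md5('A' . $_[0]) }
    sub hash16_b { unpack 'n', md5('B' . $_[0]) }

    # Concatenating the two outputs yields just another 32-bit hash of the key:
    # the value space is still 2**32, so the collision odds are no better than
    # those of a single well-behaved 32-bit algorithm.
    sub combined32 {
        my ($str) = @_;
        return (hash16_a($str) << 16) | hash16_b($str);
    }

    printf "combined hash of 'Some::Module': %08x\n", combined32('Some::Module');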

If you are concerned about collisions, make sure you use a hash algorithm that does a good job of hashing your data (some are written only to be fast to compute, which is not what you want), and increase the size of your hash until you are comfortable.

The question then becomes one of the probability of collisions. It turns out that if you have n things in your collection, there are n*(n-1)/2 pairs that could collide. If you use a k-bit hash, the probability that any single pair collides is 2^-k. If you have a lot of things, the collision probabilities of different pairs are almost uncorrelated. This is exactly the situation that the Poisson distribution describes.

Thus the number of collisions you see should roughly follow a Poisson distribution with λ = n*(n-1)*2^-(k+1). The probability of no hash collisions is therefore approximately e^-λ. With 32 bits and 100 elements, the chance of a collision on one level is about 1.15 in a million. If you do this enough times, with enough different data sets, eventually those one-in-a-million chances will add up.
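
A small sketch of that arithmetic, using the approximation above (the function name is just for illustration):

    use strict;
    use warnings;

    # Birthday-problem approximation: with n keys and a k-bit hash,
    # lambda = n*(n-1)/2 / 2**k and P(at least one collision) ~ 1 - exp(-lambda).
    sub collision_probability {
        my ($n, $bits) = @_;
        my $lambda = $n * ($n - 1) / 2 / 2**$bits;
        return 1 - exp(-$lambda);
    }

    # 100 keys on a single level with a 32-bit hash: roughly 1.15 in a million.
    printf "p(collision) = %.4g\n", collision_probability(100, 32);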

But note that if you have a lot of normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your collision risk. That is because every item you add to a collection can collide with any of the preceding items, so more items means a higher collision risk. For example, a single level with 1000 data items has about a 1 in 10,000 chance of a collision, which is roughly the same risk as 100 levels with 100 data items each.
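
The comparison from the previous paragraph, computed with the same approximation:

    use strict;
    use warnings;

    # Expected collisions grow roughly with n**2, so one big level dominates:
    # a single 1000-key level carries about the same total risk as 100 separate
    # levels of 100 keys each.
    sub lambda { my ($n, $bits) = @_; $n * ($n - 1) / 2 / 2**$bits }

    printf "one level of 1000 keys:       %.3g\n", lambda(1000, 32);
    printf "100 levels of 100 keys each:  %.3g\n", 100 * lambda(100, 32);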

If the hashing algorithm does not do its job properly, your collision risk will go up rapidly. How fast depends very much on the nature of the failure.

Using these facts and your projections of how your application will be used, you should decide whether you are comfortable with the risk from 32-bit hashes, or whether you need to move up to something bigger.
