Unique 64-bit URL Identification

This is basically a mathematical problem, but very programming-related: if I have 1 billion lines containing URLs and I take the first 64 bits of each URL's MD5 hash, what collision frequency should I expect?

How does the answer change if I have only 100 million URLs?

It seems to me that collisions will be extremely rare, but these things tend to be confusing.

Would it be better to use something other than MD5? Keep in mind, I'm not looking for security, just a good fast hash function. Good MySQL support would also be a plus.
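
For concreteness, here is a minimal sketch (not from the question itself) of the scheme described above: truncating an MD5 digest to 64 bits so it can be stored as an unsigned 64-bit integer, e.g. a BIGINT UNSIGNED column in MySQL. The function name and example URL are illustrative only.

    import hashlib

    def url_key_64(url: str) -> int:
        """Return the first 64 bits of the URL's MD5 digest as an unsigned integer."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The result fits in an unsigned 64-bit column (e.g. BIGINT UNSIGNED in MySQL).
    print(url_key_64("http://example.com/some/page"))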

EDIT: not exactly a duplicate

+7
hash-collision hash birthday-paradox
5 answers

If the first 64 bits of MD5 constituted a perfectly distributed hash, the birthday paradox would still mean you should expect collisions around every 2^32 URLs. In other words, the probability of a collision is roughly the number of URLs divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for more details.
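
As a rough numeric check (a sketch, not part of the original answer), the standard birthday-paradox approximation P ≈ 1 - exp(-n(n-1) / 2^65) can be evaluated directly for the two cases in the question; the function name is just for illustration.

    import math

    def collision_probability(n, bits=64):
        """Approximate probability of at least one collision among n random hash values."""
        return 1.0 - math.exp(-n * (n - 1) / 2 ** (bits + 1))

    print(collision_probability(10**9))  # ~0.027, about a 2.7% chance for 1 billion URLs
    print(collision_probability(10**8))  # ~0.00027, about 0.03% for 100 million URLs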

I would not feel comfortable just throwing out half the bits of MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I would not bother with it at all. If you want blazing speed with a good distribution, but no pretense of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details.
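
A minimal sketch of the XOR-folding idea, using only the standard library (the function name is illustrative; MurmurHash itself would require a third-party package):

    import hashlib

    def md5_folded_64(url: str) -> int:
        """XOR the high and low 64-bit words of the MD5 digest instead of just truncating it."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        high = int.from_bytes(digest[:8], "big")
        low = int.from_bytes(digest[8:], "big")
        return high ^ low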

+6

Since you tagged this "birthday-paradox," I think you already know the answer.

P(collision) = 1 - (2^64)! / ((2^64)^n * (2^64 - n)!)

where n is 1 billion in your case.
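
For n as large as 10^9 the factorial form is impractical to evaluate directly; a common approximation (an addition, not part of the original answer) is:

    P(collision) ≈ 1 - exp(-n(n - 1) / (2 * 2^64)) ≈ 0.027  for n = 10^9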

You would be slightly better off using something other than MD5, because MD5 has known problems with collisions.

+2

From what I see, you need a hash function with the following requirements:

  • Hashes strings of arbitrary length to a 64-bit value
  • Is good at avoiding collisions
  • Is not necessarily one-way (no security required)
  • Is preferably fast - a desirable characteristic for a non-security application

This overview of hash functions may be useful for drilling down to the function that is most suitable for you.
I suggest trying a few functions from it and characterizing them against your likely set of inputs (a sample of the billion URLs you expect to see).

In effect, you can generate another column like those in that survey for your test URLs, to characterize and select from the existing hash functions or any new ones (more rows in the table) that you want to test. They provide MSVC++ source code to get you started (linked as a ZIP).

Adapting the hash functions to your output width (64 bits) will give you a more accurate characterization for your application.
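
To make the characterization concrete, here is a minimal collision-counting sketch (not part of the original answer); the sample file name and candidate hash function are hypothetical placeholders.

    from collections import Counter

    def count_collisions(urls, hash64):
        """Count URLs whose 64-bit hash value is shared with at least one other URL."""
        counts = Counter(hash64(u) for u in urls)
        return sum(c - 1 for c in counts.values() if c > 1)

    # Hypothetical usage with a sample file of URLs and any candidate 64-bit hash:
    # with open("urls_sample.txt") as f:
    #     sample = [line.strip() for line in f]
    # print(count_collisions(sample, some_candidate_hash64))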

+2

If you have 2^n possible hash values, there is roughly a 50% chance of a collision once you have about 2^(n/2) elements.

e.g. if your hash is 64 bits, you have 2^64 possible hash values, so you will have roughly a 50% chance of a collision once you have about 2^32 elements in the collection.
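
For reference, the slightly more precise form of this rule of thumb (an addition, not part of the original answer) for a b-bit hash is:

    n(50%) ≈ sqrt(2 * ln 2) * 2^(b/2) ≈ 1.18 * 2^(b/2)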

+2

With just a hash, there is always a chance of a collision. And you do not know in advance whether collisions will occur once or twice, or even hundreds or thousands of times, in your list of URLs.

Probability is still just probability. It's like rolling a die 10 or 100 times: what are the chances of rolling all sixes? Probability says it is low, but it can still happen. Maybe even many times in a row...

So, while the birthday paradox shows you how to calculate the probabilities, you still need to decide whether collisions are acceptable or not.

... and if collisions are acceptable, hashing remains the right approach; then find a 64-bit hashing algorithm with a good distribution instead of relying on half of MD5. (Although it probably does have one...)

+1
