This is basically a mathematical problem, but very programming related: if I have 1 billion lines containing URLs and I take the first 64 bits of each URL's MD5 hash, how many collisions should I expect?
How does the answer change if I have only 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
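To make the question concrete, here is a rough sketch of the estimate I have in mind. It assumes the truncated MD5 behaves like a uniformly random 64-bit value and that all URLs are distinct; the function names are my own, purely for illustration.

```python
import hashlib
import math


def md5_first_64_bits(url: str) -> int:
    """Return the first 8 bytes (64 bits) of the MD5 digest as an unsigned int."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def expected_collisions(n: int, bits: int = 64) -> float:
    """Expected number of colliding pairs among n uniform random `bits`-bit values.

    There are n*(n-1)/2 pairs, and each pair collides with probability 1/2**bits.
    """
    return n * (n - 1) / 2 / 2**bits


def probability_of_any_collision(n: int, bits: int = 64) -> float:
    """Approximate chance of at least one collision (Poisson approximation)."""
    # 1 - exp(-x), computed with expm1 to stay accurate for very small x
    return -math.expm1(-expected_collisions(n, bits))


if __name__ == "__main__":
    for n in (10**9, 10**8):
        print(n, expected_collisions(n), probability_of_any_collision(n))
```

If that assumption holds, this gives roughly 0.027 expected colliding pairs for 1 billion URLs and roughly 0.00027 for 100 million, since the count scales with the square of the number of URLs.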
Would it be better to use something other than MD5? Keep in mind, I'm not looking for security, just a good, fast hash function. Good MySQL support would also be a plus.
EDIT: not exactly a duplicate
hash-collision hash birthday-paradox
itsadok