Universal hashing
To calculate the probability of collisions for S strings of length L, with W bits per character, hashed down to H bits, assuming an optimal universal hash (1), you work out the probability of a collision in a hash table with N buckets.
First, assume an ideal hash table implementation (2) that perfectly spreads the H bits of the hash across the available N buckets (3). This means H becomes largely irrelevant except as a limit on N, and W and L merely set an upper bound on S. To keep the maths simple, suppose strings shorter than L are padded to length L with a special null character. The worst case is then 54^L possible strings (26 * 2 letters + '_' + the null character), which is a ridiculously large number; the actual number of entries S is more useful than the character set and length, so we will simply treat S as a variable in its own right.
That leaves us trying to place S items into N buckets, which is the well-known birthday problem (birthday paradox).
Solving this for various probabilities and numbers of buckets is instructive: even with 1 billion buckets (roughly 4 GB of memory on a 32-bit system), we need only about 37 thousand entries before there is a 50% chance of at least one collision. Given that, trying to avoid any collisions at all in a hash table is clearly futile.
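As a rough sanity check, here is a minimal Python sketch (function name is mine, purely for illustration) that reproduces that figure using the standard birthday-problem approximation:

```python
import math

def entries_for_collision_probability(buckets: int, p: float = 0.5) -> float:
    """Approximate number of uniformly hashed entries needed before the
    probability of at least one collision across `buckets` buckets reaches p.
    Standard birthday-problem approximation: n ~ sqrt(2 * N * ln(1/(1-p))).
    """
    return math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p)))

# With 1 billion buckets, ~37 thousand entries already give a 50% chance
# of at least one collision.
print(round(entries_for_collision_probability(10**9)))  # ~37233
```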
None of this means we should not care about the behaviour of our hash functions. These numbers assume ideal implementations; they are an upper bound on how well we can do. A bad hash function can cluster collisions heavily in some regions and leave parts of the possible "space" rarely or never used, all of which makes the hashing less effective and can degrade lookups to list-like performance, but with much worse constant factors.
The .NET string hash function implementation is slightly sub-optimal (in the sense that it could be better), but it is probably acceptable to the vast majority of users and is reasonably efficient to calculate.
Alternative Approach: Perfect Hashing
If you wish, you can generate so-called perfect hashes, but this requires full knowledge of the input values in advance, so it is not often useful. In a similar vein to the maths above, we can show that even perfect hashing has its limits:
Recall the limit of 54^L strings of length L. However, we only have H bits (say 32), which gives about 4 billion distinct values. Therefore, if you can truly have any string, and any number of them, then you must satisfy:
54 ^ L <= 2 ^ 32
And solving this:
log2(54 ^ L) <= 32
L * log2(54) <= 32
L <= 32 / log2(54)
L <= 5.56
Since string lengths clearly cannot be fractional, you are left with a maximum length of only 5 characters. Very short.
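A quick check of that arithmetic, as a throwaway Python snippet (the variable names are mine):

```python
import math

H = 32          # hash width in bits
alphabet = 54   # 26 * 2 letters + '_' + the null padding character

# Maximum string length a 32-bit perfect hash could cover if every string
# over this alphabet must map to a distinct value.
print(H / math.log2(alphabet))     # ~5.56, so at most 5 whole characters
print(alphabet ** 5 <= 2 ** H)     # True:  54^5 ~ 459 million fits in 32 bits
print(alphabet ** 6 <= 2 ** H)     # False: 54^6 ~ 24.8 billion does not
```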
If you know you will only ever have a set of fewer than 4 billion strings, then perfect hashing can handle any value of L, but constraining the set of values is very difficult in practice, and you must either know them all in advance or degrade to something that effectively builds a string → hash database and adds to it as new strings are encountered.
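That degraded approach is essentially string interning. A minimal sketch of the idea, assuming nothing more than a plain dictionary (the class and method names are hypothetical):

```python
class StringIdAssigner:
    """Assigns each distinct string a unique, stable integer id on first sight.
    This is the 'string -> hash database' fallback: collision-free by
    construction, but only because every string ever seen must be stored.
    """
    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def id_for(self, s: str) -> int:
        if s not in self._ids:
            self._ids[s] = len(self._ids)  # next unused id
        return self._ids[s]

assigner = StringIdAssigner()
print(assigner.id_for("foo"))  # 0
print(assigner.id_for("bar"))  # 1
print(assigner.id_for("foo"))  # 0 again -- stable, no collisions
```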
(1) For this exercise a universal hash is optimal, since we want to minimise the likelihood of any collision, i.e. for any input, the probability that it yields a given output x from the set of R possibilities is 1/R.
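For illustration only, here is a sketch of one classic universal family for integer keys, the Carter-Wegman ((a*x + b) mod p) mod m construction; the function name and prime choice are my own assumptions:

```python
import random

def make_universal_hash(num_buckets: int, prime: int = 2_305_843_009_213_693_951):
    """Return a randomly chosen member of the Carter-Wegman universal family
    h(x) = ((a*x + b) mod p) mod m, for integer keys smaller than `prime`
    (here the Mersenne prime 2^61 - 1). For any two distinct keys, the
    collision probability over the random choice of (a, b) is ~1/num_buckets.
    """
    a = random.randrange(1, prime)  # a must be non-zero
    b = random.randrange(0, prime)
    return lambda x: ((a * x + b) % prime) % num_buckets

h = make_universal_hash(1_000_000_000)
print(h(12345), h(67890))
```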
(2) Note that getting the hashing (and the internal bucketing) right is pretty tricky, but you should expect the built-in types to be reasonable, if not always ideal.
(3) In this example I have ignored the question of closed vs. open addressing. It has some bearing on the probabilities, but not significantly.