If our key set is an integer, this means that the prime must be the next larger data type, for example. long.
It's not a problem. Otherwise, the hash family may not be universal. See below for more details.
But in the end, everything we get, like a hash, must be discarded before int to index the hash table.
Does this downward effect on the quality of universal hashing (I mean the distribution of keys across buckets) somehow?
The answer is no. I'll try to explain.
Whether p different data type or not, it doesn't matter if the hash family is universal. The important thing is that p is equal to or greater than u (the maximum integer of the universe is integers). It is important that p be large enough (i.e. >= u ).
The hash family is universal if the collision probability is equal to or less than 1/m .
So the idea is to keep this restriction.
The value of p , theoretically, can be as long or more. It just has to be whole and simple.
u is the size of the domain / universe (or the number of keys). Given the universe U = {0, ..., u-1} , u denotes the size |U| .m - the number of boxes or bucketsp is a prime number that must be equal to or greater than n- a hash family is defined as
H = {h(a,b)(x)} with h(a,b)(x) = ((a * x + b) mod p) mod m . Note that a and b are randomly selected integers (out of all possible integers, therefore theoretically there can be more than p ) modulo a prime p (which can make them either smaller or more than m , the number of bins / buckets) ; but here is the data type (the range of values ββdoes not matter). See Integer Hashing on Wikipedia for a notation. - Follow the proof on Wikipedia and you conclude that the collision probability is
_p/m_ * 1/(p-1) (underscores mean truncation of decimal places). For p >> m ( p significantly greater than m ), the probability tends to 1/m (but this does not mean that the probability will be better, the greater is p ).
In other words, answering your question: p , which is a larger data type, is not a problem here and may even be required. p must be equal to or greater than u , and a and b must be randomly selected integers modulo p , regardless of the number of buckets m . With these restrictions, you can build a universal family of hashes.
Maybe a math example might help
- Let U be the universe of integers corresponding to an
unsigned char (for example, in C). Then U = {0, ..., 255} Let p be (the next one possible) just equal to or greater than 256 . Note that p can be any of these types ( short , int , long be signed or unsigned). The fact is that the data type does not play a role (in programming, the type basically means the range of values.). 257 short , int or long does not matter much here for the sake of the correctness of the mathematical proof. We could also choose a larger p (i.e., a larger data type); this does not alter the validity of the proof.
- The next possible prime number will be
257 . - We say that we have
25 buckets, i.e. m = 25 . This means that the hash family will be universal if the collision probability is equal to or less than 1/25 , i.e. About 0.04 . - Enter the values ββfor
_p/m_ * 1/(p-1) : _257/25_ * 1/256 = 10/256 = 0.0390625 , which is less than 0.04 . This is a universal hash family with selected parameters.
We could choose buckets m = u = 256 . Then we would have a collision probability of 0.003891050584 , which is less than 1/256 = 0,00390625 . The Hash family is still universal.
Try with m be larger than p , for example. m = 300 . The collision probability is 0, which is less than 1/300 ~= 0.003333333333 . Trivially, we had more buckets than keys. Still universal, no collisions.
Implementation example
We have the following: