Is it possible to implement universal hashing for the whole range of integers?

I read about universal hashing for integers. Assumed condition and a prerequisite is that we choose a prime p greater than the set of all possible keys .

I do not understand about this.

If our key set is of type int , this means that the prime number must be the next larger data type, for example. long

But in the end, everything that we get as a hash would need to be dumped to int in order to index the hash table. Does this not affect the quality of universal hashing (I mean the distribution of keys across buckets)?

+7
types hash integer-hashing
source share
1 answer

If our key set is an integer, this means that the prime must be the next larger data type, for example. long.

It's not a problem. Otherwise, the hash family may not be universal. See below for more details.

But in the end, everything we get, like a hash, must be discarded before int to index the hash table.

Does this downward effect on the quality of universal hashing (I mean the distribution of keys across buckets) somehow?

The answer is no. I'll try to explain.

Whether p different data type or not, it doesn't matter if the hash family is universal. The important thing is that p is equal to or greater than u (the maximum integer of the universe is integers). It is important that p be large enough (i.e. >= u ).

The hash family is universal if the collision probability is equal to or less than 1/m .

So the idea is to keep this restriction.

The value of p , theoretically, can be as long or more. It just has to be whole and simple.

  • u is the size of the domain / universe (or the number of keys). Given the universe U = {0, ..., u-1} , u denotes the size |U| .
  • m - the number of boxes or buckets
  • p is a prime number that must be equal to or greater than n
  • a hash family is defined as H = {h(a,b)(x)} with h(a,b)(x) = ((a * x + b) mod p) mod m . Note that a and b are randomly selected integers (out of all possible integers, therefore theoretically there can be more than p ) modulo a prime p (which can make them either smaller or more than m , the number of bins / buckets) ; but here is the data type (the range of values ​​does not matter). See Integer Hashing on Wikipedia for a notation.
  • Follow the proof on Wikipedia and you conclude that the collision probability is _p/m_ * 1/(p-1) (underscores mean truncation of decimal places). For p >> m ( p significantly greater than m ), the probability tends to 1/m (but this does not mean that the probability will be better, the greater is p ).

In other words, answering your question: p , which is a larger data type, is not a problem here and may even be required. p must be equal to or greater than u , and a and b must be randomly selected integers modulo p , regardless of the number of buckets m . With these restrictions, you can build a universal family of hashes.


Maybe a math example might help

  • Let U be the universe of integers corresponding to an unsigned char (for example, in C). Then U = {0, ..., 255}
  • Let p be (the next one possible) just equal to or greater than 256 . Note that p can be any of these types ( short , int , long be signed or unsigned). The fact is that the data type does not play a role (in programming, the type basically means the range of values.). 257 short , int or long does not matter much here for the sake of the correctness of the mathematical proof. We could also choose a larger p (i.e., a larger data type); this does not alter the validity of the proof.

    • The next possible prime number will be 257 .
    • We say that we have 25 buckets, i.e. m = 25 . This means that the hash family will be universal if the collision probability is equal to or less than 1/25 , i.e. About 0.04 .
    • Enter the values ​​for _p/m_ * 1/(p-1) : _257/25_ * 1/256 = 10/256 = 0.0390625 , which is less than 0.04 . This is a universal hash family with selected parameters.

We could choose buckets m = u = 256 . Then we would have a collision probability of 0.003891050584 , which is less than 1/256 = 0,00390625 . The Hash family is still universal.

Try with m be larger than p , for example. m = 300 . The collision probability is 0, which is less than 1/300 ~= 0.003333333333 . Trivially, we had more buckets than keys. Still universal, no collisions.


Implementation example

We have the following:

  • x type int (element |U| )
  • a , b , p type long
  • m , we will see below in the example

    • Choose p so that it is greater than max u (element |U| ), p is of type long .
    • Choose a and b (modulo p ) randomly. They are of type long , but always < p .
    • For x (of type int from u ) compute ((a*x+b) mod p) . a*x is of type long , (a*x+b) also of type long , and therefore ((a*x+b) mod p also of type long . Note that the result is ((a*x+b) mod p) is equal to < p . Denote this result by h_a_b(x) .
    • h_a_b(x) now taken modulo m , which means that at this stage it depends on the data type m , whether there will be a decrease or not. However, this does not really matter. h_a_b(x) < m , because we take it modulo m . Therefore, the value h_a_b(x) modulo m fits into the data type m . In the event that it must be suppressed, there will be no loss of value. And so you matched the key with the basket / bucket.
+3
source share

All Articles