Minhash implementation of how to find hash functions for permutations

Question

Minhash implementation of how to find hash functions for permutations

I have a problem with performing mini-chaning. On paper and from reading, I understand the concept, but my problem is the trick substitution. Instead of rearranging the matrix of sets and values, the sentence to implement: "select k (for example, 100) independent hash functions", and then the algorithm says:

for each row r for each column c if c has 1 in row r for each hash function h_i do if h_i(r) is a smaller value than M (i, c) then M(i, c) := h_i(r)

In various small examples and teaching a book, they use only two or three hash functions in the form (h = a * x + b mod p). It is easy to find, but as in practice, how can I find 100 such independent functions.

In the Java example , hash values are generated from only one hash function instead of several hash functions, regardless of the row index. Where is the difference? My question is how to find these independent hash functions, or if there is an approach with a single hash function, how to handle these values in the algorithm?

+4

algorithm hash-function implementation minhash

user2359192 24 sept '13 at 8:50

source share

2 answers

iampat · Answer 1 · 2013-10-08T20:10:33+0000

One easy way is to use a parametric hash family, such as tab hash functions ( http://en.wikipedia.org/wiki/Tabulation_hashing )

In the example book (a * x + b mod p), by choosing different sets (a, b, p), you can have a different hash function. One way to have independent hash functions is to select (a, b, p) prime / co-prime and preferably large numbers

Ben whitmore · Answer 2 · 2017-09-26T01:19:00+0000

According to iampat's answer you can use tab hashing ( http://en.wikipedia.org/wiki/Tabulation_hashing ).

Another very effective option that gives good results is to use one good quality hash function (like FNV_1a) to create a master hash, and then change that using 100 different combinations of XOR and bitroll.

To generate each hash, you accept a master hash, bitrate it at a given distance, then XOR it at a given value. XOR bits and values are randomly selected for each of the 100 hash functions. See this discussion for more information.

Some people recommend multiplying instead of XOR, in which case you can choose primes.

Minhash implementation of how to find hash functions for permutations

More articles: