Hash function generator in python

I am looking for a hash family generator that could generate a family of hash functions defined by a set of parameters. So far I have not found such a generator. Is there a way to do this with the hashlib package?

For example, I would like to do something like:

 h1 = hash_function(1) h2 = hash_function(2) ... 

and h1 and h2 will be different hash functions.

For those of you who may know about this, I am trying to implement a minimum hashing algorithm on a very large dataset.

Basically, I have a very large set of functions (from 100 million to 1 billion) for this document, and I need to create from 1000 to 10,000 different random permutations for this set of functions.

I DO NOT want to build random permutations explicitly, so the method I would like to use in the following:

  • generate a hash function h and consider that for two indices r and s
  • r appears before s in the permutation if h(r) < h(s) and does this for 100 - 1000 different hash functions.

Are there any famous libraries that I could skip? Or any standard way to generate python hash families that you might know about?

+7
python generator hash
source share
3 answers

I would just do something like (if you don't need thread safety - it's not hard to change if you need thread safety - and the 32-bit version of Python is assumed):

 import random _memomask = {} def hash_function(n): mask = _memomask.get(n) if mask is None: random.seed(n) mask = _memomask[n] = random.getrandbits(32) def myhash(x): return hash(x) ^ mask return myhash 
+6
source share

You must use universal hashing. My answer and code can be found here: https://stackoverflow.com/a/167883/ ...

0
source share

As mentioned above, you can use universal hashing for minhash. For example:

 import random def minhash(): d1 = set(random.randint(0, 2000) for _ in range(1000)) d2 = set(random.randint(0, 2000) for _ in range(1000)) jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2)) print("jaccard similarity: {}".format(jacc_sim)) N_HASHES = 200 hash_funcs = [] for i in range(N_HASHES): hash_funcs.append(universal_hashing()) m1 = [min([h(e) for e in d1]) for h in hash_funcs] m2 = [min([h(e) for e in d2]) for h in hash_funcs] minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES print("min-hash similarity: {}".format(minhash_sim)) def universal_hashing(): def rand_prime(): while True: p = random.randrange(2 ** 32, 2 ** 34, 2) if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)): return p m = 2 ** 32 - 1 p = rand_prime() a = random.randint(0, p) if a % 2 == 0: a += 1 b = random.randint(0, p) def h(x): return ((a * x + b) % p) % m return h 

Link

0
source share

All Articles