How to create a unique equal hash for equal dictionaries?

For example:

    >>> a = {'req_params': {'app': '12345', 'format': 'json'}, 'url_params': {'namespace': 'foo', 'id': 'baar'}, 'url_id': 'rest'}
    >>> b = {'req_params': { 'format': 'json','app': '12345' }, 'url_params': { 'id': 'baar' ,'namespace':'foo' }, 'url_id': 'REST'.lower() }
    >>> a == b
    True

What is a good hash function to create equal hashes for both dicts? Dictionaries will have basic data types such as int, list, dict and string, no other objects.

It would be great if the hash were optimized for space. The target set is about 5 million objects, so the likelihood of a collision is pretty low.

I'm not sure whether json.dumps or other serializers respect equality rather than the order of the members in a dictionary. For example, basic hashing using str(dict) does not work:

    >>> a = {'name':'Jon','class':'nine'}
    >>> b = {'class':'NINE'.lower(),'name':'Jon'}
    >>> str(a)
    "{'name': 'Jon', 'class': 'nine'}"
    >>> str(b)
    "{'class': 'nine', 'name': 'Jon'}"

json.dumps doesn't work either:

    >>> import json,hashlib
    >>> a = {'name':'Jon','class':'nine'}
    >>> b = {'class':'NINE'.lower(),'name':'Jon'}
    >>> a == b
    True
    >>> ha = hashlib.sha256(json.dumps(a)).hexdigest()
    >>> hb = hashlib.sha256(json.dumps(b)).hexdigest()
    >>> ha
    '545af862cc4d2dd1926fe0aa1e34ad5c3e8a319461941b33a47a4de9dbd7b5e3'
    >>> hb
    '4c7d8dbbe1f180c7367426d631410a175d47fff329d2494d80a650dde7bed5cb'
4 answers

The pprint module sorts dict keys:

    from pprint import pformat
    hash(pformat(a)) == hash(pformat(b))

If you want to save hashes, you must use the hash from hashlib. sha1 is much
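As a rough sketch of the hashlib variant: on Python 3 the built-in hash() of a string is salted per process by default, so its value is not suitable for storage, whereas a hashlib digest of the pformat output is reproducible. The function name stable_digest is hypothetical, and the sketch assumes the default pformat settings, which sort dict keys.

```python
import hashlib
from pprint import pformat


def stable_digest(d):
    """Hex SHA-1 digest of the key-sorted pretty-printed form of d."""
    # pformat sorts dict keys by default, so equal dicts built in a
    # different key order yield the same text.
    text = pformat(d)
    # hashlib works on bytes, so encode the text first.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


a = {'name': 'Jon', 'class': 'nine'}
b = {'class': 'NINE'.lower(), 'name': 'Jon'}
same = stable_digest(a) == stable_digest(b)
```

Unlike hash(), the digest does not change between interpreter runs, so it can be written to a file or a database.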


Why not sort before hashing? Sorting costs some time, though probably a negligible amount, and in exchange you can keep using a "good" hash function, that is, one with good dispersion and the other desirable properties. Moreover, if the goal is to save space, that is presumably because you expect a large number of entries. In that case, whatever time you save by skipping the sort and using a "bad" hash function will be dwarfed by the lookup time lost to the large number of collisions it causes.
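One way to sort before hashing is sketched below. It assumes the dictionaries contain only int, str, list and dict values with string keys, as in the question. The names canonical and dict_digest are hypothetical, and the 8-byte digest size is only an illustrative choice for saving space: with about 5 million objects, a 64-bit digest keeps the chance of any collision well below one in a million.

```python
import hashlib


def canonical(obj):
    """Convert nested dicts and lists into an order-independent tuple form."""
    if isinstance(obj, dict):
        # Sort by key only; keys are unique, so values are never compared.
        items = sorted(obj.items(), key=lambda kv: kv[0])
        return ("dict", tuple((k, canonical(v)) for k, v in items))
    if isinstance(obj, list):
        # List order is significant, so it is preserved.
        return ("list", tuple(canonical(v) for v in obj))
    return obj


def dict_digest(d, size=8):
    """Hex digest of the canonical form; size is the digest length in bytes."""
    data = repr(canonical(d)).encode("utf-8")
    return hashlib.blake2b(data, digest_size=size).hexdigest()


a = {'req_params': {'app': '12345', 'format': 'json'},
     'url_params': {'namespace': 'foo', 'id': 'baar'}, 'url_id': 'rest'}
b = {'req_params': {'format': 'json', 'app': '12345'},
     'url_params': {'id': 'baar', 'namespace': 'foo'}, 'url_id': 'REST'.lower()}
same = dict_digest(a) == dict_digest(b)
```

The "dict" and "list" tags keep a dictionary from colliding with a list of key/value pairs that happens to have the same contents.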


Not sure if this is what you want:

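    import json
    import hashlib

    a = {'req_params': {'app': '12345', 'format': 'json'}, 'url_params': {'namespace': 'foo', 'id': 'baar'}, 'url_id': 'rest'}  # as above
    b = {'req_params': {'format': 'json', 'app': '12345'}, 'url_params': {'id': 'baar', 'namespace': 'foo'}, 'url_id': 'REST'.lower()}  # as above
    c = {'req_params': {'app': '12345', 'format': 'json'}, 'url_params': {'id':'baar', 'namespace': 'foo' }, 'url_id': 'rest'}
    d = {'url_params': {'id':'baar', 'namespace': 'foo' }, 'req_params': {'app': '12345', 'format': 'json'}, 'url_id': 'rest'}

    def digest(obj):
        # sort_keys=True writes keys in sorted order at every nesting level,
        # so equal dicts serialize identically; sha256 needs bytes on Python 3.
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode('utf-8')).hexdigest()

    ha = digest(a)
    hb = digest(b)
    hc = digest(c)
    hd = digest(d)

    assert ha == hb
    assert ha == hc
    assert ha == hd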

You can also hash the sorted string:

    >>> a = {'name':'Jon','class':'nine'}
    >>> b = {'class':'NINE'.lower(),'name':'Jon'}
    >>> def isdeq(d1,d2):
    ...     def dhash(d):
    ...         return hash(str({k:d[k] for k in sorted(d)}))
    ...     return dhash(d1)==dhash(d2)
    ...
    >>> isdeq(a,b)
    True
    >>> isdeq({'name':'Jon','class':'nine'},{'name':'jon','class':'nine'})
    False
    >>> isdeq({'name':'Jon','class':'nine'},{'class':'nine','name':'Jon'})
    True
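Note that sorting only the top-level keys leaves nested dictionaries in insertion order, so on Python 3.7+ the nested example from the question would still produce different strings. A possible recursive variant is sketched below. The names sort_nested and isdeq_nested are hypothetical, and string keys are assumed.

```python
def sort_nested(obj):
    """Return a copy of obj with every dict rebuilt in sorted key order."""
    if isinstance(obj, dict):
        return {k: sort_nested(obj[k]) for k in sorted(obj)}
    if isinstance(obj, list):
        return [sort_nested(v) for v in obj]
    return obj


def isdeq_nested(d1, d2):
    # Compare the strings directly; they match once key order is normalized.
    return str(sort_nested(d1)) == str(sort_nested(d2))


a = {'req_params': {'app': '12345', 'format': 'json'}, 'url_id': 'rest'}
b = {'url_id': 'rest', 'req_params': {'format': 'json', 'app': '12345'}}

# Top-level sorting alone, as in the snippet above, misses the nested dict.
top_level_only = (str({k: a[k] for k in sorted(a)})
                  == str({k: b[k] for k in sorted(b)}))
nested_equal = isdeq_nested(a, b)
```

The resulting string can then be passed to whichever hash function is chosen.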