Dictionary hashing?

For caching purposes, I need to generate a cache key from the GET arguments that are present in the dict.

I am currently using sha1(repr(sorted(my_dict.items()))) (sha1() here is a small helper that wraps hashlib), but I'm curious whether there is a better way.
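
Roughly, this is what I do now (a minimal sketch; the sha1() helper and the example GET arguments are just illustrative):

    import hashlib

    def sha1(s):
        # small helper that wraps hashlib, as mentioned above
        return hashlib.sha1(s.encode('utf-8')).hexdigest()

    my_dict = {'q': 'shoes', 'page': '2', 'sort': 'name'}  # example GET arguments
    cache_key = sha1(repr(sorted(my_dict.items())))
    print(cache_key)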

+130
python dictionary hash
May 04 '11 at 13:19
11 answers

If your dictionary is not nested, you can make a frozenset of the dict's items and use hash():

 hash(frozenset(my_dict.items())) 

This is much less computationally demanding than generating a JSON string or a repr() of the dictionary.
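
A quick check of the idea (a sketch; the resulting integer is per-process because of hash randomization, so don't persist it):

    d1 = {'a': 1, 'b': 2}
    d2 = {'b': 2, 'a': 1}  # same contents, different insertion order

    # equal dicts produce equal frozensets, so the hashes match within one run
    print(hash(frozenset(d1.items())) == hash(frozenset(d2.items())))  # True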

UPDATE: Please see the comments below for why this approach will not give a stable result.

+97
May 4 '11 at 1:24 pm

Using sorted(d.items()) is not enough to get a stable representation. Some of the values in d may themselves be dictionaries, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:

 json.dumps(d, sort_keys=True) 

However, if the hashes need to be stable across different machines or Python versions, I'm not sure this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself against any changes to the defaults. I would appreciate comments on this.
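
Put together with hashlib, a sketch of what I mean (the explicit separators and ensure_ascii values are my own choice, pinned so a change in the json.dumps defaults cannot silently alter the keys):

    import hashlib
    import json

    def dict_cache_key(d):
        payload = json.dumps(d, sort_keys=True, separators=(',', ':'), ensure_ascii=True)
        return hashlib.sha1(payload.encode('utf-8')).hexdigest()

    # equal nested dicts give the same digest, regardless of key order
    print(dict_cache_key({'b': 2, 'a': {'y': 1, 'x': 0}}))
    print(dict_cache_key({'a': {'x': 0, 'y': 1}, 'b': 2}))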

+114
Feb 25 '14 at 2:29

EDIT: If all of your keys are strings, then before you continue reading this answer, please check out Jack O'Connor's much simpler (and faster) solution, which also works for hashing nested dictionaries.

Although an answer has been accepted, the question's title is "hashing a Python dictionary", and the answer is incomplete with respect to that title. (As for the body of the question, the answer is complete.)

Nested Dictionaries

Someone searching for a way to hash a dictionary may stumble upon this aptly titled question and leave unsatisfied if they are trying to hash nested dictionaries. The answer above will not work in that case, and you will have to implement some kind of recursive mechanism to derive the hash.

Here is one such mechanism:

    import copy

    def make_hash(o):
        """
        Makes a hash from a dictionary, list, tuple or set to any level, that contains
        only other hashable types (including any lists, tuples, sets, and dictionaries).
        """
        if isinstance(o, (set, tuple, list)):
            return tuple([make_hash(e) for e in o])
        elif not isinstance(o, dict):
            return hash(o)

        new_o = copy.deepcopy(o)
        for k, v in new_o.items():
            new_o[k] = make_hash(v)

        return hash(tuple(frozenset(sorted(new_o.items()))))
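
For example, within a single interpreter session (names are illustrative, and the actual integers depend on hash randomization, so they differ between runs):

    import copy

    request = {'user': 'alice', 'filters': {'color': 'red', 'sizes': [1, 2]}}
    print(make_hash(request))                 # some int, stable within this process
    print(make_hash(copy.deepcopy(request)))  # the same int for an equal structure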

Bonus: hashing objects and classes

The built-in hash() function works fine when you hash classes or instances. However, here is one problem I noticed with hashing objects:

    class Foo(object):
        pass

    foo = Foo()
    print(hash(foo))  # 1209812346789

    foo.a = 1
    print(hash(foo))  # 1209812346789

The hash is the same even after I have altered foo. That is because the identity of foo has not changed, so the hash stays the same. If you want foo to hash differently depending on its current state, the solution is to hash whatever actually changes. In this case, the __dict__ attribute:

    class Foo(object):
        pass

    foo = Foo()
    print(make_hash(foo.__dict__))  # 1209812346789

    foo.a = 1
    print(make_hash(foo.__dict__))  # -78956430974785

Alas, when you try to do the same with the class itself:

 print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy' 

The class __dict__ property is not a regular dictionary:

 print (type(Foo.__dict__)) # type <'dict_proxy'> 

Here is a mechanism similar to the previous one that handles classes appropriately:

    import copy

    DictProxyType = type(object.__dict__)

    def make_hash(o):
        """
        Makes a hash from a dictionary, list, tuple or set to any level, that contains
        only other hashable types (including any lists, tuples, sets, and dictionaries).

        In the case where other kinds of objects (like classes) need to be hashed,
        pass in a collection of object attributes that are pertinent.
        For example, a class can be hashed in this fashion:

            make_hash([cls.__dict__, cls.__name__])

        A function can be hashed like so:

            make_hash([fn.__dict__, fn.__code__])
        """
        if type(o) == DictProxyType:
            o2 = {}
            for k, v in o.items():
                if not k.startswith("__"):
                    o2[k] = v
            o = o2

        if isinstance(o, (set, tuple, list)):
            return tuple([make_hash(e) for e in o])
        elif not isinstance(o, dict):
            return hash(o)

        new_o = copy.deepcopy(o)
        for k, v in new_o.items():
            new_o[k] = make_hash(v)

        return hash(tuple(frozenset(sorted(new_o.items()))))

You can use this to return a hash tuple from as many elements as you need:

    print(make_hash(func.__code__))
    # -7666086133114527897

    print(make_hash([func.__code__, func.__dict__]))
    # (-7666086133114527897, 3527539)

    print(make_hash([func.__code__, func.__dict__, func.__name__]))
    # (-7666086133114527897, 3527539, -509551383349783210)

NOTE: all of the above code assumes Python 3.x. I have not tested it in earlier versions, although I assume make_hash() would work in, say, 2.7.2. As for the examples, I do know that in Python 2

 func.__code__ 

should be replaced by

 func.func_code 
+59
Jan 03 '12 at 15:05

Here is a clearer solution.

    def freeze(o):
        if isinstance(o, dict):
            return frozenset({k: freeze(v) for k, v in o.items()}.items())

        if isinstance(o, list):
            return tuple([freeze(v) for v in o])

        return o

    def make_hash(o):
        """
        makes a hash out of anything that contains only list, dict and hashable types
        including string and numeric types
        """
        return hash(freeze(o))
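
For example (again, the result is only stable within one interpreter session):

    a = {'x': [1, 2, 3], 'y': {'nested': True}}
    b = {'y': {'nested': True}, 'x': [1, 2, 3]}

    # equal structures freeze to equal values, so their hashes match
    print(make_hash(a) == make_hash(b))  # True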
+12
Feb 06 '14 at 21:13

The code below avoids the built-in Python hash() function, since it does not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() converts the object into nested tuples, and make_hash_sha256() additionally converts the repr() of that into a base64-encoded SHA256 hash.

    import hashlib
    import base64

    def make_hash_sha256(o):
        hasher = hashlib.sha256()
        hasher.update(repr(make_hashable(o)).encode())
        return base64.b64encode(hasher.digest()).decode()

    def make_hashable(o):
        if isinstance(o, (tuple, list)):
            return tuple((make_hashable(e) for e in o))

        if isinstance(o, dict):
            return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))

        if isinstance(o, (set, frozenset)):
            return tuple(sorted(make_hashable(e) for e in o))

        return o

    o = dict(x=1, b=2, c=[3, 4, 5], d={6, 7})
    print(make_hashable(o))
    # (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

    print(make_hash_sha256(o))
    # fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
+7
Feb 10 '17 at 5:09

Updated since 2013 ...

None of the answers above seem reliable to me. The reason is their use of items(). As far as I know, it returns items in a machine-dependent order.

How about this?

    import hashlib

    def dict_hash(the_dict, *ignore):
        if ignore:  # Sometimes you don't care about some items
            interesting = the_dict.copy()
            for item in ignore:
                if item in interesting:
                    interesting.pop(item)
            the_dict = interesting
        # encode() added so this also runs on Python 3, where sha1() expects bytes
        result = hashlib.sha1(
            ('%s' % sorted(the_dict.items())).encode('utf-8')
        ).hexdigest()
        return result
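
For example, ignoring a volatile parameter (the key names are just illustrative):

    params = {'q': 'shoes', 'page': 2, 'timestamp': 1614151234}

    # both calls hash {'page': 2, 'q': 'shoes'} and print the same digest
    print(dict_hash(params, 'timestamp'))
    print(dict_hash({'q': 'shoes', 'page': 2}, 'timestamp'))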
+5
Mar 04 '13 at 18:10

To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer this quick-and-dirty solution:

    from pprint import pformat

    h = hash(pformat(dictionary))

It works even for types such as DateTime, and much more that is not JSON serializable.
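
For instance (a sketch; again, the hash is only stable within one run):

    from datetime import datetime
    from pprint import pformat

    dictionary = {'when': datetime(2015, 1, 30), 'params': {'page': 2}}
    h = hash(pformat(dictionary))  # works although datetime is not JSON serializable
    print(h)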

+4
Jan 30 '15 at 0:45

You could use the third-party frozendict module to freeze your dict and make it hashable.

    from frozendict import frozendict

    my_dict = frozendict(my_dict)

To process nested objects you can use:

    import collections.abc

    def make_hashable(x):
        if isinstance(x, collections.abc.Hashable):
            return x
        elif isinstance(x, collections.abc.Sequence):
            return tuple(make_hashable(xi) for xi in x)
        elif isinstance(x, collections.abc.Set):
            return frozenset(make_hashable(xi) for xi in x)
        elif isinstance(x, collections.abc.Mapping):
            return frozendict({k: make_hashable(v) for k, v in x.items()})
        else:
            raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
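
A quick sketch of how it behaves (assuming the frozendict package is installed; frozendict instances are hashable, which is the whole point of using them here):

    nested = {'a': [1, 2], 'b': {'c': {3, 4}}}

    frozen = make_hashable(nested)
    print(hash(frozen))  # the whole structure is now hashable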

If you want to support more types, use functools.singledispatch (Python 3.7+):

    import collections.abc
    import functools

    from frozendict import frozendict

    @functools.singledispatch
    def make_hashable(x):
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

    @make_hashable.register
    def _(x: collections.abc.Hashable):
        return x

    @make_hashable.register
    def _(x: collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)

    @make_hashable.register
    def _(x: collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)

    @make_hashable.register
    def _(x: collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})

    # add your own types here
+3
Oct 28 '18 at 18:54

The general approach is good, but you might reconsider the choice of hash.

SHA was designed for cryptographic strength (speed too, but strength comes first), which you may not need here. Therefore, using the built-in hash() function is probably a better idea, unless security is a concern.

0
May 04 '11 at 13:24

You can use the maps library for this, in particular maps.FrozenMap:

    import maps

    fm = maps.FrozenMap(my_dict)
    hash(fm)

To install maps, simply do:

 pip install maps 



It also handles the nested dict case:

    import maps

    fm = maps.FrozenMap.recurse(my_dict)
    hash(fm)



Disclaimer: I am the author of the maps library.

0
Nov 08 '18 at 19:08

I do it like this:

 hash(str(my_dict)) 
-8
Dec 26 '14 at 23:53


