I am writing a display class that is saved to disk. Currently I only allow str keys, but it would be nice if I could use a couple more types: I hope that anything is hashed (i.e. the same requirements as the built-in dict ), but more reasonable I would accepted string, unicode, int and tuples of these types.
To this end, I would like to get a deterministic serialization scheme.
Option 1 - Key Etching
The first thought was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle did not match:
>>> import pickle >>> import cPickle >>> def dumps(x): ... print repr(pickle.dumps(x)) ... print repr(cPickle.dumps(x)) ... >>> dumps(1) 'I1\n.' 'I1\n.' >>> dumps('hello') "S'hello'\np0\n." "S'hello'\np1\n." >>> dumps((1, 2, 'hello')) "(I1\nI2\nS'hello'\np0\ntp1\n." "(I1\nI2\nS'hello'\np1\ntp2\n."
Is there a pickle implementation / protocol combination that is deterministic for some set of types (for example, can only use cPickle with protocol 0)?
Option 2 - Replica and ast.literal_eval
Another option is to use repr to reset and ast.literal_eval to load. I wrote a function to determine if a given key can survive in this process (it is quite conservative in the types that it allows):
def is_reprable_key(key): return type(key) in (int, str, unicode) or (type(key) == tuple and all( is_reprable_key(x) for x in key))
The question for this method is that repr itself is deterministic for the types that I have allowed here. I believe that this will not stand the barrier of version 2/3 due to changing str / unicode literals. This also will not work for integers, where 2**32 - 1 < x < 2**64 jumps between 32 and 64-bit platforms. Are there any other conditions (i.e., Are strings serialized differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions that violate this, and not necessarily overcome them.
Option 3: Custom Version
Another option, which is probably redundant, is to write my own repr , which smooths out what I know (or suspect, maybe) as a problem. I just wrote an example here: http://gist.github.com/423945
(If all this fails, I can save the key hash along with pickling both the key and value, and then iterate over the lines having the corresponding hash that the one that scatters to the expected key looks for, but it really complicates a few other things, and I would rather not do it. Edit: it turns out that the built-in hash not deterministic on all platforms. Scratch this.)
Any ideas?