Serialization of deterministic keys

I am writing a display class that is saved to disk. Currently I only allow str keys, but it would be nice if I could use a couple more types: I hope that anything is hashed (i.e. the same requirements as the built-in dict ), but more reasonable I would accepted string, unicode, int and tuples of these types.

To this end, I would like to get a deterministic serialization scheme.

Option 1 - Key Etching

The first thought was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle did not match:

 >>> import pickle >>> import cPickle >>> def dumps(x): ... print repr(pickle.dumps(x)) ... print repr(cPickle.dumps(x)) ... >>> dumps(1) 'I1\n.' 'I1\n.' >>> dumps('hello') "S'hello'\np0\n." "S'hello'\np1\n." >>> dumps((1, 2, 'hello')) "(I1\nI2\nS'hello'\np0\ntp1\n." "(I1\nI2\nS'hello'\np1\ntp2\n." 

Is there a pickle implementation / protocol combination that is deterministic for some set of types (for example, can only use cPickle with protocol 0)?

Option 2 - Replica and ast.literal_eval

Another option is to use repr to reset and ast.literal_eval to load. I wrote a function to determine if a given key can survive in this process (it is quite conservative in the types that it allows):

 def is_reprable_key(key): return type(key) in (int, str, unicode) or (type(key) == tuple and all( is_reprable_key(x) for x in key)) 

The question for this method is that repr itself is deterministic for the types that I have allowed here. I believe that this will not stand the barrier of version 2/3 due to changing str / unicode literals. This also will not work for integers, where 2**32 - 1 < x < 2**64 jumps between 32 and 64-bit platforms. Are there any other conditions (i.e., Are strings serialized differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions that violate this, and not necessarily overcome them.

Option 3: Custom Version

Another option, which is probably redundant, is to write my own repr , which smooths out what I know (or suspect, maybe) as a problem. I just wrote an example here: http://gist.github.com/423945

(If all this fails, I can save the key hash along with pickling both the key and value, and then iterate over the lines having the corresponding hash that the one that scatters to the expected key looks for, but it really complicates a few other things, and I would rather not do it. Edit: it turns out that the built-in hash not deterministic on all platforms. Scratch this.)

Any ideas?

+4
source share
4 answers

After reading most of the source (CPython 2.6.5) to implement the representation for the main types, I came to the conclusion (with sufficient confidence) that the repr these types is actually deterministic. But honestly, this was expected.

I believe that the repr method is subject to almost all of the same cases in which the marshal method will be destroyed ( long > 2 ** 32 can never be int on a 32-bit machine, it is not guaranteed that they will not change between versions or translators and etc.).

My solution so far has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values ​​on the different platforms that I use.

Ultimately, the custom replication function will smooth out all the differences in the platform / implementation, but it will certainly overwhelm this project. However, I can do this in the future.

+2
source

Important Note: repr() not deterministic if the dictionary or collection type is embedded in the object you are trying to serialize. The keys can be typed in any order.

For example, print repr({'a':1, 'b':2}) can be printed as {'a':1, 'b':2} or {'b':2, 'a':1} , depending on how Python decides to manage the keys in the dictionary.

+3
source

"Any value that is an acceptable key for the built-in dict" is not possible: such values ​​include arbitrary instances of classes that do not define __hash__ or comparisons, implicitly using their id for hash and comparison purposes, and the id will not be the same even when running one the program itself (unless these runs are absolutely identical in all respects, which is very difficult to organize - identical inputs, identical start times, absolutely identical environments, etc., etc.).

For strings, unicode, ints, and tuples whose elements are of all these types (including nested tuples), the marshal module can help (within the same version of Python: the marshaling code can and does change in different versions). For instance:.

 >>> marshal.dumps(23) 'i\x17\x00\x00\x00' >>> marshal.dumps('23') 't\x02\x00\x00\x0023' >>> marshal.dumps(u'23') 'u\x02\x00\x00\x0023' >>> marshal.dumps((23,)) '(\x01\x00\x00\x00i\x17\x00\x00\x00' 

This is Python 2; Python 3 will be similar (except that all representations of these byte strings will have leading b , but the problem with cosmetics and, of course, u'23' becomes invalid syntax, and '23' becomes Unicode string). You can see the general idea: the leading byte represents a type, for example, u for Unicode strings, i for integers, ( for tuples; then for containers comes (like a little-endian integer) the number of elements, followed by the elements themselves, and the integers are serialized in the form of little-endian. marshal intended for portability on different platforms (for this version, and not for different versions), since it was used as basic serialization in compiled bytecode files ( .pyc or .pyo ).

+1
source

You mentioned a few requirements in the paragraph, and I think you might want to understand them a little more clearly. While I'm going to:

  • You create the SQLite backend mainly for the dictionary.
  • You want the keys to be larger than the basestring type (which types).
  • You want it to withstand the Python 2 -> Python 3 barrier.
  • You want to support large integers above 2 ** 32 as a key.
  • The ability to store infinite values ​​(because you don't want hash collisions).

So, are you trying to create a general solution “it can do everything” or just trying to solve the immediate problem to continue in the current project? You need to spend a little more time to come up with a clear set of requirements.

Using a hash seemed like the best solution for me, but then you complain that you will have multiple lines with the same hash, implying that you are going to store enough values ​​to even worry about the hash.

0
source

Source: https://habr.com/ru/post/1311736/


All Articles