How can I override the JSONEncoder behavior for binary data?

Question


I work in Python 2.7.10 and I have some binary data:

binary_data = b'\x01\x03\x00\x00 \xe6\x10\x00\x00\x01\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

(If you're really interested, Extended WKB geometry.)

Actually, I have this data somewhere inside a dict:

    my_data = {
        'something1': 5.5,
        'something2': u'Some info',
        'something3': b'\x01\x03\x00\x00 \xe6\x10\x00\x00\x01\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
    }

I want to serialize this to JSON to save it. The problem is that I get an error, because json tries to interpret the binary value as UTF-8 text:

    >>> json.dumps(my_data)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "C:\Python\27\Lib\json\__init__.py", line 243, in dumps
        return _default_encoder.encode(obj)
      File "C:\Python\27\Lib\json\encoder.py", line 207, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "C:\Python\27\Lib\json\encoder.py", line 270, in iterencode
        return _iterencode(o, 0)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 5: invalid continuation byte

I could encode it manually:

    my_serializable_data = dict(my_data.items())
    my_serializable_data['something3'] = binascii.b2a_base64(my_serializable_data['something3'])
    json.dumps(my_serializable_data)

This gives a pleasant result:

 '{"something2": "Some info", "something3": "AQMAACDmEAAAAQAAAAUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADwPwAAAAAAAPA/AAAAAAAA8D8AAAAAAADwPwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==\\n", "something1": 5.5}'

But that would be cumbersome, as I would need to repeat this throughout the application. I would rather configure how json handles binary data. You usually tell json how to serialize something by overriding JSONEncoder.default as follows:

    class MyJsonEncoder(json.JSONEncoder):
        def default(self, o):
            if isinstance(o, str):
                return binascii.b2a_base64(o)
            return super(MyJsonEncoder, self).default(o)

But this does not work, apparently because str handling is hardcoded in JSONEncoder:

    >>> json.dumps(my_data, cls=MyJsonEncoder)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "C:\Python\27\Lib\json\__init__.py", line 250, in dumps
        sort_keys=sort_keys, **kw).encode(obj)
      File "C:\Python\27\Lib\json\encoder.py", line 207, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "C:\Python\27\Lib\json\encoder.py", line 270, in iterencode
        return _iterencode(o, 0)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 5: invalid continuation byte
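The reason default never sees the str value can be illustrated with a small sketch. It assumes a hypothetical RecordingEncoder (not part of the question) that logs every object passed to default. The hook is only consulted for types the encoder cannot serialize natively, so text and numbers never reach it. The sketch is written to run on both Python 2.7 and Python 3.

```python
import json


class RecordingEncoder(json.JSONEncoder):
    """Hypothetical encoder that records which objects reach default()."""

    def __init__(self, *args, **kwargs):
        super(RecordingEncoder, self).__init__(*args, **kwargs)
        self.seen = []

    def default(self, o):
        # Only called for objects the encoder cannot serialize natively.
        self.seen.append(type(o).__name__)
        if isinstance(o, (set, frozenset)):
            return sorted(o)
        return super(RecordingEncoder, self).default(o)


encoder = RecordingEncoder(sort_keys=True)
text = encoder.encode({'text': u'Some info', 'number': 5.5, 'tags': set([2, 1])})
print(text)
print(encoder.seen)
```

Only the set is routed through default here; the text and the float are handled by built-in code paths, which is why a str check inside default has no effect on Python 2.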

Overriding JSONEncoder.encode would work, but I would need to reimplement significant logic from the built-in library, since this method is what knows how to drill down into arbitrary levels and combinations of lists and dicts. I would rather not do this; it would get messy fast and be error prone. (Also, looking at the source code, it appears that some of this logic lives in module-level functions in json, which makes the idea even messier.)

It is important to note that deserializing the data for later consumption is not a concern in this situation. This is for logging purposes; when the data is deserialized, it will be viewed by a developer. If they really need to do something with the data, they can simply decode it manually. I am also willing to accept the compromise that text represented as str rather than unicode will still get base64 encoded. (Alternatively, I could revise my code to base64-encode a str only if it contains non-printable or non-ASCII characters, but I cannot even make that decision until I solve the problem I am asking about here.)
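The alternative mentioned in parentheses could be sketched as a small predicate. The name is_printable_ascii and the choice to allow tab, carriage return, and line feed are assumptions made for illustration, not something stated in the question.

```python
def is_printable_ascii(data):
    """Return True if every byte is printable ASCII, a tab, CR, or LF."""
    allowed = (9, 10, 13)
    # bytearray yields integers on both Python 2 and Python 3.
    return all(32 <= b < 127 or b in allowed for b in bytearray(data))


print(is_printable_ascii(b'Some info'))
print(is_printable_ascii(b'\x01\x03\x00\x00 \xe6\x10'))
```

A byte string that fails this check would be the one to base64-encode; anything that passes could be logged as plain text.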

So, how can I override this behavior without reimplementing too much of JSONEncoder?

json python python-2.7

jpmc26 Aug 7 '15 at 4:07


1 answer

metatoaster · Accepted Answer · 2015-08-07T06:10:10+0000

You really do not need to reimplement everything. A cheap solution is to do what you suggested and override encode, but have it build a new dict with sanitized data before handing it to the default implementation.
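A minimal sketch of that cheap approach, assuming a hypothetical helper named sanitize that is not part of the answer: it walks dicts, lists, and tuples recursively and returns a copy in which byte strings that are not valid UTF-8 are replaced by base64 text. It uses base64.b64encode, which, unlike binascii.b2a_base64, does not append a trailing newline. The sketch is written to run on both Python 2.7 and Python 3.

```python
import base64
import json


def sanitize(obj):
    """Return a copy of obj in which byte strings are JSON-safe text."""
    if isinstance(obj, bytes):  # bytes is an alias of str on Python 2
        try:
            return obj.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: treat it as binary and base64-encode it.
            return base64.b64encode(obj).decode('ascii')
    if isinstance(obj, dict):
        return dict((key, sanitize(value)) for key, value in obj.items())
    if isinstance(obj, (list, tuple)):
        return [sanitize(item) for item in obj]
    return obj


nested = {'some': {'nested': {'data': [b'\xe0\x01\x02\x03?']}}, 'label': u'Some info'}
clean = sanitize(nested)
print(json.dumps(clean, sort_keys=True))
```

Calling sanitize inside an overridden encode (or right before json.dumps) keeps the standard encoder untouched, at the cost of copying the structure once per call.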

However, if you want the flexibility to process binary data in arbitrary input without having to override everything, you can opt to monkey patch a couple of functions in the json.encoder module. A controlled way to do this is through a custom encoder, which ensures that the default behavior is otherwise untouched.

    import binascii
    import json
    import json.encoder

    # Keep references to the stock encoders so they can be restored.
    _default_encode_basestring = json.encoder.encode_basestring
    _default_encode_basestring_ascii = json.encoder.encode_basestring_ascii

    def _check_string(s):
        # True unless s is a byte string (str) that is not valid UTF-8.
        if isinstance(s, str):
            try:
                s.decode('utf8')
            except UnicodeDecodeError:
                return False
        return True

    def _encode_basestring(s):
        if not _check_string(s):
            s = binascii.b2a_base64(s)
        return _default_encode_basestring(s)

    def _encode_basestring_ascii(s):
        if not _check_string(s):
            s = binascii.b2a_base64(s)
        return _default_encode_basestring_ascii(s)

    class MyJsonEncoder(json.JSONEncoder):
        def encode(self, o):
            # Patch the module-level string encoders only for this call.
            json.encoder.encode_basestring = _encode_basestring
            json.encoder.encode_basestring_ascii = _encode_basestring_ascii
            try:
                return super(MyJsonEncoder, self).encode(o)
            finally:
                # Restore the defaults even if encoding raises.
                json.encoder.encode_basestring = _default_encode_basestring
                json.encoder.encode_basestring_ascii = _default_encode_basestring_ascii
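The save-and-restore step in the encoder above could also be factored into a small context manager. This is a sketch under assumptions: the name patched is hypothetical, and the stand-in replacement (repr) is only there to show that the original attribute comes back even when the body raises.

```python
import contextlib
import json.encoder


@contextlib.contextmanager
def patched(module, name, replacement):
    """Temporarily set module.name to replacement, restoring it on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield original
    finally:
        # Runs even if the body raises, so the patch never leaks.
        setattr(module, name, original)


before = json.encoder.encode_basestring_ascii
try:
    with patched(json.encoder, 'encode_basestring_ascii', repr):
        inside = json.encoder.encode_basestring_ascii
        raise ValueError('simulated failure during encoding')
except ValueError:
    pass
after = json.encoder.encode_basestring_ascii
```

Note that the patch is process-wide while it is active, so code in other threads that encodes JSON at the same moment would see the replacement too.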

A simple example:

    >>> my_data = {
    ...     'something1': 5.5,
    ...     'something2': u'Some info',
    ...     'something3': b'\x01\x03\x00\x00 ...\x00\x00',
    ... }
    >>> import json
    >>> r = json.dumps(my_data, cls=MyJsonEncoder)
    >>> print r
    {"something2": "Some info", "something3": "AQMAACDm...AAAA==\n", "something1": 5.5}
    >>> r = json.dumps(my_data)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/json/__init__.py", line 243, in dumps
        return _default_encoder.encode(obj)
      File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
        return _iterencode(o, 0)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 5: invalid continuation byte

A nested test:

    >>> json.dumps({'some': {'nested': {'data': [b'\xe0\x01\x02\x03?']}}}, cls=MyJsonEncoder)
    '{"some": {"nested": {"data": ["4AECAz8=\\n"]}}}'
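For completeness, reading such a value back by hand is short. The sketch below assumes the developer already knows which field holds base64 data, as the question describes; it reuses the JSON text from the nested test and recovers the original bytes with binascii.a2b_base64, which ignores the trailing newline.

```python
import binascii
import json

encoded = '{"some": {"nested": {"data": ["4AECAz8=\\n"]}}}'

# The JSON string value still carries the newline added by b2a_base64.
value = json.loads(encoded)['some']['nested']['data'][0]
raw = binascii.a2b_base64(value)
print(repr(raw))
```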
