Fast JSON serialization (and comparison with Pickle) for cluster computing in Python?

I have a set of data points, each described by a dictionary. The processing of each data point is independent, and I submit each one as a separate task to the cluster. Each data point has a unique name, and my cluster submission wrapper simply calls a script that takes a data point's name and a file describing all the data points. The script then accesses the data point from the file and performs the computation.

Since each task has to load the set of all points only to retrieve the single point it is supposed to process, I would like to optimize this step by serializing the file describing the set of points into an easily retrievable format.

I tried jsonpickle, using the following function to serialize the dictionary describing all the data points to a file:

import simplejson

def json_serialize(obj, filename, use_jsonpickle=True):
    f = open(filename, 'w')
    if use_jsonpickle:
        import jsonpickle
        json_obj = jsonpickle.encode(obj)
        f.write(json_obj)
    else:
        simplejson.dump(obj, f, indent=1)
    f.close()
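For completeness, a matching loader might look roughly like this (a minimal sketch; the json_deserialize counterpart is assumed for illustration, it is not part of my original code):

def json_deserialize(filename, use_jsonpickle=True):
    # read back what json_serialize() wrote
    f = open(filename)
    try:
        if use_jsonpickle:
            import jsonpickle
            return jsonpickle.decode(f.read())
        else:
            return simplejson.load(f)
    finally:
        f.close()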

The dictionary contains very simple objects (lists, strings, floats, etc.) and has a total of 54,000 keys. The JSON file is ~20 megabytes in size.

It takes ~20 seconds to load this file into memory, which seems very slow to me. I switched to pickle with the exact same object and found that it generates a file about 7.8 megabytes in size that can be loaded in ~1-2 seconds. That is a significant improvement, but it still seems like loading such a small object (fewer than 100,000 entries) should be faster. Aside from that, pickle is not human-readable, which was the big advantage of JSON for me.
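A minimal sketch of how the two load times can be compared, using json_deserialize from above (the file names 'points.json' and 'points.pkl' are placeholders for files written earlier with each serializer):

import time
import cPickle

def timed(label, loader):
    start = time.time()
    obj = loader()
    print('%s loaded in %.2f s' % (label, time.time() - start))
    return obj

d_json = timed('json', lambda: json_deserialize('points.json', use_jsonpickle=False))
d_pick = timed('pickle', lambda: cPickle.load(open('points.pkl', 'rb')))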

Is there a way to use JSON and get similar or better speed-ups? If not, do you have other ideas on structuring this?

( "" , , script, ? , ).

Thanks.


marshal is fastest, but pickle per se is not -- maybe you mean cPickle (which is pretty fast, especially with the -1 protocol). So, apart from the readability issue, here is some code (saved as pik.py for the timings below) to show the various possibilities:

import pickle
import cPickle
import marshal
import json

def maked(N=5400):
  d = {}
  for x in range(N):
    k = 'key%d' % x
    v = [x] * 5
    d[k] = v
  return d
d = maked()

def marsh():
  return marshal.dumps(d)

def pick():
  return pickle.dumps(d)

def pick1():
  return pickle.dumps(d, -1)

def cpick():
  return cPickle.dumps(d)

def cpick1():
  return cPickle.dumps(d, -1)

def jso():
  return json.dumps(d)

def rep():
  return repr(d)

Here are the timings on my machine:

$ py26 -mtimeit -s'import pik' 'pik.marsh()'
1000 loops, best of 3: 1.56 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick()'
10 loops, best of 3: 173 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick1()'
10 loops, best of 3: 241 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick()'
10 loops, best of 3: 21.8 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick1()'
100 loops, best of 3: 10 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.jso()'
10 loops, best of 3: 138 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.rep()'
100 loops, best of 3: 13.1 msec per loop

So: you can have readability and ten times the speed of json.dumps by using repr (you do sacrifice the ease of parsing from Javascript and other languages); you can have the absolute maximum speed with marshal, almost 90 times faster than json; cPickle offers way more generality (in terms of what you can serialize) than either json or marshal, but if you are never going to use that generality, you might as well go for marshal (or repr, if human readability is crucial).
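For example, the repr route stays human-readable and can be parsed back safely with ast.literal_eval (a minimal sketch, assuming the dict holds only built-in literals, as it does here):

import ast

def repr_dump(obj, filename):
    f = open(filename, 'w')
    f.write(repr(obj))        # human-readable Python literal
    f.close()

def repr_load(filename):
    f = open(filename)
    try:
        # literal_eval parses literals (dicts, lists, strings,
        # numbers) without the security risks of eval()
        return ast.literal_eval(f.read())
    finally:
        f.close()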

"", ( ) - , , "".


I think you are facing a trade-off here: human readability comes at the cost of performance and larger file size. Thus, of all the serialization methods available in Python, JSON is not only the most readable but also the slowest.

If I had to pursue performance (and file compactness), I would go for marshal. You can either marshal the whole set with dump() and load() (see the sketch below), or, building on your idea of slicing things up, marshal separate parts of the data set into separate files. This way you open the door to parallelizing the data processing, if you feel so inclined.
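A minimal sketch of both variants (file and directory names are placeholders; keep in mind that marshal only handles core built-in types):

import marshal
import os

def marshal_dump(obj, filename):
    f = open(filename, 'wb')
    marshal.dump(obj, f)
    f.close()

def marshal_load(filename):
    f = open(filename, 'rb')
    try:
        return marshal.load(f)
    finally:
        f.close()

# the "slicing" variant: one marshal file per data point
def marshal_slices(d, dirname):
    for name, point in d.iteritems():
        marshal_dump(point, os.path.join(dirname, name + '.marshal'))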

Of course, there are all kinds of restrictions and warnings about marshal in the documentation, so if you decide to play it safe, go for pickle.

