I have a set of data points, each described by a dictionary. Processing each point is independent of the others, and I submit each one as a separate task to the cluster. Each data point has a unique name, and my cluster view wrapper simply calls a script that takes a data point name and a file describing all the data points; the script then looks up that point in the file and runs the calculations.
Since every task has to load the full set of points just to extract the single point it is supposed to process, I would like to speed this step up by serializing the set of points in a format that is fast to load.
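Concretely, each task currently does something along these lines just to pull out its one point (the function and file names here are illustrative, not my real code):

import simplejson

def load_single_point(points_filename, point_name):
    # Every task re-reads and decodes the *entire* file just to get one entry.
    with open(points_filename) as f:
        all_points = simplejson.load(f)
    return all_points[point_name]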
I tried jsonpickle, using the following function to serialize the dictionary describing all the data points to a file:
def json_serialize(obj, filename, use_jsonpickle=True):
    f = open(filename, 'w')
    if use_jsonpickle:
        import jsonpickle
        json_obj = jsonpickle.encode(obj)
        f.write(json_obj)
    else:
        import simplejson
        simplejson.dump(obj, f, indent=1)
    f.close()
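The loader I'm timing is essentially the mirror image of this (not my exact code, but equivalent):

def json_load(filename, use_jsonpickle=True):
    f = open(filename)
    if use_jsonpickle:
        import jsonpickle
        obj = jsonpickle.decode(f.read())
    else:
        import simplejson
        obj = simplejson.load(f)
    f.close()
    return obj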
The dictionary contains only very simple objects (lists, strings, floats, etc.) and has a total of 54,000 keys. The resulting JSON file is ~20 MB.
It takes ~20 seconds to load this file into memory, which seems very slow to me. I switched to pickle with the exact same object and found that it produces a file of about 7.8 MB that loads in ~1-2 seconds. That is a significant improvement, but loading such a small object (fewer than 100,000 entries) still feels like it should be faster. Also, pickle is not human readable, which was the big advantage of JSON for me.
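The comparison I'm describing amounts to simple wall-clock measurements like this (the file names are placeholders for my ~20 MB JSON and ~7.8 MB pickle files):

import time
import pickle
import simplejson

def timed_load(label, fn):
    start = time.time()
    obj = fn()
    print('%s loaded in %.2f s' % (label, time.time() - start))
    return obj

points_from_json = timed_load('simplejson', lambda: simplejson.load(open('points.json')))
points_from_pickle = timed_load('pickle', lambda: pickle.load(open('points.pickle', 'rb')))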
Is there a way to make loading the JSON comparably fast, or is there another format that is both quick to load and human readable?
(One option would be to simply "slice" the set of points, writing each data point to its own file and passing that file to the script, but I'd rather not end up with tens of thousands of tiny files if I can avoid it.)
Thanks.