Python: Compressing a series of JSON objects while maintaining sequential reads?

I have a bunch of JSON objects that I need to compress, since they consume too much disk space: several million of them take up approximately 20 GB.

Ideally, what I would like to do is compress each one individually, and then, when I need to read them, iteratively load and decompress each one. I tried to do this by creating a text file in which each line is a JSON object compressed via zlib, but this fails with

a "decompress error due to a truncated stream",

which I believe is caused by the compressed data itself containing newline bytes, so the lines get cut short.
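For reference, a minimal sketch of the failing approach described above (file name and sample records are made up): each object is compressed separately with zlib and written as a "line", but the compressed bytes may themselves contain 0x0A, so reading back by line can truncate a record.

    import json
    import zlib

    records = [{"id": i, "payload": "x" * 100} for i in range(3)]

    with open("objects.bin", "wb") as f:
        for obj in records:
            blob = zlib.compress(json.dumps(obj).encode("utf-8"))
            f.write(blob + b"\n")  # the compressed blob may itself contain newline bytes

    with open("objects.bin", "rb") as f:
        for line in f:  # splitting on newlines can therefore cut a record in half
            # decompressing a cut-off blob raises the truncated-stream error
            obj = json.loads(zlib.decompress(line.rstrip(b"\n")))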

Does anyone know a good way to do this?

2 answers

Just use the gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line and read them line by line.

The object transparently handles the compression, and buffers reads, decompressing chunks as needed.

    import gzip
    import json

    # writing
    with gzip.GzipFile(jsonfilename, 'w') as outfile:
        for obj in objects:
            # GzipFile works with bytes, so encode each JSON line before writing
            outfile.write((json.dumps(obj) + '\n').encode('utf-8'))

    # reading
    with gzip.GzipFile(jsonfilename, 'r') as infile:
        for line in infile:
            obj = json.loads(line)
            # process obj

This has the added benefit that the compression algorithm can exploit the repetition across objects to improve the compression ratio.
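To illustrate that point, here is a small self-contained comparison (the sample records are made up): compressing each record separately versus compressing them all in one shared stream.

    import gzip
    import json
    import zlib

    # Hypothetical sample records with lots of inter-object repetition
    objs = [{"user": "alice", "action": "click", "ts": i} for i in range(10000)]
    lines = [json.dumps(o).encode('utf-8') + b'\n' for o in objs]

    per_object = sum(len(zlib.compress(line)) for line in lines)  # each record compressed alone
    one_stream = len(gzip.compress(b''.join(lines)))              # one shared compression stream

    print(per_object, one_stream)  # the shared stream is typically much smaller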


You might want to try an incremental JSON parser like jsaone.

That is, create a single JSON document containing all your objects and parse it like this:

    import gzip
    import jsaone

    with gzip.GzipFile(file_path, 'r') as f_in:
        for key, val in jsaone.load(f_in):
            ...

This is very similar to Martin's answer, using a little more space, but perhaps a little more convenient.
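For the writing side, a minimal sketch of producing that single compressed JSON document could look like the following (the file name and top-level layout are illustrative assumptions, not part of jsaone):

    import gzip
    import json

    # One top-level JSON object mapping keys to records (illustrative layout)
    objects = {"rec-1": {"a": 1}, "rec-2": {"a": 2}}

    # gzip.open in text mode handles the encoding for json.dump
    with gzip.open("all_objects.json.gz", "wt", encoding="utf-8") as f_out:
        json.dump(objects, f_out)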

EDIT: Oh, by the way, it is probably fair to disclose that I wrote jsaone.

