How do I load one large Python dictionary encoded as JSON without killing my memory usage?

I've seen a lot of questions similar to this one, but nothing that really matched; most of the others seemed to be about speed. What I'm experiencing is that a single JSON dictionary sitting in a 1.1 GB file on my local disk takes up all 16 GB of my memory when I try to load it with anything along the lines of:

 f = open(some_file, "rb")
 new_dictionary = json.load(f)

This happens regardless of which JSON library I use (I've tried ujson, json, and yajl) and regardless of whether I read the file as a byte stream or not. It makes absolutely no sense to me. What's with the crazy memory usage, and how do I get around it?

In case it helps, the dictionary is just a bunch of nested dictionaries, all holding ints that point to other ints. A sample looks like this:

 {"0":{"3":82,"4":503,"15":456},"956":{"56":823,"678":50673,"35":1232}...} 

UPDATE: When I load this with simplejson, it actually only takes 8 GB. No idea why that's so much less than with the rest.

UPDATE 2: So I did some more digging. I loaded my dictionary with simplejson and tried converting all the keys to ints (following liori's suggestion that strings might take up more space). The memory stayed unchanged at 8 GB. Then I tried Winston Ewert's suggestion of running gc.collect(). Still 8 GB. Finally, annoyed and curious, I pickled my new data structure, quit Python, and reloaded it. Lo and behold, it still takes 8 GB. I guess Python just needs that much room for a big 2D dictionary. Disappointing, for sure, but at least now I know it's not a JSON problem as long as I load it with simplejson.
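For what it's worth, the key conversion was along these lines (a rough sketch of the idea rather than my exact code):

    # Convert every string key to an int, in case the strings were the
    # memory hog (they weren't; usage stayed at 8 GB either way).
    new_dictionary = {
        int(outer_key): {int(inner_key): value
                         for inner_key, value in inner.items()}
        for outer_key, inner in new_dictionary.items()
    }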

+7
4 answers
Gabe pretty much figured this out in a comment, but since it's been a few months and he hasn't posted it as an answer, I figured I should just answer my own question so posterity can see that there is an answer.

In any case, the answer is that a 2D dictionary simply takes up that much space in Python. Each of those little dictionaries carries its own overhead, and since there are so many of them, the data balloons from 1.1 GB on disk to 8 GB in memory. There's nothing you can do about it other than use a different data structure or get more RAM.
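If you want to see where the space goes, sys.getsizeof on one of the little inner dictionaries makes the overhead obvious (just an illustration, not part of the original measurement):

    import sys

    # One small inner dict shaped like the ones in the question.
    inner = {"3": 82, "4": 503, "15": 456}

    # getsizeof counts only the dict object itself, not the keys and
    # values it references, and it's already a couple of hundred bytes.
    print(sys.getsizeof(inner))

    # Millions of these, plus the int and str objects they point to,
    # is how ~1.1 GB of JSON text becomes ~8 GB of live Python objects.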

0

You can try using the streaming API:

http://lloyd.github.com/yajl/

for which there are a couple of Python wrappers (the sketch below shows the general idea):

https://github.com/rtyler/py-yajl/

https://github.com/pykler/yajl-py
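
As a rough illustration of the streaming style, here is a sketch using ijson, a different streaming parser with a similar event-based API; the file name and handle_row() are placeholders:

    import ijson

    with open("some_file", "rb") as f:
        # kvitems("") walks the top-level object and yields one
        # (outer_key, inner_dict) pair at a time, so the full 1.1 GB
        # structure is never held in memory at once.
        for outer_key, inner_dict in ijson.kvitems(f, ""):
            handle_row(outer_key, inner_dict)  # your processing goes here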

+3

Experimentation on my part suggests that calling gc.collect() after the JSON object has been parsed drops memory usage back down to where it was when the object was originally constructed.

Here are the results I got for memory usage on a smaller scale:

 Build. No GC                          762912
 Build. GC                             763000
 Standard Json. Unicode Keys. No GC    885216
 Standard Json. Unicode Keys. GC       744552
 Standard Json. Int Keys. No GC        885216
 Standard Json. Int Keys. GC           744724
 Simple Json. Unicode Keys. No GC      894352
 Simple Json. Unicode Keys. GC         745520
 Simple Json. Int Keys. No GC          894352
 Simple Json. Int Keys. GC             744884

Basically, running gc.collect() cleans up the garbage produced during the JSON parsing process.
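In code that just means something like this (a sketch using the same json.load call as the question):

    import gc
    import json

    with open("some_file", "rb") as f:
        new_dictionary = json.load(f)

    # Free the temporary objects the parser left behind; this accounts
    # for the difference between the "No GC" and "GC" rows above.
    gc.collect()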

+2

I can't believe I'm about to say this, but JSON is actually a very simple format; it wouldn't be too difficult to build your own parser.

However, this only really makes sense if:

  • You don't need the complete dictionary at the end (i.e. you can consume the data as you read it)
  • You have a good idea of what the data structure looks like (an arbitrarily deep dictionary would make this much harder — see the rough sketch below).
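
For the exact shape shown in the question, a hand-rolled incremental reader could be as small as the sketch below. It assumes digit-string keys and flat inner dicts of ints, and leans on json.JSONDecoder.raw_decode for the inner objects; iter_outer_items is just an illustrative name, not a real library function.

    import json
    import re

    def iter_outer_items(path):
        """Yield (outer_key, inner_dict) pairs one at a time.

        Sketch only: assumes one top-level object whose keys are digit
        strings and whose values are flat dicts of ints, exactly like
        the example in the question.
        """
        decoder = json.JSONDecoder()
        outer_key_re = re.compile(r'"(\d+)"\s*:\s*\{')
        with open(path) as f:
            text = f.read()  # the raw 1.1 GB of text, but never the 8 GB dict

        pos = text.index("{") + 1  # step inside the top-level object
        while True:
            match = outer_key_re.search(text, pos)
            if match is None:
                break
            # Decode just one inner dict, then keep scanning after it.
            inner, pos = decoder.raw_decode(text, match.end() - 1)
            yield match.group(1), inner

    # Usage: consume each row as it arrives instead of keeping them all.
    for outer_key, inner in iter_outer_items("some_file"):
        pass  # do something with (outer_key, inner) here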
0
