Efficient way to parse a 100 MB JSON payload

I have a cron job on my Amazon EC2 micro instance that runs every 12 hours. It downloads a 118 MB file and parses it with the json library. This, of course, causes the instance to run out of memory. The instance has 416 MB of free memory, but when I run the script it drops to 6 MB and then the OS kills the process.

What are my options? Is it possible to parse this efficiently with Ruby, or do I need to drop down to something low-level like C? I could get a beefier Amazon instance, but I really want to know whether this can be done in Ruby.

UPDATE: I looked at yajl-ruby. It can yield JSON objects as it parses, but the problem is that if your JSON file contains only one root object, it is forced to parse the ENTIRE file. My JSON looks like this:

    Root
     |- Obj 1
     |- Obj 2
     |- Obj 3

So if I do:

    parser.parse(file) do |hash|
      # do something here
    end

Since I have only one root object, it will parse ALL the JSON. If Obj 1/2/3 were root objects it would work, because it would hand them to me one by one, but my JSON is not like that, so it parses everything and eats 500 MB of memory...

UPDATE #2: Here's a smaller (7 MB) version of the large 118 MB file:

(link gone)

It parses; I didn't just grab a few bytes from the file, so you can see it as a whole. The array I'm looking for is:

 events = json['resultsPage']['results']['event'] 
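To be concrete, this is roughly the naive version that blows up; a minimal sketch, assuming the standard json library (the file name is illustrative):

    require 'json'

    # Reads all 118 MB into a string, then materializes the whole
    # document as Ruby objects before I can pick out the array I need.
    json = JSON.parse(File.read('payload.json'))
    events = json['resultsPage']['results']['event']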

thanks

json ruby amazon-web-services
2 answers

YAJL implements a streaming parser. You can use it to read your JSON on the fly, work with the contents as they arrive, and then discard them (and the data structures generated from them) once you are done with them. If you are clever about this, it will keep you under your memory constraints.
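A minimal sketch of that streaming pattern, assuming the yajl-ruby gem (handle_object is a hypothetical callback):

    require 'yajl'

    parser = Yajl::Parser.new
    # Fires once per complete top-level JSON value. Note that with a
    # single root object, as in this question, it fires only once and
    # hands you everything; see the edit below.
    parser.on_parse_complete = lambda do |obj|
      handle_object(obj)  # hypothetical callback
    end

    File.open('payload.json') do |f|
      # Feed the parser in small chunks so the raw bytes never sit in
      # memory all at once.
      parser << f.read(8192) until f.eof?
    end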

Edit: with your data, what you are really interested in is pulling pieces out of the JSON one at a time rather than parsing the whole object. That is much trickier, and it effectively requires you to implement your own parser. The nuts and bolts of what you want are to:

  • Step into the events array
  • For each event in the array, parse that event
  • Pass the parsed event to some callback function
  • Discard the parsed event and its source text to free memory for the next event

This will not work out of the box with yajl, since you are dealing with a single root object here, not multiple objects. To make it work with yajl, you would need to parse the JSON manually to find the boundaries of each event object, then pass each event fragment to the JSON parser for deserialization. Something like Ragel could make that process easier. A rough sketch of the idea follows.
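Here is a minimal sketch of that boundary-scanning idea. It assumes the structure shown in the question (event objects open at brace depth 4), and it ignores the fact that '{' and '}' can appear inside JSON strings, which a real implementation would have to handle:

    require 'yajl'

    # Scan for event objects by tracking brace depth, parse each one on
    # its own, and let it be garbage collected before the next one.
    def each_event(io)
      buf = String.new
      depth = 0
      in_event = false
      io.each_char do |ch|
        if ch == '{'
          depth += 1
          in_event = true if depth == 4  # assumed depth of an event object
        end
        buf << ch if in_event
        if ch == '}'
          depth -= 1
          if in_event && depth == 3      # the event object just closed
            yield Yajl::Parser.parse(buf)
            buf = String.new
            in_event = false
          end
        end
      end
    end

    File.open('payload.json') do |f|
      each_event(f) { |event| puts event.keys }  # do something per event
    end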

Of course, it would be easier to just upgrade the AWS instance.


Something like yaji can parse the JSON as a stream.
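For example, a minimal sketch assuming the yaji gem; I'm going from memory of its README, so treat the path syntax as unverified:

    require 'yaji'

    parser = YAJI::Parser.new(File.open('payload.json'))
    # Yields each element of the event array as it is parsed, without
    # building the surrounding 118 MB document in memory.
    parser.each('/resultsPage/results/event/') do |event|
      # process one event, then let it be garbage collected
    end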

