Scan HUGE JSON file for deserialized data in Scala

I need to be able to process large JSON files by creating objects from deserialized substrings as we iterate over / stream in the file.

For example:

Say I can only deserialize to the following case class:

case class Data(val a: Int, val b: Int, val c: Int) 

and expected JSON format:

    {
      "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ],
      "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ],
      .... MANY ITEMS ....,
      "qux": [ {"a": 0, "b": 0, "c": 0 } ]
    }

What I would like to do:

    import com.codahale.jerkson.Json

    // NOTE: this will not compile since I pulled "advanceToValue" out of thin air.
    val dataSeq: Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)

As a last note, I would prefer a solution that uses Jerkson or another library that ships with the Play framework, but if a different Scala library handles this task with greater ease and decent performance, I am not against trying it. If there is a clean way to manually search through the file and then hand off to a JSON library to continue parsing from there, I'm fine with that too.

What I don't want to do is slurp in the entire file without streaming or an iterator, since keeping the whole file in memory at once would be far too expensive.
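To make the desired shape concrete, here is a rough standard-library-only sketch of what an `advanceToValue`-plus-`stream` combination could look like for this exact `Data` shape. Everything here is hypothetical (`Scan`, `stream`, and the regex are my own names, not part of Jerkson or Play); it assumes flat objects containing exactly the fields `a`, `b`, `c` in that order and no braces or brackets inside strings, so it illustrates the API shape rather than being a real parser:

```scala
import java.io.{Reader, StringReader}

case class Data(a: Int, b: Int, c: Int)

object Scan {
  // Naive per-object pattern; assumes flat objects with exactly the
  // fields a, b, c in that order.
  private val Obj =
    """"a"\s*:\s*(-?\d+)\s*,\s*"b"\s*:\s*(-?\d+)\s*,\s*"c"\s*:\s*(-?\d+)""".r

  def stream(in: Reader, key: String): Iterator[Data] = {
    val target = "\"" + key + "\""
    val window = new StringBuilder
    var c = in.read()
    // Advance past `"key"`, holding only a small sliding window in memory.
    while (c != -1 && !window.toString.endsWith(target)) {
      window += c.toChar
      if (window.length > target.length) window.deleteCharAt(0)
      c = in.read()
    }
    new Iterator[Data] {
      private var pending: Option[Data] = advance()

      // Read up to the next '}' and regex-parse that one object; stop at
      // the ']' that closes the target array.
      private def advance(): Option[Data] = {
        val buf = new StringBuilder
        while (c != -1) {
          val ch = c.toChar
          c = in.read()
          if (ch == ']') return None // end of the target array
          buf += ch
          if (ch == '}') {
            Obj.findFirstMatchIn(buf.toString) match {
              case Some(m) =>
                return Some(Data(m.group(1).toInt, m.group(2).toInt, m.group(3).toInt))
              case None => buf.setLength(0)
            }
          }
        }
        None // key never found, or input ended mid-array
      }

      def hasNext: Boolean = pending.isDefined
      def next(): Data = { val d = pending.get; pending = advance(); d }
    }
  }
}
```

Because the iterator pulls one object's worth of characters at a time from the `Reader`, memory use stays bounded regardless of file size; a real solution would swap the regex for a proper JSON tokenizer.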

2 answers

Here is my solution to the problem:

    import collection.immutable.PagedSeq
    import util.parsing.input.PagedSeqReader
    import com.codahale.jerkson.Json
    import collection.mutable

    private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))

    private val clearAndStop = ']'

    // Consume characters until the accumulated string ends with `text`;
    // fail (None) if we hit the end of input or overrun into a ']' first.
    private def takeUntil(readerInitial: PagedSeqReader, text: String): Taken = {
      val str = new StringBuilder()
      var readerFinal = readerInitial
      while (!readerFinal.atEnd && !str.endsWith(text)) {
        str += readerFinal.first
        readerFinal = readerFinal.rest
      }
      if (!str.endsWith(text) || str.contains(clearAndStop)) Taken(readerFinal, None)
      else Taken(readerFinal, Some(str.toString))
    }

    // Chain several single-delimiter scans, keeping only the last result.
    private def takeUntil(readerInitial: PagedSeqReader, chars: Char*): Taken = {
      var taken = Taken(readerInitial, None)
      chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))
      taken
    }

    def getJsonData(): Seq[Data] = {
      val data = mutable.ListBuffer[Data]()
      var taken = takeUntil(fileContent, "\"foo\"")
      taken = takeUntil(taken.reader, ':', '[')
      var doneFirst = false
      while (taken.text != None) {
        if (!doneFirst) doneFirst = true
        else taken = takeUntil(taken.reader, ',')
        taken = takeUntil(taken.reader, '}')
        if (taken.text != None) {
          print(taken.text.get)
          data += Json.parse[Data](taken.text.get)
        }
      }
      data
    }

    case class Taken(reader: PagedSeqReader, text: Option[String])
    case class Data(a: Int, b: Int, c: Int)

Caveats: this code does not handle invalid JSON very cleanly, and supporting several top-level keys ("foo", "bar" and "qux") would require scanning ahead (or matching against a list of possible keys). But overall, I consider that it does the job. It is not as functional as I would like and not super robust, but PagedSeqReader definitely keeps it from getting too messy.


I have not done this with JSON (and I hope someone comes up with a turnkey solution for you), but I have done it with XML, and here is one way to handle it.

It is basically a simple map/reduce process using a streaming parser.

Map (your advanceTo)

Use a streaming parser, for example JSON Simple (I have not tested it). As the parser hits your "path" in its callbacks, collect everything below it by writing it out to a stream (file-backed or in-memory, depending on your data). In your example, that would be your foo array. If your mapper is complex enough, you may need to collect several paths during the map step.

Reduce (your stream[Data])

Since the streams you collected above should be fairly small, you probably do not need to map/split them again; you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc.).
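As a standard-library-only sketch of that map step over the asker's JSON (`MapStep` and `collectArray` are hypothetical names of mine, and the bracket counting deliberately ignores brackets inside JSON strings): scan forward to the key, copy the bracketed array that follows into a small buffer, and hand that buffer to an in-memory parse for the reduce step.

```scala
import java.io.{Reader, StringReader}

object MapStep {
  // Map step: find `"key"` in the character stream, then copy the array
  // value that follows it. Returns None if the key or a balanced array
  // is never found. Not a real parser: brackets inside strings break it.
  def collectArray(in: Reader, key: String): Option[String] = {
    val target = "\"" + key + "\""
    val window = new StringBuilder
    var c = in.read()
    // Advance until we have just consumed `"key"` (sliding window only,
    // so memory stays bounded).
    while (c != -1 && !window.toString.endsWith(target)) {
      window += c.toChar
      if (window.length > target.length) window.deleteCharAt(0)
      c = in.read()
    }
    // Skip ahead to the '[' that opens the array value.
    while (c != -1 && c.toChar != '[') c = in.read()
    if (c == -1) return None
    // Copy characters until the brackets balance again.
    val out = new StringBuilder
    var depth = 0
    var open = true
    while (open && c != -1) {
      val ch = c.toChar
      if (ch == '[') depth += 1
      if (ch == ']') depth -= 1
      out += ch
      if (depth == 0) open = false else c = in.read()
    }
    if (depth == 0) Some(out.toString) else None
  }
}
```

The returned string is small relative to the file, so the reduce step can simply be a normal in-memory parse of it (e.g. with Jerkson, or whatever parser you prefer).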

