The JSON object spans multiple lines. How to split input in Hadoop?

I need to ingest large JSON files whose records can span multiple lines (not multiple files); it depends entirely on how the data provider writes them.

Elephant-Bird assumes LZO compression, which, as far as I know, the data provider will not use.

The Dzone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record sits on a single line.

Any ideas, other than compressing/compacting the JSON file, on how to properly split the file so that the JSON doesn't break would be hugely appreciated.

Edit: lines, not files

+6
2 answers

Barring any other suggestions, and depending on how your JSON is formatted, you may have an option.

The problem noted in the Dzone article is that JSON has no end-of-record element that you can easily locate when you jump to a split point.

Now, if your JSON input is "pretty" printed or otherwise consistently formatted, you can take advantage of that in a custom input format implementation.

For example, taking a JSON sample from the Dzone example:

{ "results" : [ { "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000", "from_user" : "grep_alex", "text" : "RT @kevinweil: After a lot of hard work by ..." }, { "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000", "from_user" : "grep_alex", "text" : "@miguno pull request has been merged, thanks again!" } ] } 

With this format you know (hopefully?) that each new record starts on a line with 6 spaces and an opening brace. A record ends in a similar way: 6 spaces and a closing brace.

So your logic in this case would be: consume lines until you find one with 6 spaces and an opening brace, buffer the contents from there until you hit 6 spaces and a closing brace, then use whatever JSON deserializer you like to turn that into a Java object (or just pass the multi-line text to your mapper).
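A minimal sketch of that buffering logic in plain Java, outside the Hadoop API, just to illustrate the idea. The class and method names are made up, and it assumes each record opens and closes on a line that begins with exactly 6 spaces and a brace, as in the sample above:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper illustrating the line-buffering approach described above.
public class PrettyJsonRecordScanner {

    private static final String RECORD_OPEN  = "      {";
    private static final String RECORD_CLOSE = "      }";

    // Returns the next complete multi-line JSON record, or null at end of input.
    public static String nextRecord(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null && !line.startsWith(RECORD_OPEN)) {
            // skip lines that are not the start of a record
        }
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder(line).append('\n');
        // Buffer lines until the record-closing line.
        while ((line = in.readLine()) != null) {
            record.append(line).append('\n');
            if (line.startsWith(RECORD_CLOSE)) {
                return record.toString();
            }
        }
        return null; // incomplete record at end of input
    }

    public static void main(String[] args) throws IOException {
        String sample =
              "{\n"
            + "  \"results\" : [\n"
            + "      {\n"
            + "        \"from_user\" : \"grep_alex\",\n"
            + "        \"text\" : \"first tweet\"\n"
            + "      },\n"
            + "      {\n"
            + "        \"from_user\" : \"grep_alex\",\n"
            + "        \"text\" : \"second tweet\"\n"
            + "      }\n"
            + "  ]\n"
            + "}\n";
        BufferedReader in = new BufferedReader(new StringReader(sample));
        for (String record; (record = nextRecord(in)) != null; ) {
            System.out.println("--- record ---\n" + record);
        }
    }
}

In a real job this loop would live inside a custom RecordReader so that Hadoop drives the reading, with extra care for records that straddle a split boundary; the second answer below goes down that route.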

+1

The best way to split and parse multi-line JSON data would be to extend the NLineInputFormat class and define your own notion of what an InputSplit is. [For example: 1,000 JSON records could make up one split.]

Then you would need to extend the LineRecordReader class and define your own notion of what constitutes one line [in this case, one record].

This way you get well-defined splits, each containing "N" JSON records, which can then be read using that same LineRecordReader, with each of your map tasks receiving one record to process at a time.
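A rough sketch of how those two pieces might fit together, assuming the input is a sequence of top-level JSON objects with one object per logical record. The class names MultiLineJsonInputFormat and MultiLineJsonRecordReader are invented for illustration, split sizing is left to NLineInputFormat's defaults, and the naive brace counting ignores braces that appear inside string values:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Illustrative skeleton only.
public class MultiLineJsonInputFormat extends NLineInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MultiLineJsonRecordReader();
    }

    // Wraps a LineRecordReader and glues physical lines together until the
    // braces balance, i.e. one complete top-level JSON object has been read.
    public static class MultiLineJsonRecordReader extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lineReader = new LineRecordReader();
        private LongWritable key;
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder record = new StringBuilder();
            int depth = 0;
            boolean inRecord = false;
            while (lineReader.nextKeyValue()) {
                String line = lineReader.getCurrentValue().toString();
                if (!inRecord) {
                    // Remember the byte offset of the first line of the record.
                    key = new LongWritable(lineReader.getCurrentKey().get());
                }
                record.append(line).append('\n');
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c == '{') { depth++; inRecord = true; }
                    else if (c == '}') { depth--; }
                }
                if (inRecord && depth == 0) {
                    value.set(record.toString());
                    return true;
                }
            }
            return false; // no further complete record in this split
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
        @Override public void close() throws IOException { lineReader.close(); }
    }
}

A record that straddles a split boundary still needs the careful handling described in the answer linked below.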

Charles Menguy's answer to "How does Hadoop process records split across block boundaries?" explains the nuances of this approach very well.

For an example of such an NLineInputFormat extension, check out http://hadooped.blogspot.com/2013/09/nlineinputformat-in-java-mapreduce-use.html

A similar multi-line CSV InputFormat for Hadoop can be found here: https://github.com/mvallebr/CSVInputFormat

Update: I found a suitable multi-line JSON input format for Hadoop here: https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java

0
