I have sorted through the Wikipedia data a bit myself. I am particularly interested in extracting the equations, so I only care about part of the file.
First, if you are going to work with Wikimedia data, it's much easier to get a Labs account. It takes about a day, and it lets you run most of the code on their machines, avoiding the need to download several gigabytes. With a Labs account you can also run code against a fairly up-to-date replica of the database, avoiding the need for the dump files entirely.
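As a rough illustration (my sketch, not part of the original setup), querying such a replica from Python might look like the following; the host name, database name, and credentials file follow the usual Labs conventions but are assumptions here:

import os
import pymysql

# Connect to the replicated enwiki database. Host and database names
# are assumptions based on the standard Labs setup; a Labs account
# comes with a credentials file for the replicas.
conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/.my.cnf'),
)
with conn.cursor() as cur:
    # For example: ten article titles from the main namespace.
    cur.execute(
        "SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 10"
    )
    for (page_title,) in cur.fetchall():
        print(page_title.decode('utf-8'))
conn.close()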
I am using a simple Python program to parse the data. It basically runs a few regular expressions over each line: one to find lines containing <title>...</title>, so I know which Wikipedia article I am in, and a few more to find the namespace and the math tags. It can process a 160 MB file in 13 seconds, so it gets through the full 36 GB in under an hour.
The code just writes out text files with the data that interests me. In case you are interested, here is the code:
import sys
import re

# -d on the command line dumps each <title> line as it is encountered
dump = len(sys.argv) > 1 and sys.argv[1] == '-d'

titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('</?math(.*?)>')
pageEndRE = re.compile('</page>')  # unused in this version

# leftover counters, unused in this version
supOc = 0
supCc = 0
subOc = 0
subCc = 0

title = ""
attr = ""
ns = -1
inEqn = 0
expression = ""
start = 0

for line in sys.stdin:
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        if dump:
            print(line)
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('<math'):
            # opening tag: note its attributes and where the TeX starts
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '</math>':
            # closing tag: print title, attributes and the unescaped TeX
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            print(title, '\t', attr, '\t',
                  expression.lstrip()
                  .replace('&lt;', '<')
                  .replace('&gt;', '>')
                  .replace('&amp;', '&'))
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    if start > 0:
        # the equation runs past the end of this line
        expression = line[start:].rstrip()
    elif inEqn:
        # accumulate continuation lines of a multi-line equation
        expression = ' '.join([expression, line.rstrip()])
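Assuming the script is saved as extract_math.py and you have the compressed dump handy (both file names are my own, not from the original), you can stream the dump straight through it without unpacking it first: bzcat enwiki-latest-pages-articles.xml.bz2 | python extract_math.py > equations.tsv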
Sorry if it is a little cryptic, but it was never meant for public consumption. Sample output:
Arithmetic mean     a_1,\ldots,a_n.
Arithmetic mean     A
Arithmetic mean     A=\frac{1}{n}\sum_{i=1}^{n} a_i
Arithmetic mean     \bar{x}
Each line has the article title and a LaTeX equation. This reduces the data I need to work with down to a more manageable 500 thousand equations or so. I am not sure whether a strategy like this will work for your application.
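If it helps, here is a minimal sketch of reading the output back in; it assumes the equations.tsv file name from above and the three tab-separated columns the script prints (title, attributes, TeX):

import csv

equations = []
with open('equations.tsv', newline='') as f:
    # QUOTE_NONE so that quote characters inside TeX are left alone
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3:
            title, attr, tex = (field.strip() for field in row[:3])
            equations.append((title, tex))

print(len(equations), 'equations loaded')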
For the main enwiki data, the XML dumps are split into 27 smaller files of roughly equal size. A few reasonably sized files are easier to work with than either one giant file or millions of tiny ones. It would also be easy to split on the first letter of the article title, giving fewer than a hundred files, each under a gigabyte.
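For what it is worth, here is a minimal sketch of that first-letter split applied to the extracted equation file; the input and output file names are my own choices, not from the original:

import collections

buckets = collections.defaultdict(list)
with open('equations.tsv') as f:
    for line in f:
        # bucket by the first character of the article title
        first = line.lstrip()[:1].upper()
        key = first if first.isalpha() else 'other'
        buckets[key].append(line)

for key, lines in buckets.items():
    with open('equations-%s.tsv' % key, 'w') as out:
        out.writelines(lines)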