I have sorted through the Wikipedia data a bit myself. I am particularly interested in extracting the equations, so I only care about part of the file.
First, if you are going to work with Wikimedia data, it's much easier to get a Labs account. It takes about a day, and it lets you run most of the code on their machines, avoiding the need to download several gigabytes. With a Labs account you can also run code against a fairly up-to-date replica of the database, avoiding the need for the dump files entirely.
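As a rough illustration (my sketch, not part of the original setup), querying such a replica from Python might look like the following; the host name, database name, and credentials file follow the usual Labs conventions but are assumptions here:

import os
import pymysql

# Connect to the replicated enwiki database. Host and database names
# are assumptions based on the standard Labs setup; a Labs account
# comes with a credentials file for the replicas.
conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/.my.cnf'),
)
with conn.cursor() as cur:
    # For example: ten article titles from the main namespace.
    cur.execute(
        "SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 10"
    )
    for (page_title,) in cur.fetchall():
        print(page_title.decode('utf-8'))
conn.close()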
I am using a simple Python program to parse the data. It basically runs a few regular expressions over each line: one to find lines containing <title>...</title>, so I know which Wikipedia article I am in, and a few more to find the namespace and the math tags. It can process a 160 MB file in 13 seconds, so it gets through the full 36 GB in under an hour.
The code just writes out text files with the data that interests me. In case you are interested, here is the code:
import sys
import re

# -d on the command line dumps each <title> line as it is encountered
dump = len(sys.argv) > 1 and sys.argv[1] == '-d'

titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('</?math(.*?)>')
pageEndRE = re.compile('</page>')  # unused in this version

# leftover counters, unused in this version
supOc = 0
supCc = 0
subOc = 0
subCc = 0

title = ""
attr = ""
ns = -1
inEqn = 0
expression = ""
start = 0

for line in sys.stdin:
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        if dump:
            print(line)
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('<math'):
            # opening tag: note its attributes and where the TeX starts
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '</math>':
            # closing tag: print title, attributes and the unescaped TeX
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            print(title, '\t', attr, '\t',
                  expression.lstrip()
                  .replace('&lt;', '<')
                  .replace('&gt;', '>')
                  .replace('&amp;', '&'))
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    if start > 0:
        # the equation runs past the end of this line
        expression = line[start:].rstrip()
    elif inEqn:
        # accumulate continuation lines of a multi-line equation
        expression = ' '.join([expression, line.rstrip()])
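Assuming the script is saved as extract_math.py and you have the compressed dump handy (both file names are my own, not from the original), you can stream the dump straight through it without unpacking it first: bzcat enwiki-latest-pages-articles.xml.bz2 | python extract_math.py > equations.tsv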
Sorry if it is a little cryptic, but it was never meant for public consumption. Sample output:
Arithmetic mean     a_1,\ldots,a_n.
Arithmetic mean     A
Arithmetic mean     A=\frac{1}{n}\sum_{i=1}^{n} a_i
Arithmetic mean     \bar{x}
Each line has the article title and a LaTeX equation. This reduces the data I need to work with down to a more manageable 500 thousand equations or so. I am not sure whether a strategy like this will work for your application.
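If it helps, here is a minimal sketch of reading the output back in; it assumes the equations.tsv file name from above and the three tab-separated columns the script prints (title, attributes, TeX):

import csv

equations = []
with open('equations.tsv', newline='') as f:
    # QUOTE_NONE so that quote characters inside TeX are left alone
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3:
            title, attr, tex = (field.strip() for field in row[:3])
            equations.append((title, tex))

print(len(equations), 'equations loaded')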
For the main enwiki data, the XML dumps are split into 27 smaller files of roughly equal size. A few reasonably sized files are easier to work with than either one giant file or millions of tiny ones. It would also be easy to split on the first letter of the article title, giving fewer than a hundred files, each under a gigabyte.
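For what it is worth, here is a minimal sketch of that first-letter split applied to the extracted equation file; the input and output file names are my own choices, not from the original:

import collections

buckets = collections.defaultdict(list)
with open('equations.tsv') as f:
    for line in f:
        # bucket by the first character of the article title
        first = line.lstrip()[:1].upper()
        key = first if first.isalpha() else 'other'
        buckets[key].append(line)

for key, lines in buckets.items():
    with open('equations-%s.tsv' % key, 'w') as out:
        out.writelines(lines)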