How to split large Wikipedia dump .xml.bz2 files in Python?

I am trying to build a standalone wiktionary using the wiki dump files (.xml.bz2) in Python. I started with this article as a guide. It involves a number of languages, but I wanted to combine all the steps into a single Python project. I have found almost all the libraries needed for the process. The only hurdle now is to split the large .xml.bz2 file efficiently into a number of smaller files for quicker parsing during search operations.

I know that Python has a bz2 library, but it provides only compression and decompression operations. What I need is something like bz2recover does from the command line, which splits a large file into a number of smaller ones.

Another important point is that the splitting should not break up the contents of a page, which starts with <page> and ends with </page> in the compressed XML document.

Is there an existing library that can handle this, or does the code have to be written from scratch? (Any outline / pseudocode would be very helpful.)

Note: I want the resulting package to be cross-platform compatible, so I cannot use OS-specific commands.

+4
3 answers

Finally, I wrote the Python script myself:

import os
import bz2


def split_xml(filename):
    '''Takes the filename of a wiktionary .xml.bz2 dump as input and creates
    smaller chunks of it in the directory "chunks".'''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Open the first chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read the dump line by line
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # the </page> tag marks the end of a wiki page
        if '</page>' in line:
            pagecount += 1
            if pagecount > 1999:
                # print chunkname(filecount)  # for debugging
                chunkfile.close()
                pagecount = 0   # reset the page counter
                filecount += 1  # increment the chunk number
                chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    try:
        chunkfile.close()
    except:
        print 'Files already closed'


if __name__ == '__main__':  # when the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
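A quick way to sanity-check the output is to re-read one of the generated chunks and count its pages. This snippet is only an illustration, not part of the original answer; the path follows the naming scheme used in the script, and Python 2 is assumed to match the code above:

import bz2

# Python 2, to match the script above: count the pages in the first chunk;
# every chunk should hold at most 2000 pages.
pages = 0
for line in bz2.BZ2File('chunks/chunk-1.xml.bz2'):
    if '</page>' in line:
        pages += 1
print pages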
+12

Well, if you have a command-line tool that offers the functionality you need, you can always wrap it in a call using subprocess.
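For example, a minimal sketch of that idea, assuming a bzip2 binary is installed and on the PATH (which, as the question notes, would not be OS-independent):

import subprocess

# Python 2 sketch: let the external bzip2 tool do the decompression and
# stream its stdout into Python line by line. The dump path is the one
# from the accepted answer and is just a placeholder.
proc = subprocess.Popen(
    ['bzip2', '-dc', 'wiki-files/tawiktionary-20110518-pages-articles.xml.bz2'],
    stdout=subprocess.PIPE)
pages = 0
for line in proc.stdout:
    if '</page>' in line:
        pages += 1   # or write out chunks, as in the accepted answer
proc.stdout.close()
proc.wait()
print pages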

+1

The method you refer to is a pretty dirty hack :)

I wrote a standalone Wikipedia tool and simply SAX-parsed the dump in its entirety. It works well enough if you just pipe the uncompressed XML into stdin from a proper bzip2 decompressor, especially if it is only the Wiktionary.
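A rough sketch of that pipeline, with a toy SAX handler that only counts pages (the handler and script name are illustrative, not the answerer's actual tool):

import sys
import xml.sax


class PageCounter(xml.sax.ContentHandler):
    '''Toy handler that only counts <page> elements; a real tool would also
    collect titles and text.'''
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.pages = 0

    def startElement(self, name, attrs):
        if name == 'page':
            self.pages += 1


if __name__ == '__main__':
    # Usage (Python 2): bzip2 -dc tawiktionary-20110518-pages-articles.xml.bz2 | python sax_count.py
    handler = PageCounter()
    xml.sax.parse(sys.stdin, handler)
    print handler.pages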

As an easy way to test, I simply compressed each page individually, wrote them all into one large file, and saved the offset and length of each in cdb (a small key-value store). This may be a valid solution for you as well.
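A minimal sketch of that layout, assuming the pages are already available as (title, wikitext) pairs; the function names are made up here, and a plain dict stands in for the cdb index:

import bz2


def build_page_store(pages, datafile='pages.dat'):
    '''Append each page's bz2-compressed wikitext to one big data file and
    record (offset, length) per title. The answer above kept this index in
    cdb; a plain dict is used here only to keep the sketch short.'''
    index = {}
    with open(datafile, 'wb') as out:
        for title, text in pages:
            blob = bz2.compress(text)
            index[title] = (out.tell(), len(blob))
            out.write(blob)
    return index


def read_page(title, index, datafile='pages.dat'):
    '''Fetch a single page by seeking to its stored offset.'''
    offset, length = index[title]
    with open(datafile, 'rb') as f:
        f.seek(offset)
        return bz2.decompress(f.read(length))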

Keep in mind that the MediaWiki markup is the worst piece of sh*t I have come across in a long time. But in the case of the Wiktionary I could handle it.

0
