I am trying to create a standalone Wiktionary using dump files (.xml.bz2) in Python. I started with this article as a guide. The dump includes several languages, and I want to combine all the stages into one Python project. I have found almost all the libraries needed for the process. The only hurdle now is efficiently splitting a large .xml.bz2 file into a number of smaller files, so that search operations can analyze them faster.
I know that the bz2 library exists in Python, but it provides only compression and decompression operations. I need something like the command-line bz2recover, which splits a large file into several smaller files.
Another important point: the split must not break up the contents of a page, which starts with <page> and ends with </page> in the compressed XML document.
Is there an existing library that can handle this, or should the code be written from scratch? (Any outline / pseudo-code would be very useful; a rough sketch of the direction I am considering is below.)
Note: I want the resulting package to be cross-platform, so I cannot use OS-specific commands.
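
For concreteness, here is a minimal sketch of the streaming approach I have in mind, using only the standard-library bz2 module. The function name split_dump and the pages_per_chunk parameter are my own placeholders, and it assumes the usual MediaWiki export layout where </page> appears on its own line; the chunks are not valid standalone XML unless the <mediawiki> header and footer are re-added:

    import bz2

    def split_dump(dump_path, pages_per_chunk=10000, prefix="chunk"):
        """Stream a .xml.bz2 dump and write groups of complete
        <page>...</page> blocks to smaller .xml.bz2 files."""
        chunk_idx = 0
        pages_in_chunk = 0
        out = bz2.open(f"{prefix}_{chunk_idx:04d}.xml.bz2", "wt", encoding="utf-8")
        with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
            for line in dump:
                out.write(line)
                # In MediaWiki dumps </page> sits on its own line, so
                # seeing it means a complete page has just been written.
                if "</page>" in line:
                    pages_in_chunk += 1
                    if pages_in_chunk >= pages_per_chunk:
                        # Rotate output only at a page boundary, so no
                        # page is ever split across two chunk files.
                        out.close()
                        chunk_idx += 1
                        pages_in_chunk = 0
                        out = bz2.open(f"{prefix}_{chunk_idx:04d}.xml.bz2",
                                       "wt", encoding="utf-8")
        out.close()

This decompresses and re-compresses the whole dump once, which is slow but uses constant memory regardless of dump size. Is there a library that does this more efficiently, e.g. by splitting at bz2 block boundaries the way bz2recover does?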