Is there a parser / way to parse Wikipedia dump files using Python?

I have a project where I collect all the Wikipedia articles belonging to a particular category, pull the dump out of Wikipedia, and load it into our database.

So I have to parse the Wikipedia dump file to get at the content. Is there an efficient parser for this job? I am a Python developer, so I would prefer a parser in Python. If you can't suggest one, I'll try to write one in Python and put it on the Internet so that other people can use it, or at least try it out.

So all I want is a Python parser for Wikipedia dump files. I have started writing a manual parser that walks each node and extracts the content.
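For context, here is a minimal sketch of the node-by-node approach I started with, using only the standard library. The filename and category string are placeholders, and the {*} namespace wildcard needs Python 3.8 or newer:

    import xml.etree.ElementTree as ET

    def iter_pages(path):
        # Stream the dump so the whole file never has to fit in memory.
        for event, elem in ET.iterparse(path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki export namespace
            if tag == "page":
                title = elem.findtext("{*}title")
                text = elem.findtext("{*}revision/{*}text")
                yield title, text
                elem.clear()  # free the finished <page> subtree

    if __name__ == "__main__":
        for title, text in iter_pages("enwiki-pages-articles.xml"):  # placeholder path
            if text and "[[Category:Physics]]" in text:  # placeholder category filter
                print(title)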

+6
python xml parsing wikipedia wiki
5 answers
+3

I do not know about the licensing, but it is implemented in Python and includes the source.

+1

Another good module is mwlib from here - it is a pain to install with all its dependencies (at least on Windows), but it works well.
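For what it's worth, mwlib's quickstart boils down to something like the snippet below, which parses the wikitext markup of a single article. This is adapted from its documentation, and the exact module layout may differ between versions:

    from mwlib.uparser import simpleparse

    # Parses the markup, prints the resulting parse tree and returns it.
    simpleparse("== Heading ==\n* item 1\n* item 2\nsome [[Link|caption]] there\n")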

+1

Wiki Parser is a very fast parser for Wikipedia dump files (~2 hours to parse all 55 GB of English Wikipedia). It produces XML that preserves both the content and the structure of the articles.

You can then use Python to do whatever you want with the XML output.
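As a schema-agnostic starting point (I have not verified which element names Wiki Parser actually emits, and the path is a placeholder), you can stream the output and count tags to get a feel for its structure before writing the real extraction code:

    from collections import Counter
    import xml.etree.ElementTree as ET

    counts = Counter()
    for event, elem in ET.iterparse("wiki_parser_output.xml", events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # keep memory flat while streaming

    for tag, n in counts.most_common(10):
        print(tag, n)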

0

I would highly recommend mwxml. It is a utility for parsing Wikimedia dumps, written by Aaron Halfaker, a researcher at the Wikimedia Foundation. It can be installed with

pip install mwxml 

Usage is pretty intuitive, as this example from the documentation shows:

    >>> import mwxml

    >>> dump = mwxml.Dump.from_file(open("dump.xml"))

    >>> print(dump.site_info.name, dump.site_info.dbname)
    Wikipedia enwiki

    >>> for page in dump:
    ...     for revision in page:
    ...         print(revision.id)
    ...
    1
    2
    3

It is part of a larger set of data analysis utilities released by the Wikimedia Foundation and its community.
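Applied to the original question (collecting the articles of one category), a rough sketch could look like the following. The attribute names page.title and revision.text follow my reading of the mwxml documentation, and the file path and category string are placeholders:

    import bz2
    import mwxml

    # mwxml reads the dump as a stream, so a compressed dump can be fed directly.
    dump = mwxml.Dump.from_file(bz2.open("enwiki-pages-articles.xml.bz2", "rt"))
    for page in dump:
        for revision in page:
            if revision.text and "[[Category:Physics]]" in revision.text:
                print(page.title)
                # ...store (page.title, revision.text) in your database here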

0
