Is there a parser / way to parse Wikipedia dump files using Python?

I have a project where I collect all the Wikipedia articles belonging to a particular category, pull the dump out of Wikipedia, and load it into our database.

So I have to parse the Wikipedia dump file to get at the content. Is there an efficient parser for this job? I am a Python developer, so I would prefer a parser in Python. If you can't suggest one, I'll try to write one in Python and put it on the Internet so that other people can use it, or at least try it out.

So all I want is a Python parser for Wikipedia dump files. I have started writing a manual parser that walks each node and extracts the content.
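For context, here is a minimal sketch of the node-by-node approach I started with, using only the standard library. The filename and category string are placeholders, and the {*} namespace wildcard needs Python 3.8 or newer:

    import xml.etree.ElementTree as ET

    def iter_pages(path):
        # Stream the dump so the whole file never has to fit in memory.
        for event, elem in ET.iterparse(path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki export namespace
            if tag == "page":
                title = elem.findtext("{*}title")
                text = elem.findtext("{*}revision/{*}text")
                yield title, text
                elem.clear()  # free the finished <page> subtree

    if __name__ == "__main__":
        for title, text in iter_pages("enwiki-pages-articles.xml"):  # placeholder path
            if text and "[[Category:Physics]]" in text:  # placeholder category filter
                print(title)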

+6
python xml parsing wikipedia wiki
5 answers
+3

I do not know about the licensing, but it is implemented in Python and includes the source.

+1

Another good module is mwlib from here - it is a pain to install with all its dependencies (at least on Windows), but it works well.
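For what it's worth, mwlib's quickstart boils down to something like the snippet below, which parses the wikitext markup of a single article. This is adapted from its documentation, and the exact module layout may differ between versions:

    from mwlib.uparser import simpleparse

    # Parses the markup, prints the resulting parse tree and returns it.
    simpleparse("== Heading ==\n* item 1\n* item 2\nsome [[Link|caption]] there\n")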

+1

Wiki Parser is a very fast parser for Wikipedia dump files (~2 hours to parse all 55 GB of English Wikipedia). It produces XML that preserves both the content and the structure of the articles.

You can then use Python to do whatever you want with the XML output.
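As a schema-agnostic starting point (I have not verified which element names Wiki Parser actually emits, and the path is a placeholder), you can stream the output and count tags to get a feel for its structure before writing the real extraction code:

    from collections import Counter
    import xml.etree.ElementTree as ET

    counts = Counter()
    for event, elem in ET.iterparse("wiki_parser_output.xml", events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # keep memory flat while streaming

    for tag, n in counts.most_common(10):
        print(tag, n)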

0

I would highly recommend mwxml. It is a utility for parsing Wikimedia dumps, written by Aaron Halfaker, a researcher at the Wikimedia Foundation. It can be installed with

pip install mwxml 

Usage is pretty intuitive, as this example from the documentation shows:

    >>> import mwxml

    >>> dump = mwxml.Dump.from_file(open("dump.xml"))

    >>> print(dump.site_info.name, dump.site_info.dbname)
    Wikipedia enwiki

    >>> for page in dump:
    ...     for revision in page:
    ...         print(revision.id)
    ...
    1
    2
    3

It is part of a larger set of data analysis utilities released by the Wikimedia Foundation and its community.
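Applied to the original question (collecting the articles of one category), a rough sketch could look like the following. The attribute names page.title and revision.text follow my reading of the mwxml documentation, and the file path and category string are placeholders:

    import bz2
    import mwxml

    # mwxml reads the dump as a stream, so a compressed dump can be fed directly.
    dump = mwxml.Dump.from_file(bz2.open("enwiki-pages-articles.xml.bz2", "rt"))
    for page in dump:
        for revision in page:
            if revision.text and "[[Category:Physics]]" in revision.text:
                print(page.title)
                # ...store (page.title, revision.text) in your database here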

0
