Creating plain text from a Wikipedia database dump

I found a Python script (Wikipedia Extractor) that can generate plain text from an English Wikipedia database dump. When I run this command (as shown on the script's page):

$ python enwiki-latest-pages-articles.xml WikiExtractor.py -b 500K -o extracted 

I get this error:

  File "enwiki-latest-pages-articles.xml", line 1
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
    ^
SyntaxError: invalid syntax

I am running the script with Python 2.7.6 under Cygwin on Windows 7.

I hope someone who has already used this script, or who has experience with Python, can help me solve this error.

Thanks in advance!

+7
python database xml shell wikipedia
1 answer

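The first argument to python must be the name of the script. In your command the XML dump comes first, so Python tries to run it as source code and fails on line 1 with a SyntaxError.

You probably just need to swap the xml and py file names: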

 $ python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted 
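
To illustrate why the order matters, here is a minimal sketch of a helper that moves the script to the front of a mistaken argument list. It is purely hypothetical and not part of WikiExtractor: the name script_first and the rule "the script is the first argument ending in .py" are assumptions made for this example.

```python
def script_first(args):
    """Return args reordered so the first .py file leads the list.

    Hypothetical helper for illustration only; it is not part of
    WikiExtractor. It assumes the script is the first argument ending
    in ".py" and keeps the relative order of all other arguments.
    """
    for i, arg in enumerate(args):
        if arg.endswith(".py"):
            # Script goes first; everything else follows and would be
            # passed to the script as its own arguments.
            return [arg] + args[:i] + args[i + 1:]
    raise ValueError("no .py script found in arguments")


if __name__ == "__main__":
    wrong = ["enwiki-latest-pages-articles.xml", "WikiExtractor.py",
             "-b", "500K", "-o", "extracted"]
    print("python " + " ".join(script_first(wrong)))
```

Run on the argument list from the question, this prints the corrected command shown above: the script name first, then the dump file and the options that the script itself receives.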
+14