Getting a large number (but not all) Wikipedia pages

For the NLP part of my project, I want to download a large number of pages from Wikipedia (say 10,000). Without downloading the entire XML dump, this is what I can think of:

  1. Open a Wikipedia page.
  2. Parse the HTML for links and open each linked page.
  3. Recursively follow the links on the pages retrieved in step 2.

In steps 2 and 3, I will stop once I have reached the desired number of pages (a rough sketch of this crawl follows below).
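
A minimal sketch of this crawl, assuming we start from a single seed article, follow only /wiki/ links, and use a crude regex instead of a real HTML parser (the seed URL, the regex, and the 10,000-page limit are all illustrative, not part of the original plan):

 # Rough breadth-first crawl (illustrative sketch only).
 import re
 import urllib2
 from collections import deque

 TARGET = 10000
 seed = 'http://en.wikipedia.org/wiki/Natural_language_processing'  # assumed starting page

 opener = urllib2.build_opener()
 opener.addheaders = [('User-agent', 'Mozilla/5.0')]

 # Links containing ':' or '#' are skipped, which drops most Special:, File:, etc. pages.
 link_re = re.compile(r'href="(/wiki/[^":#]+)"')

 queue = deque([seed])
 seen = set([seed])
 pages = []

 while queue and len(pages) < TARGET:
     url = queue.popleft()
     try:
         html = opener.open(url).read()
     except urllib2.URLError:
         continue
     pages.append((url, html))
     # Steps 2 and 3: pull links out of the HTML and enqueue the ones we have not seen.
     for path in link_re.findall(html):
         full = 'http://en.wikipedia.org' + path
         if full not in seen:
             seen.add(full)
             queue.append(full)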

How would you do that? Please suggest the best ideas you can think of.

ANSWER: This is my Python code:

 # Get random pages from Wikipedia and save them as HTML files.
 import os
 import shutil
 import urllib2

 # Recreate the directory that stores the HTML pages.
 if os.path.exists('randompages'):
     print "Deleting the old randompages directory"
     shutil.rmtree('randompages')
 os.mkdir('randompages')
 print "Created the directory for storing the pages"

 num_page = raw_input('Number of pages to retrieve: ')

 # Build one opener with a browser-like User-Agent and reuse it.
 opener = urllib2.build_opener()
 opener.addheaders = [('User-agent', 'Mozilla/5.0')]

 for i in range(int(num_page)):
     infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
     page = infile.read()
     # Write the raw HTML to a file.
     # TODO: strip the HTML markup from the page.
     f = open('randompages/file' + str(i) + '.html', 'w')
     f.write(page)
     f.close()
     print "Retrieved and saved page", i + 1
+4
6 answers
 for i = 1 to 10000
     get "http://en.wikipedia.org/wiki/Special:Random"
+23

Wikipedia has an API. Using this API, you can get random articles in a given namespace:

 http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5 

and for each title you retrieve, you can also fetch its wiki text:

 http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content 
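
A sketch of combining those two calls from Python, assuming the JSON output format (format=json); the response layout shown here (random titles under query.random, page content under the revisions '*' key) matches the older API responses this answer refers to:

 # Sketch: pull random titles, then fetch the wiki text for each one.
 import json
 import urllib
 import urllib2

 API = 'http://en.wikipedia.org/w/api.php'

 def api_get(params):
     url = API + '?' + urllib.urlencode(params)
     return json.load(urllib2.urlopen(url))

 # 1. Get a batch of random article titles from the main namespace.
 data = api_get({'action': 'query', 'list': 'random',
                 'rnnamespace': 0, 'rnlimit': 5, 'format': 'json'})
 titles = [r['title'] for r in data['query']['random']]

 # 2. Fetch the wiki text of each title.
 for title in titles:
     data = api_get({'action': 'query', 'prop': 'revisions',
                     'titles': title.encode('utf-8'),
                     'rvprop': 'content', 'format': 'json'})
     for page in data['query']['pages'].values():
         wikitext = page['revisions'][0]['*']
         print title.encode('utf-8'), len(wikitext)
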
+20

I would go the other way around: start with an XML dump, and then throw away what you don't want.

In your case, since you are doing natural language processing, I would guess you are interested in pages with full sentences, not in pages that are mostly lists of links. If you crawl links the way you describe, you will hit a lot of those link pages.

And why avoid the XML dump, when XML parsing tools would make the selection process easier for you?
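
A sketch of that approach, streaming the standard pages-articles dump with ElementTree so the whole file never sits in memory (the dump filename and the 10,000-page cutoff are assumptions):

 # Sketch: stream a pages-articles XML dump and keep the first N article texts.
 import xml.etree.cElementTree as etree

 def strip_ns(tag):
     # Dump elements carry a namespace such as {http://www.mediawiki.org/xml/export-0.x/}page
     return tag.rsplit('}', 1)[-1]

 WANTED = 10000
 kept = 0
 for event, elem in etree.iterparse('enwiki-latest-pages-articles.xml', events=('end',)):
     if strip_ns(elem.tag) != 'page':
         continue
     title, ns, text = None, None, None
     for child in elem.iter():
         tag = strip_ns(child.tag)
         if tag == 'title':
             title = child.text
         elif tag == 'ns':
             ns = child.text
         elif tag == 'text':
             text = child.text
     if text and ns == '0':      # namespace 0 = ordinary articles (recent dump schemas include <ns>)
         kept += 1
         # ... hand (title, text) to the NLP pipeline here ...
     elem.clear()                # keep memory use bounded
     if kept >= WANTED:
         break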

+1

You may be able to sidestep most of this work entirely:

http://cs.fit.edu/~mmahoney/compression/enwik8.zip

is a ZIP file containing 100 MB of Wikipedia already pulled out for you. The linked file is about 16 MB in size.
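
If that is good enough, fetching and unpacking it takes only a few lines (the local file and directory names here are arbitrary):

 # Sketch: download the enwik8 archive and extract it.
 import urllib
 import zipfile

 urllib.urlretrieve('http://cs.fit.edu/~mmahoney/compression/enwik8.zip', 'enwik8.zip')
 zipfile.ZipFile('enwik8.zip').extractall('enwik8_data')
 # enwik8 is the first 10^8 bytes of an English Wikipedia XML dump.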

0

Check out the DBpedia project.

There are small downloadable chunks that include at least some article URLs. Once you have parsed out 10,000 URLs, you can download the pages in batches...
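
A minimal sketch of that batch step, assuming you have already extracted a plain list of article URLs from one of the DBpedia downloads into a text file (the input filename and the 10,000 cap are assumptions):

 # Sketch: batch-download pages from a plain-text list of article URLs, one URL per line.
 import time
 import urllib2

 with open('article_urls.txt') as f:
     urls = [line.strip() for line in f if line.strip()][:10000]

 for i, url in enumerate(urls):
     try:
         html = urllib2.urlopen(url).read()
     except urllib2.URLError:
         continue
     out = open('batch_page_%d.html' % i, 'w')
     out.write(html)
     out.close()
     time.sleep(1)   # throttle requests to be polite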

0

I know this was asked a long time ago, but for anyone still looking for an efficient way to crawl and download a large number of Wikipedia pages (or all of Wikipedia) without violating robots.txt, the Webb library is useful. Here is the link:

Webb library for crawling and scraping web pages

0
