For the NLP part of my project, I want to download a large number of pages from Wikipedia (say 10,000) without loading the whole XML dump. This is what I can think of:
1. Open a Wikipedia page.
2. Parse the HTML of that page for links and open each linked page.
3. Recursively open the links on the pages retrieved in step 2.

In steps 2 and 3, I will stop once I have reached the desired number of pages.
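A minimal sketch of this crawl, assuming Python 3 and only the standard library (the seed article, the User-Agent string, and the crude href regex below are placeholders of mine, not a fixed design):

```python
# Rough sketch of the crawl described above: start from one article,
# pull out /wiki/ links, and keep fetching until enough pages are collected.
import collections
import re
import urllib.parse
import urllib.request

START = "https://en.wikipedia.org/wiki/Natural_language_processing"  # placeholder seed
TARGET_PAGES = 10000

def fetch(url):
    # Wikipedia's User-Agent policy asks for a descriptive header.
    req = urllib.request.Request(url, headers={"User-Agent": "nlp-crawler-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def article_links(html):
    # Crude extraction: keep /wiki/ links, skip namespaces like File: or Talk:.
    for href in re.findall(r'href="(/wiki/[^"#]+)"', html):
        if ":" not in href:
            yield urllib.parse.urljoin("https://en.wikipedia.org", href)

def crawl(start=START, target=TARGET_PAGES):
    seen = set()
    queue = collections.deque([start])
    pages = {}
    while queue and len(pages) < target:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)                    # steps 1-2: open the page
        pages[url] = html
        queue.extend(article_links(html))    # step 3: follow links (breadth-first)
    return pages
```

Note that a regex over raw HTML is brittle; a real crawler would probably use an HTML parser and rate-limit its requests.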
How would you go about this? Please suggest the best approach you can think of.
ANSWER: This is my Python code:
# Get 10000 random pages from Wikipedia.
import urllib2
import os
import shutil
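The snippet above is cut off after the imports. A minimal sketch of how such a script might continue, assuming it repeatedly follows the Special:Random redirect and writes each article to disk (the output directory, file naming, and page count are my placeholders, and it uses Python 3's urllib.request rather than the urllib2 imported above):

```python
# Minimal sketch (not the original script): follow the Special:Random
# redirect repeatedly and save each article's HTML to a directory.
import os
import urllib.request

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
OUT_DIR = "wiki_pages"   # placeholder output directory
N_PAGES = 10000

def download_random_pages(n=N_PAGES, out_dir=OUT_DIR):
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n):
        req = urllib.request.Request(
            RANDOM_URL, headers={"User-Agent": "nlp-downloader-sketch/0.1"}
        )
        # Each request to Special:Random redirects to a different article.
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
        with open(os.path.join(out_dir, "page_%05d.html" % i), "wb") as f:
            f.write(html)

if __name__ == "__main__":
    download_random_pages()
```

Special:Random returns one article per request, so 10,000 pages means 10,000 HTTP requests; adding a short pause between them would be polite to the servers.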