I am trying to write a Python program that can look up people's birth and death dates on Wikipedia.
For example, Albert Einstein was born: March 14, 1879; died: April 18, 1955.
I started with the question "Get Wikipedia article with Python":
    import urllib2

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
    page2 = infile.read()
This works as far as it goes: page2 is the XML representation of the lead section (section 0) of the Albert Einstein Wikipedia page.
Now that I have the page in XML format, I looked at this tutorial: http://www.travisglines.com/web-coding/python-xml-parser-tutorial , but I don't understand how to get the information I want (the dates of birth and death) out of the XML. I feel that I must be close, and yet I do not know how to proceed from here.
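For the XML step, a minimal sketch with the standard library's `xml.etree.ElementTree` might look like the following. It assumes the API response wraps the wikitext in a `<rev>` element nested under `<api>`, `<query>`, `<pages>`, `<page>` and `<revisions>`; the `sample_xml` string is a shortened, hypothetical stand-in for `page2`, and `extract_wikitext` is an illustrative name, not part of any library. It is written for Python 3, but nothing in it is specific to that version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for page2 (the real response is much longer).
sample_xml = (
    '<api><query><pages>'
    '<page title="Albert Einstein">'
    '<revisions><rev xml:space="preserve">'
    '{{Infobox scientist | name = Albert Einstein }}'
    '</rev></revisions>'
    '</page></pages></query></api>'
)


def extract_wikitext(xml_text):
    """Return the text of the first <rev> element, or None if there is none."""
    root = ET.fromstring(xml_text)
    rev = root.find('.//rev')  # search the whole tree for the first <rev>
    return rev.text if rev is not None else None


wikitext = extract_wikitext(sample_xml)
print(wikitext)
```

The result is still raw wikitext (the infobox markup), so the dates themselves need a second parsing step.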
EDIT
After a few answers, I installed BeautifulSoup. I'm at the stage where I can print:
    import BeautifulSoup as BS

    soup = BS.BeautifulSoup(page2)
    print soup.getText()

Output:

    {{Infobox scientist
    | name = Albert Einstein
    | image = Einstein 1921 portrait2.jpg
    | caption = Albert Einstein in 1921
    | birth_date = {{Birth date|df=yes|1879|3|14}}
    | birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
    | death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
    | death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
    | spouse = [[Mileva Marić]]&nbsp;(1903–1919)<br>{{nowrap|[[Elsa Löwenthal]]&nbsp;(1919–1936)}}
    | residence = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
    | citizenship = {{Plainlist|
    * [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
    * [[Statelessness|Stateless]] (1896–1901)
    * [[Switzerland]] (1901–1955)
    * [[Austria–Hungary|Austria]] (1911–1912)
    * [[German Empire|Germany]] (1914–1933)
    * United States (1940–1955)
    }}
So I am much closer, but I still don't know how to get death_date out of this format. Do I have to start parsing things with re? I could do that, but I feel I would be using the wrong tool for this job.
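For what it's worth, if `re` does turn out to be acceptable, a rough sketch could look like this. It assumes the infobox field holds a single, non-nested template such as `{{Birth date|...}}` or `{{Death date and age|...}}` whose first three numeric parameters are year, month and day, as in the output above; free-text dates or nested templates would not match. `infobox_date` and `sample` are hypothetical names made up for this illustration.

```python
import re
from datetime import date

# Shortened copy of the infobox text printed above.
sample = (
    "{{Infobox scientist\n"
    "| name = Albert Einstein\n"
    "| birth_date = {{Birth date|df=yes|1879|3|14}}\n"
    "| death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}\n"
    "}}"
)


def infobox_date(wikitext, field):
    """Return the date in an infobox field such as 'birth_date', or None."""
    # Capture everything between the {{ and }} that follow "| <field> =".
    pattern = r'\|\s*' + re.escape(field) + r'\s*=\s*\{\{([^{}]*)\}\}'
    match = re.search(pattern, wikitext)
    if match is None:
        return None
    # Template parameters are separated by "|". Skip the template name and
    # named options such as "df=yes"; keep only the purely numeric ones.
    numbers = [int(p) for p in match.group(1).split('|') if p.strip().isdigit()]
    if len(numbers) < 3:
        return None
    year, month, day = numbers[:3]
    return date(year, month, day)


print(infobox_date(sample, 'birth_date'))  # 1879-03-14
print(infobox_date(sample, 'death_date'))  # 1955-04-18
```

A dedicated wikitext parser (mwparserfromhell is one such library) would likely be more robust than a regular expression for anything beyond this simple case.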
python wikipedia wikipedia-api mediawiki-api mediawiki
JBWhitmore