Parse birth and death dates from Wikipedia?

I am trying to write a Python program that can look up people's birth and death dates on Wikipedia.

For example, Albert Einstein was born March 14, 1879 and died April 18, 1955.

I started with Get Wikipedia article with Python

 import urllib2

 opener = urllib2.build_opener()
 opener.addheaders = [('User-agent', 'Mozilla/5.0')]
 infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
 page2 = infile.read()

This works, as far as it goes; page2 is the XML representation of the lead section of the Wikipedia page on Albert Einstein.

Now that I have the page in XML format, I looked at this tutorial (http://www.travisglines.com/web-coding/python-xml-parser-tutorial), but I don't understand how to pull the information I want (birth and death dates) out of the XML. I feel like I must be close, and yet I don't know how to proceed from here.

EDIT

After reading a few of the answers, I installed BeautifulSoup. I'm now at the stage where I can print:

 import BeautifulSoup as BS

 soup = BS.BeautifulSoup(page2)
 print soup.getText()

 {{Infobox scientist
 | name = Albert Einstein
 | image = Einstein 1921 portrait2.jpg
 | caption = Albert Einstein in 1921
 | birth_date = {{Birth date|df=yes|1879|3|14}}
 | birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
 | death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
 | death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
 | spouse = [[Mileva Marić]] (1903–1919)<br>{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
 | residence = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
 | citizenship = {{Plainlist|
 * [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
 * [[Statelessness|Stateless]] (1896–1901)
 * [[Switzerland]] (1901–1955)
 * [[Austria–Hungary|Austria]] (1911–1912)
 * [[German Empire|Germany]] (1914–1933)
 * United States (1940–1955)
 }}

So, much closer, but I still don't know how to return the death_date from this format. Should I just start parsing things with re? I can do that, but it feels like I'd be using the wrong tool for this job.

+8
python wikipedia wikipedia-api mediawiki-api mediawiki
5 answers

You can use a library like BeautifulSoup or lxml to parse the HTML/XML response.

You can also take a look at Requests, which has a much cleaner API for making HTTP requests.
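For example, the urllib2 snippet from the question could look roughly like this with Requests (a minimal sketch on my part; same API call, different HTTP client):

 import requests

 # Same API call as the question's urllib2 version, with requests handling
 # the headers and response decoding.
 url = ('http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
        '&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
 res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
 page2 = res.text  # the XML response as a string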


Here is working code using Requests, BeautifulSoup and re. It is perhaps not the best solution, but it is quite flexible and can be extended to similar tasks:

 import re
 import requests
 from bs4 import BeautifulSoup

 url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'
 res = requests.get(url)
 soup = BeautifulSoup(res.text, "xml")
 wikitext = soup.revisions.getText()

 # {{Birth date|df=yes|1879|3|14}} -- strip the trailing }} before splitting
 birth_re = re.search(r'(Birth date(.*?)}})', wikitext)
 birth_data = birth_re.group(0).rstrip('}').split('|')
 birth_year = birth_data[2]
 birth_month = birth_data[3]
 birth_day = birth_data[4]

 # {{Death date and age|df=yes|1955|4|18|1879|3|14}}
 death_re = re.search(r'(Death date(.*?)}})', wikitext)
 death_data = death_re.group(0).rstrip('}').split('|')
 death_year = death_data[2]
 death_month = death_data[3]
 death_day = death_data[4]

Following @JBernardo's suggestion to use JSON and mwparserfromhell, the better answer for this particular use case is:

 import requests
 import mwparserfromhell

 url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'
 res = requests.get(url)
 text = res.json["query"]["pages"].values()[0]["revisions"][0]["*"]

 wiki = mwparserfromhell.parse(text)

 birth_data = wiki.filter_templates(matches="Birth date")[0]
 birth_year = birth_data.get(1).value
 birth_month = birth_data.get(2).value
 birth_day = birth_data.get(3).value

 death_data = wiki.filter_templates(matches="Death date")[0]
 death_year = death_data.get(1).value
 death_month = death_data.get(2).value
 death_day = death_data.get(3).value
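One caveat: that snippet is written against the Requests and Python of 2012, where res.json was an attribute and dict.values() returned a list. A rough sketch of the same idea on current Requests and Python 3 (not tested against today's API output) would be:

 import requests
 import mwparserfromhell

 url = ('https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
        '&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json')
 res = requests.get(url)

 # res.json() is now a method, and the dict view needs list() in Python 3.
 text = list(res.json()["query"]["pages"].values())[0]["revisions"][0]["*"]

 wiki = mwparserfromhell.parse(text)
 birth = wiki.filter_templates(matches="Birth date")[0]
 print(birth.get(1).value, birth.get(2).value, birth.get(3).value)  # 1879 3 14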
+8

First: the Wikipedia API allows you to request JSON instead of XML, and that will make things much simpler.

Second: there is no need to use an HTML/XML parser at all (the content is not HTML, and the container does not need to be either). What you have to parse is the wiki markup inside the "revisions" field of the JSON.

Check out some wiki parsers here.


What seems to be confusing here is that the API lets you request a specific container format (XML or JSON), but that is just a container for some text in the real format you want to parse.

This: {{Birth date|df=yes|1879|3|14}}

With one of the parsers listed in the link above, that becomes straightforward.
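To make the container point concrete, here is a rough sketch (using Requests, which is an assumption on my part) that unwraps the JSON envelope and leaves you with the wiki markup that still has to be parsed:

 import requests

 url = ('http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
        '&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json')
 data = requests.get(url).json()

 # The JSON/XML choice only affects the envelope; the text you actually need to
 # parse is the wiki markup stored under query -> pages -> <pageid> -> revisions.
 page = list(data["query"]["pages"].values())[0]
 wikitext = page["revisions"][0]["*"]
 print(wikitext[:80])  # begins with the {{Infobox scientist ...}} markup shown in the question

From there, hand wikitext to whichever wiki parser you pick.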

+5

First, use pywikipedia. It lets you query article text, template parameters and so on through a high-level abstract interface. Second, I would go with the Persondata template (look towards the end of the article). Also, in the longer run you may be interested in Wikidata, which will take several months to arrive, but it will make most of the metadata in Wikipedia articles easily queryable.
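A rough sketch of that approach with pywikibot, the successor to pywikipedia (hedged: the helper names below are from memory and may differ between versions):

 import pywikibot
 from pywikibot import textlib

 # pywikibot normally expects a user-config.py; setting PYWIKIBOT_NO_USER_CONFIG=1
 # in the environment lets it run without one.
 site = pywikibot.Site('en', 'wikipedia')
 page = pywikibot.Page(site, 'Albert Einstein')

 # Walk the templates on the page and pick out the birth/death date ones.
 for name, params in textlib.extract_templates_and_params(page.text):
     if name.strip().lower().startswith(('birth date', 'death date')):
         # Positional parameters '1', '2', '3' hold year, month, day.
         print(name.strip(), params.get('1'), params.get('2'), params.get('3'))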

+4

The Persondata template is now deprecated, and you should access Wikidata instead; see Wikidata:Data access (a sketch of one access route follows after the old text below). My previous, now deprecated, answer from 2012 was as follows:

What you need to do is parse the {{persondata}} template found in most biographical articles. There are existing tools for extracting such data programmatically; with your existing knowledge and the other helpful answers, I am sure you can make that work.
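Coming back to the update above: one of the access routes described at Wikidata:Data access is the SPARQL query service. A minimal sketch (the endpoint and the property IDs P569/P570 are real Wikidata identifiers; the rest is illustrative):

 import requests

 # Ask the Wikidata Query Service for Einstein's (Q937) birth (P569) and death (P570).
 query = """
 SELECT ?birth ?death WHERE {
   wd:Q937 wdt:P569 ?birth .
   wd:Q937 wdt:P570 ?death .
 }
 """
 res = requests.get('https://query.wikidata.org/sparql',
                    params={'query': query, 'format': 'json'})
 row = res.json()['results']['bindings'][0]
 print(row['birth']['value'])  # 1879-03-14T00:00:00Z
 print(row['death']['value'])  # 1955-04-18T00:00:00Z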

+1

One alternative in 2019 is to use the Wikidata API, which, among other things, provides biographical data such as dates of birth and death in a structured format that is very easy to consume without any custom parsers. Many Wikipedia articles take their information from Wikidata, so in many cases it is the same data you would get from Wikipedia.

For example, if you look at Albert Einstein's Wikidata page and find "date of birth" and "date of death", you will see that they match what is on Wikipedia. Every entity in Wikidata has a list of "statements", which are pairs of "properties" and "values". To find out when Einstein was born and died, we only need to look up the appropriate properties in that list of statements, in this case P569 and P570. To do this programmatically, it is best to fetch the entity as JSON, which you can do with the following URL structure:

https://www.wikidata.org/wiki/Special:EntityData/Q937.json

As an example, here is what the P569 claim looks like for Einstein:

  "P569": [ { "mainsnak": { "property": "P569", "datavalue": { "value": { "time": "+1879-03-14T00:00:00Z", "timezone": 0, "before": 0, "after": 0, "precision": 11, "calendarmodel": "http://www.wikidata.org/entity/Q1985727" }, "type": "time" }, "datatype": "time" }, "type": "statement", 

You can learn more about accessing Wikidata in this article, and about how dates are structured in Help:Dates.

0