How to scrape data from multiple Wikipedia pages using Python?

I want to capture the age, place of birth, and previous occupation of senators. The information for each individual senator is available on Wikipedia, on their respective pages, and there is another page with a table listing all senators by name. How can I go through that list, follow the links to each senator's page, and get the information I want?

Here is what I have done so far.

1. (No Python.) I found out that DBpedia exists and wrote a query to search for senators. Unfortunately, DBpedia has not classified most (if any) of them as senators:

    SELECT ?senator, ?country WHERE {
        ?senator rdf:type <http://dbpedia.org/ontology/Senator> .
        ?senator <http://dbpedia.org/ontology/nationality> ?country
    }

The query results are unsatisfactory.
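For anyone who wants to reproduce that step from Python rather than the DBpedia web endpoint, here is a minimal sketch using the SPARQLWrapper package (an assumption on my part; I did not use Python for this step):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Sketch: run the same query against DBpedia's public SPARQL endpoint.
    sparql = SPARQLWrapper('http://dbpedia.org/sparql')
    sparql.setQuery('''
        SELECT ?senator ?country WHERE {
            ?senator rdf:type <http://dbpedia.org/ontology/Senator> .
            ?senator <http://dbpedia.org/ontology/nationality> ?country
        }
    ''')
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results['results']['bindings']:
        print(binding['senator']['value'], binding['country']['value'])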

2. It turned out that there is a Python module called wikipedia that allows me to search for and retrieve information from individual wiki pages. I used it to get a list of senator names from the table by looking at its hyperlinks.

    import wikipedia as w

    w.set_lang('pt')
    # Grab the page with the table of senator names.
    s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
    # Get links to senator names by removing links of no interest.
    # For each link in the page, check if it is a link to a senator page.
    senators = [name for name in s.links if not
                # Senator names contain neither digits nor commas,
                (any(char.isdigit() or char == ',' for char in name) or
                 # and full names always contain spaces.
                 ' ' not in name)]

At this point I am a little lost. The resulting list contains the names of all the senators, but also other names, for example the names of parties. The wikipedia module (at least from what I could find in its API documentation) also does not implement functionality for following links or reading tables.
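The closest I can get with the module alone, if I understand its API correctly, is to fetch the raw HTML of a single page and hand it to an HTML parser myself; a rough sketch (the page title below is just a placeholder, not a real senator page):

    import wikipedia as w
    from bs4 import BeautifulSoup

    w.set_lang('pt')
    # Placeholder title; in practice this would be one of the names in `senators`.
    page = w.page('Nome do Senador')
    soup = BeautifulSoup(page.html(), 'html.parser')
    # ...but I still need a way to do this for every senator in the list
    # and to pull the right cells out of each page's infobox table.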

I found two related questions here on Stack Overflow that seem useful, but both of them (here and here) extract information from a single page only.

Can someone point me to a solution?

Thanks!

1 answer

Ok, so I figured it out (thanks to the comment pointing to BeautifulSoup).

In fact, there is no big secret to achieving what I wanted. I just had to go through the list with BeautifulSoup and save all the links, then open each saved link with urllib2, call BeautifulSoup on the response, and... done. Here is the solution:

    import urllib2 as url
    import wikipedia as w
    from bs4 import BeautifulSoup as bs
    import re

    # A dictionary to store the data we'll retrieve.
    d = {}

    # 1. Grab the list from wikipedia.
    w.set_lang('pt')
    s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
    html = url.urlopen(s.url).read()
    soup = bs(html, 'html.parser')

    # 2. Names and links are in the second column of the second table.
    table2 = soup.findAll('table')[1]
    for row in table2.findAll('tr'):
        for colnum, col in enumerate(row.find_all('td')):
            if (colnum + 1) % 5 == 2:
                a = col.find('a')
                link = 'https://pt.wikipedia.org' + a.get('href')
                d[a.get('title')] = {}
                d[a.get('title')]['link'] = link

    # 3. Now that we have the links, we can iterate through them
    #    and grab the info from the infobox tables.
    for senator, data in d.iteritems():
        page = bs(url.urlopen(data['link']).read(), 'html.parser')
        # (flatten-list trick: [a for b in nested for a in b])
        rows = [item for table in
                [item.find_all('td') for item in page.find_all('table')[0:3]]
                for item in table]
        for rownumber, row in enumerate(rows):
            if row.get_text() == 'Nascimento':
                birthinfo = rows[rownumber + 1].getText().split('\n')
                try:
                    d[senator]['birthplace'] = birthinfo[1]
                except IndexError:
                    d[senator]['birthplace'] = ''
                birth = re.search(r'(.*\d{4}).*\((\d{2}).*\)', birthinfo[0])
                d[senator]['birthdate'] = birth.group(1)
                d[senator]['age'] = birth.group(2)
            if row.get_text() == 'Partido':
                d[senator]['party'] = rows[rownumber + 1].getText()
            if 'Profiss' in row.get_text():
                d[senator]['profession'] = rows[rownumber + 1].getText()

Pretty simple. BeautifulSoup works wonders =)
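One caveat for anyone reusing this: the snippet above is Python 2 (urllib2, dict.iteritems()). Under Python 3 the fetching and iteration parts need small changes; a rough, untested sketch of the equivalents:

    # Python 3 equivalents of the Python 2 bits above (untested sketch).
    from urllib.request import urlopen
    from bs4 import BeautifulSoup as bs

    html = urlopen(s.url).read()      # urllib.request.urlopen replaces urllib2.urlopen
    soup = bs(html, 'html.parser')

    # ...and iterate with .items() instead of .iteritems():
    for senator, data in d.items():
        page = bs(urlopen(data['link']).read(), 'html.parser')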

