I am trying to create a web scraper that will parse a publications web page and extract the authors. The skeletal structure of the page is as follows:
<html>
  <body>
    <div id="container">
      <div id="contents">
        <table>
          <tbody>
            <tr>
              <td class="author">####I want whatever is located here ###</td>
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </body>
</html>
So far I have been using BeautifulSoup and lxml for this, but I am not sure how to handle the two div tags and the td tag, because they have attributes. I am also not sure whether I should rely more on BeautifulSoup, on lxml, or on a combination of both. What should I do?
At the moment, my code is as follows:
import re
import urllib2, sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html = soup.prettify()
html = html.replace(' ', ' ')
html = html.replace('í', 'í')
root = fromstring(html)
I understand that many of the import statements may be redundant; I just copied what I had in my larger source file.
EDIT: I suppose I didn't make it clear enough, but the page contains several of these tags, and I want to extract all of them.
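For illustration, here is a minimal sketch of one way the author cells might be selected using lxml alone, with an XPath expression that matches on the id and class attributes. It is written for Python 3, so it is not a drop-in change to the Python 2 code above. The sample markup, the author names, and the helper name extract_authors are made up for this example; the real page may need a looser selector (for instance, if the td carries more than one class).

```python
from lxml import html as lxml_html

# Hypothetical sample page that mirrors the skeleton described in the question.
SAMPLE = """
<html><body>
<div id="container"><div id="contents">
<table><tbody>
<tr><td class="author">Alice Example</td></tr>
<tr><td class="author"> Bob Example </td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table>
</div></div>
</body></html>
"""


def extract_authors(page_source):
    """Return the stripped text of every td.author cell inside div#contents."""
    root = lxml_html.fromstring(page_source)
    # Attributes are matched with [@name="value"] predicates in XPath.
    cells = root.xpath('//div[@id="contents"]//td[@class="author"]')
    return [cell.text_content().strip() for cell in cells]


authors = extract_authors(SAMPLE)
print(authors)
```

Because xpath() returns a list of all matching elements, the same call covers a page with many author cells. On a real page, the string passed to extract_authors would be the downloaded HTML rather than SAMPLE.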
python lxml screen-scraping beautifulsoup
Gobiaskoffi