Python Web Scraper with Attribute HTML Tags

I am trying to create a web scraper that will parse a publications webpage and extract authors. The skeletal structure of a web page is as follows:

    <html>
      <body>
        <div id="container">
          <div id="contents">
            <table>
              <tbody>
                <tr>
                  <td class="author">####I want whatever is located here ###</td>
                </tr>
              </tbody>
            </table>
          </div>
        </div>
      </body>
    </html>

So far I have been trying to use BeautifulSoup and lxml for this, but I am not sure how to handle the two div tags and the td tag, because they have attributes. I am also not sure whether I should rely on BeautifulSoup, on lxml, or on a combination of both. What should I do?

At the moment, my code is as follows:

    import re
    import urllib2, sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address = 'http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html = soup.prettify()
    html = html.replace('&nbsp', '&#160')
    html = html.replace('&iacute', '&#237')
    root = fromstring(html)

I understand that many of the import statements may be redundant, but I just copied what I had in a larger source file.

EDIT: I suppose I didn't make this clear enough, but there are multiple such tags on the page that I want to scrape.

+7
python lxml screen-scraping beautifulsoup
4 answers

It is not clear to me from your question why you need to worry about div tags - how about just:

    soup = BeautifulSoup(html)
    thetd = soup.find('td', attrs={'class': 'author'})
    print thetd.string

On the HTML you give, running this prints exactly:

 ####I want whatever is located here ### 

which appears to be what you want. Maybe you can specify more precisely what you need that this super-simple snippet doesn't do: multiple td tags of class author, of which you need to consider some subset (all? just some?), pages where no such tag is present (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are from just this simple example and the overabundant code ;-).

Edit: if, as per the OP's latest comment, there are several such td tags, one per author:

    thetds = soup.findAll('td', attrs={'class': 'author'})
    for thetd in thetds:
        print thetd.string

...i.e., not much harder! ;-)
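The snippets in this answer target Python 2 and BeautifulSoup 3. For comparison, here is a minimal sketch of the same "collect every td of class author" idea on Python 3, using only the standard library's html.parser. The names AuthorCellParser and extract_authors, and the sample rows, are invented for illustration and are not part of the question or of any library.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.authors = []     # finished cell texts
        self._chunks = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside an author cell
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._chunks is not None:
            self.authors.append("".join(self._chunks).strip())
            self._chunks = None


def extract_authors(html):
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Invented sample data mirroring the skeleton from the question
sample = """
<div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Ada Lovelace</td></tr>
<tr><td class="author">Alan Turing</td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table></div></div>
"""
print(extract_authors(sample))  # ['Ada Lovelace', 'Alan Turing']
```

As in the BeautifulSoup version, the enclosing div tags are simply ignored here; only the td tag and its class attribute matter for the lookup.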

+11

Alternatively, you can use pyquery, since BeautifulSoup 3 is no longer actively supported; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

install pyquery first with

 easy_install pyquery 

then your script may be as simple as

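    from pyquery import PyQuery

    d = PyQuery('http://mywebpage/')
    # .items() yields PyQuery objects, which provide the .text() method;
    # iterating d('td.author') directly yields plain lxml elements instead
    allauthors = [td.text() for td in d('td.author').items()]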

pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's API. It uses lxml under the hood and is much faster than BeautifulSoup. On the other hand, BeautifulSoup is pure Python and therefore works on Google App Engine.

+6

The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.

You should let the library handle the XML details, such as escaped entities:

    import lxml.html

    html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
              <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
              </tr></tbody></table></div></div></body></html>"""

    root = lxml.html.fromstring(html)
    tds = root.cssselect("div#contents td.author")

    print tds           # gives [<Element td at 84ee2cc>]
    print tds[0].text   # what you want, including the 'í'

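
The selector div#contents td.author combines an id match on the div with a class match on the td, which is the part the question asks about. The same attribute filtering can be sketched with the standard library's xml.etree.ElementTree, under the assumption that the markup is well-formed XML (real pages often are not, which is where lxml.html earns its keep). The sample page and author names below are invented.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed markup; ElementTree raises ParseError otherwise.
page = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">First Author</td>
<td class="author">Second Author</td>
</tr></tbody></table></div></div></body></html>"""

root = ET.fromstring(page)

# [@id='contents'] and [@class='author'] filter on attribute values;
# '//' matches descendants at any depth below the current element.
cells = root.findall(".//div[@id='contents']//td[@class='author']")
authors = [cell.text.strip() for cell in cells]
print(authors)  # ['First Author', 'Second Author']
```

Note that [@class='author'] compares the whole attribute value, so unlike the CSS selector td.author it would not match a cell written as class="author first".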
+5

BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet to match, then instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of a larger search expression:

    from pyparsing import makeHTMLTags, withAttribute, SkipTo

    author_td, end_td = makeHTMLTags("td")

    # only interested in <td> where class="author"
    author_td.setParseAction(withAttribute(("class", "author")))

    search = author_td + SkipTo(end_td)("body") + end_td

    for match in search.searchString(html):
        print match.body

Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:

  • caseless matching of tags
  • "<tag/>" syntax
  • zero or more attributes in the opening tag
  • attributes in arbitrary order
  • attribute names with namespaces
  • attribute values in single quotes, double quotes, or no quotes
  • intervening whitespace between the tag and symbols, or between attribute name, '=' and value
  • attributes that are accessible after parsing as named results

These are the common pitfalls when considering using regular expressions to scrape HTML.
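
To make a few of the variations listed above concrete, here is a small sketch showing how a regular expression that hard-codes one spelling of the opening tag misses equivalent markup. The three sample strings and the pattern are invented for illustration.

```python
import re

# Invented sample: three equivalent spellings of the same table cell.
variants = [
    '<td class="author">A. Author</td>',                    # the spelling the regex expects
    "<TD CLASS='author'>A. Author</TD>",                    # upper case, single quotes
    '<td  align="left"  class = "author" >A. Author</td>',  # extra attribute and whitespace
]

# A naive pattern that hard-codes one spelling of the opening tag.
naive = re.compile(r'<td class="author">(.*?)</td>')

matched = [bool(naive.search(v)) for v in variants]
print(matched)  # [True, False, False]
```

Each miss can be patched individually, but the pattern grows with every case, which is the argument for using a tag-aware expression or a real parser instead.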

+1
