How to extract links from a webpage using lxml, XPath and Python?

I have this XPath expression:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

In the Firefox XPath Checker add-on it retrieves all the links that have a title attribute and returns their href values.

However, I cannot get the same expression to work with lxml.

from lxml import etree
parsedPage = etree.HTML(page)  # Create parse tree; 'page' holds the fetched HTML.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print x  # Print the href of each <a> tag that has a title attribute.

This produces no result with lxml (an empty list).
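
For reference, one way to debug this is to serialize the tree lxml actually built and check whether the elements the expression expects are really there (a minimal sketch; 'page' is assumed to hold the fetched HTML):

from lxml import etree

parsedPage = etree.HTML(page)
# Dump the parsed tree; if no <tbody> shows up here,
# any XPath step on tbody can never match.
print(etree.tostring(parsedPage, pretty_print=True))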

How would I get the href of hyperlinks that have a title attribute using lxml under Python?

2 answers

I managed to get it to work with the following code:

from lxml import etree
from StringIO import StringIO

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

tree = etree.parse(StringIO(html_string))
print tree.xpath('/html/body//tbody/tr/td/a[@title]/@href')

# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
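
Note that etree.parse here treats the string as well-formed XML, which happens to work for this snippet. For real-world pages that are not valid XML, lxml's forgiving HTML parser is usually safer; a minimal sketch under that assumption, reusing the same html_string:

from lxml import html

tree = html.fromstring(html_string)
# A relative // search matches whether or not the parser kept a tbody element.
print(tree.xpath('//tr/td/a[@title]/@href'))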

Firefox inserts <tbody> elements into tables when it builds its DOM, even when they do not appear in the HTML source. So an XPath expression copied from Firefox tools such as Firebug or XPath Checker may reference a <tbody> that simply is not present in the raw HTML fetched with urllib/urllib2, and the query matches nothing.

Remove the <tbody> step from the expression and it should work.
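
For example, the expression from the question with the tbody step removed (a sketch; parsedPage is the tree built in the question):

hyperlinks = parsedPage.xpath("/html/body//tr/td/a[@title]/@href")
for x in hyperlinks:
    print x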
