How to parse an HTML table against a list of variables using lxml?

Question


I am trying to parse an HTML table using lxml. rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') retrieves all of the values, but I want to retrieve the contents of a cell only when it starts with one of the variables in my configuration file. For example, if a <td> starts with "Street 1", then I want to capture the contents of the <span> inside that <td>. That way, I end up with a tuple of tuples (which takes care of the None values) that I can store in the database.

lxml_parse.py

    import lxml.html as lh

    doc = open('test.htm', 'r')
    outhtml = lh.parse(doc)
    doc.close()
    rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
    print rows

test.htm

    <tr>
      <td></td>
      <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
      </td>
      <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
      </td>
    </tr>
    <tr>
      <td></td>
      <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
      </td>
      <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
      </td>
      <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
      </td>
      <td>
        Zip:<br />
        <span class="boldred">10022</span>
      </td>
    </tr>

Output:

    $ python lxml_parse.py
    ['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against multiple variables is what I'm having problems with:

    import lxml.html as lh

    desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']
    doc = open('test.htm', 'r')
    outhtml = lh.parse(doc)
    doc.close()
    myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()'))
                   for var in desiredvars)
    print myresultset

+1

python html lxml

Thinkcode May 17 '12 at 19:46


3 answers

The goal is to build this dictionary:

 {'City:': 'NYC', 'Zip:': '10022', 'Street 1:': '2100 5th Ave', 'Country:': 'USA', 'State:': 'NY', 'Street 2:': 'Ste 202'}

You can use the code below. Once the dictionary is built, it is easy to query it for the values you need:

    import lxml.html as lh

    test = '''<tr>
      <td></td>
      <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
      </td>
      <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
      </td>
    </tr>
    <tr>
      <td></td>
      <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
      </td>
      <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
      </td>
      <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
      </td>
      <td>
        Zip:<br />
        <span class="boldred">10022</span>
      </td>
    </tr>'''

    outhtml = lh.fromstring(test)
    ks = [k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip() != '']
    vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
    result = dict(zip(ks, vs))
    print result
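Querying that dictionary against the list of variables from the question could look roughly like the sketch below. It assumes the keys keep their trailing colon, as in the dictionary shown above; the helper name rows_for and the extra 'Fax' label are hypothetical and only illustrate a missing value. It uses print() so that it also runs on Python 3.

```python
# The dictionary from the answer above, copied here so the sketch runs on its own.
result = {'City:': 'NYC', 'Zip:': '10022', 'Street 1:': '2100 5th Ave',
          'Country:': 'USA', 'State:': 'NY', 'Street 2:': 'Ste 202'}

# 'Fax' is a hypothetical label that is not in the table.
desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip', 'Fax']


def rows_for(labels, parsed):
    """Return a tuple of (label, value) pairs; value is None when the label is absent."""
    # The parsed keys end with ':', so append it before looking the label up.
    return tuple((label, parsed.get(label + ':')) for label in labels)


rows = rows_for(desiredvars, result)
print(rows)
```

Because zip pairs keys and values by position, the dictionary itself is only reliable when every labelled cell contains exactly one boldred value.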

+1

gauden May 17 '12 at 20:52


I was searching for the same thing and found your question. Since it has not been answered quite "correctly", I'll add a couple of points:

To reference a variable in XPath, you must use the $var syntax,
In lxml, the variable values are passed as keyword arguments to the xpath() method,
Using child::* is wrong, since you are looking for text directly inside the <td>; text() already selects the text child nodes,
You need to use the XPath contains() function, because the label text is surrounded by whitespace.

With these changes, your corrected code looks like this:

    import lxml.html as lh

    desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']
    doc = open('test.htm', 'r')
    outhtml = lh.parse(doc)
    doc.close()
    myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var))
                   for var in desiredvars]
    print myresultset
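The question also asks for None where a label has no value. A minimal sketch of that on top of the $var approach follows. It assumes a shortened inline copy of test.htm wrapped in a <table>; the helper name first_or_none, the variable name $label, and the 'Fax' label are hypothetical. It is written with print() so that it runs on Python 3.

```python
import lxml.html as lh

# Shortened sample based on test.htm, inlined so the sketch is self-contained.
html = '''<table>
<tr>
  <td>Street 1:<span class="required"> *</span><br />
      <span class="boldred">2100 5th Ave</span></td>
  <td>Street 2:<br /><span class="boldred">Ste 202</span></td>
</tr>
</table>'''

doc = lh.fromstring(html)

# $label is an XPath variable; lxml fills it from the keyword argument below.
QUERY = '//tr/td[contains(text(), $label)]/span[@class="boldred"]/text()'


def first_or_none(label):
    """Return the first matching value for the label, or None if nothing matches."""
    hits = doc.xpath(QUERY, label=label)
    return hits[0] if hits else None


# 'Fax' is a hypothetical label that does not occur in the sample.
desiredvars = ['Street 1', 'Street 2', 'Fax']
result = tuple((var, first_or_none(var)) for var in desiredvars)
print(result)
```

Passing the label as an XPath variable rather than formatting it into the expression string avoids quoting problems when a label itself contains a quote character.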

0

lispmachine Dec 6 '13 at 8:52


Thinkcode · Accepted Answer · 2012-05-21T15:34:38+0000

lxml_tempsofsol.py

    import lxml.html as lh

    desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']
    doc = open('test.htm', 'r')
    outhtml = lh.parse(doc)
    doc.close()
    myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % (var))[0])
                   for var in desiredvars)
    for each in myresultset:
        print each

Output:

    $ python lxml_tempsofsol.py
    ('Street 1', '2100 5th Ave')
    ('Street 2', 'Ste 202')
    ('City', 'NYC')
    ('State', 'NY')
    ('Zip', '10022')
