I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') retrieves all of the values, I only want the contents of a cell when its text starts with one of the labels listed in my configuration file. For example, if a <td> starts with "Street 1", I want to capture the contents of the <span class="boldred"> inside that <td>. That way, I can build a tuple of tuples (with None for any label that has no value) that I can store in the database.
lxml_parse.py
import lxml.html as lh

doc = open('test.htm', 'r')
outhtml = lh.parse(doc)
doc.close()
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print rows
test.htm
<tr>
  <td></td>
  <td colspan="2">
    Street 1:<span class="required"> *</span><br />
    <span class="boldred">2100 5th Ave</span>
  </td>
  <td colspan="2">
    Street 2:<br />
    <span class="boldred">Ste 202</span>
  </td>
</tr>
<tr>
  <td></td>
  <td>
    City:<span class="required"> *</span><br />
    <span class="boldred">NYC</span>
  </td>
  <td>
    State:<br />
    <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
  </td>
  <td>
    Country:<span class="required"> *</span><br />
    <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
  </td>
  <td>
    Zip:<br />
    <span class="boldred">10022</span>
  </td>
</tr>
Output:
$ python lxml_parse.py
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']
Parsing against multiple variables is what I'm having problems with:
import lxml.html as lh

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']
doc = open('test.htm', 'r')
outhtml = lh.parse(doc)
doc.close()
myresultset = (
    (var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()'))
    for var in desiredvars
)
print myresultset
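As far as I can tell, two things go wrong in the attempt above: inside the XPath string, var is read as an element name rather than my Python variable, and the outer parentheses create a generator, so print shows the generator object instead of the values. To make the goal concrete, here is a minimal sketch of the kind of per-label lookup I am after. It is only one possible approach and rests on some assumptions: SAMPLE is a trimmed copy of test.htm that I wrapped in a <table> for parsing, the names SAMPLE, QUERY and extract are made up for illustration, and I match on the first text node of each <td> with starts-with(). The label is passed as an XPath variable ($label) through a keyword argument to xpath().

```python
import lxml.html as lh

# Trimmed copy of test.htm, wrapped in <table> (an assumption for this sketch).
# There is deliberately no "Zip" cell, to show the None case.
SAMPLE = """
<table>
<tr>
  <td></td>
  <td colspan="2">Street 1:<span class="required"> *</span><br />
    <span class="boldred">2100 5th Ave</span></td>
  <td colspan="2">Street 2:<br />
    <span class="boldred">Ste 202</span></td>
</tr>
<tr>
  <td>State:<br />
    <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN></td>
</tr>
</table>
"""

# $label is an XPath variable; lxml binds it from the keyword argument
# passed to xpath() in extract() below.
QUERY = ('.//td[starts-with(normalize-space(text()[1]), $label)]'
         '/span[@class="boldred"]/text()')


def extract(root, labels):
    """Return ((label, value-or-None), ...) for each requested label."""
    pairs = []
    for label in labels:
        hits = root.xpath(QUERY, label=label)
        # Take the first match, or None when no cell starts with the label.
        pairs.append((label, hits[0].strip() if hits else None))
    return tuple(pairs)


root = lh.fromstring(SAMPLE)
result = extract(root, ['Street 1', 'Street 2', 'State', 'Zip'])
print(result)
```

With this sample I would expect each label to be paired with its boldred value, and 'Zip' to be paired with None because the trimmed sample has no such cell. Building the tuple eagerly, rather than with a generator expression, is what lets it be printed or handed to the database layer directly.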