How to do case insensitive Python XPath searches using lxml?

Question

I am trying to match "Country" or "country" using the lower-case function in XPath. translate is rather messy, so I would like to use lower-case. I assume my Python version (2.6.6) has XPath 2.0 support, since lower-case is only available in XPath 2.0.

What I am looking for is how to put lower-case to use in my case. I hope the snippets below are self-explanatory. I want ['USA', 'UK'] as output (both countries in one go, which would happen if lower-case made "Country" and "country" evaluate as the same).

HTML: doc.htm

 <html>
   <table>
     <tr>
       <td> Name of the Country : <span> USA </span> </td>
     </tr>
     <tr>
       <td> Name of the country : <span> UK </span> </td>
     </tr>
   </table>
 </html>

Python:

 import lxml.html as lh

 doc = open('doc.htm', 'r')
 out = lh.parse(doc)
 doc.close()

 print out.xpath('//table/tr/td[text()[contains(. , "Country")]]/span/text()')
 # Prints : [' USA ']
 print out.xpath('//table/tr/td[text()[contains(. , "country")]]/span/text()')
 # Prints : [' UK ']
 print out.xpath('//table/tr/td[lower-case(text())[contains(. , "country")]]/span/text()')
 # Prints : [<Element td at 0x15db2710>]

Update:

 out.xpath('//table/tr/td[text()[contains(translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") , "country")]]/span/text()')

Now the question remains: can I store the translate part in a global variable "handlecase" and use that variable whenever I run an XPath query?

Something like this works:

 handlecase = """translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")"""
 out.xpath('//table/tr/td[text()[contains(%s , "country")]]/span/text()' % (handlecase))

But for simplicity and readability, I want to run it as follows:

 out.xpath('//table/tr/td[text()[contains(handlecase , "country")]]/span/text()')

+4

python html-parsing xpath lowercase lxml

Thinkcode Jun 27 '12 at 14:40

2 answers

I believe the easiest way to get what you want is to write an XPath extension function.

That way you can write either a lower-case() function or a case-insensitive search function.

You can find the details here: http://lxml.de/extensions.html
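A minimal sketch of that idea, assuming Python 3 and lxml. The helper name lower_case and the inline HTML string are illustrative only; etree.FunctionNamespace is the registration mechanism described on the linked page. lxml evaluates XPath 1.0, so lower-case() is available here only because the sketch registers it.

```python
import lxml.html as lh
from lxml import etree

# Illustrative copy of the question's doc.htm.
HTML = """<html><table>
<tr><td> Name of the Country : <span> USA </span></td></tr>
<tr><td> Name of the country : <span> UK </span></td></tr>
</table></html>"""


def lower_case(context, value):
    """Hypothetical XPath extension: lower-case a string or a node-set."""
    # lxml hands a node-set to Python as a list; like XPath's string(),
    # only the first node is used.
    if isinstance(value, list):
        value = value[0] if value else ""
    # Element nodes have no .lower(); use their concatenated text instead.
    if not isinstance(value, str):
        value = "".join(value.itertext())
    return value.lower()


# Register the function without a namespace prefix.
# Note: this registration is global to the process.
ns = etree.FunctionNamespace(None)
ns["lower-case"] = lower_case

doc = lh.fromstring(HTML)
result = doc.xpath('//td[contains(lower-case(text()), "country")]/span/text()')
print([s.strip() for s in result])  # ['USA', 'UK']
```

Because the function receives the raw node-set, the choice to use only the first text node is a design decision of this sketch; a case-insensitive contains() helper taking two arguments could be registered the same way.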

+6

stranac Jun 27 '12 at 18:23

Dimitre Novatchev · Accepted Answer · 2012-06-28T04:09:41+0000

Use:

  //td[translate(substring(text()[1], string-length(text()[1]) - 9), 'COUNTRY :', 'country' ) = 'country' ] /span/text()

XSLT-based verification:

 <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output omit-xml-declaration="yes" indent="yes"/>

   <xsl:template match="/">
     <xsl:copy-of select=
       "//td[translate(substring(text()[1], string-length(text()[1]) - 9),
                       'COUNTRY :',
                       'country'
                       )
             = 'country'
            ]
              /span/text()
       "/>
   </xsl:template>
 </xsl:stylesheet>

When this transformation is applied to the provided XML document:

 <html>
   <table>
     <tr>
       <td> Name of the Country : <span> USA </span> </td>
     </tr>
     <tr>
       <td> Name of the country : <span> UK </span> </td>
     </tr>
   </table>
 </html>

the XPath expression is evaluated, and the two selected text nodes are copied to the output:

  USA UK
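The same expression can also be checked directly from Python, which is the setting of the question. A small sketch, assuming Python 3 and lxml; the variable names and the compacted inline copy of the document are illustrative:

```python
from lxml import etree

# Compacted copy of the XML document shown above.
XML = (
    "<html><table>"
    "<tr><td> Name of the Country : <span> USA </span> </td></tr>"
    "<tr><td> Name of the country : <span> UK </span> </td></tr>"
    "</table></html>"
)

# The XPath 1.0 expression from this answer, split across lines for readability.
EXPR = (
    "//td[translate(substring(text()[1], string-length(text()[1]) - 9),"
    " 'COUNTRY :', 'country') = 'country']"
    "/span/text()"
)

root = etree.fromstring(XML)
matches = [text.strip() for text in root.xpath(EXPR)]
print(matches)  # ['USA', 'UK']
```

No extension function is needed here, since translate(), substring() and string-length() are all part of XPath 1.0.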

Explanation

1. We use an XPath 1.0 expression that implements the standard XPath 2.0 function ends-with($text, $s):

 $s = substring($text, string-length($text) - string-length($s) + 1)

2. Next, the translate() function converts the trailing 10-character string to lowercase and removes any spaces and ":" characters.

3. If the result is the all-lowercase string "country", we select the child text nodes (only one in this case) of the span child of this td.
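The individual steps can be tried in isolation. The sketch below assumes Python 3 and lxml, and passes XPath variables ($text, $s) as keyword arguments; the names ENDS_WITH, NORMALISED_TAIL and the sample string are made up for illustration.

```python
from lxml import etree

# Step 1: XPath 1.0 stand-in for the XPath 2.0 function ends-with($text, $s).
ENDS_WITH = etree.XPath(
    "substring($text, string-length($text) - string-length($s) + 1) = $s"
)

# Step 2: take the last 10 characters, lower-case the letters of COUNTRY,
# and drop spaces and colons (they have no counterpart in the third argument).
NORMALISED_TAIL = etree.XPath(
    "translate(substring($text, string-length($text) - 9),"
    " 'COUNTRY :', 'country')"
)

# The expressions only use variables, so any element works as context node.
ctx = etree.Element("ctx")
sample = " Name of the Country : "

print(ENDS_WITH(ctx, text=sample, s="Country : "))  # True
print(ENDS_WITH(ctx, text=sample, s="country : "))  # False (case-sensitive)
print(NORMALISED_TAIL(ctx, text=sample))            # country
```

The second call shows why step 2 is needed: the plain ends-with test is case-sensitive, while the translated tail compares equal to "country" for either spelling.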
