lxml does not handle tags with multiple classes correctly

Question

I am trying to parse HTML using lxml:

 import lxml.html

 a = lxml.html.fromstring('<html><body><span class="cut cross">Text of double class</span><span class="cross">Text of single class</span></body></html>')
 s1 = a.xpath('.//span[@class="cross"]')
 s2 = a.xpath('.//span[@class="cut cross"]')
 s3 = a.xpath('.//span[@class="cut"]')

Output:

 s1 => [<Element span at 0x7f0a6807a530>]
 s2 => [<Element span at 0x7f0a6807a590>]
 s3 => []

The first span tag has the class 'cut', yet s3 is empty. In s2, where I give both classes, the tag is returned.

+4

python lxml

WeaklyTyped Jan 21 '13 at 15:45

3 answers

XPath's equality operator compares the whole attribute value exactly, so @class="cut" matches only an element whose class attribute is exactly "cut". To match one class among several, you can use the contains function:

 a.xpath('.//span[contains(@class, "cut")]')

However, this also matches a class such as cut2, because contains is a plain substring test.
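To make the substring pitfall concrete, here is a minimal sketch. The markup is hypothetical (the cut2 span is not in the question's document), and doc is just an illustrative name:

```python
import lxml.html

# Hypothetical markup: the second span has the unrelated class "cut2".
doc = lxml.html.fromstring(
    '<html><body>'
    '<span class="cut cross">double</span>'
    '<span class="cut2">other</span>'
    '</body></html>'
)

# contains() is a substring test on the raw attribute value,
# so "cut2" is matched as well as "cut cross".
matches = doc.xpath('.//span[contains(@class, "cut")]')
texts = [el.text for el in matches]
print(texts)
```

Both spans come back, which is usually not what you want when looking for the class cut.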

cssselect is a library that translates CSS selectors into XPath. pyquery is a wrapper that mimics the jQuery library in Python.

+7

Scharron Jan 21 '13 at 15:54

I am sure that XPath queries do not respect the CSS data model (i.e. classes as space-separated values within a single class attribute). To do what you want, look at using CSS selectors (e.g. via cssselect).

+2

djc Jan 21 '13 at 15:49

Drover · Accepted Answer · 2013-01-21T20:27:20+0000

To avoid the cut2 issue Scharron mentions, you can pad the class attribute with a space at the front and at the end, then search for the class name surrounded by spaces.

 a.xpath('.//span[contains(concat(" ", @class, " "), " cut ")]')
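As a quick check of the padded expression, here is a small sketch. The helper spans_with_class and the cut2 span are hypothetical additions for illustration. The class name is passed as an XPath variable ($name), which lxml's xpath() accepts as a keyword argument:

```python
import lxml.html

# The question's markup plus a hypothetical "cut2" span.
doc = lxml.html.fromstring(
    '<html><body>'
    '<span class="cut cross">double</span>'
    '<span class="cross">single</span>'
    '<span class="cut2">other</span>'
    '</body></html>'
)

def spans_with_class(root, name):
    """Return the text of spans whose class list contains `name` (illustrative helper)."""
    # Pad both the attribute and the class name with spaces so that
    # only whole, space-delimited class names can match.
    expr = './/span[contains(concat(" ", @class, " "), concat(" ", $name, " "))]'
    return [el.text for el in root.xpath(expr, name=name)]

print(spans_with_class(doc, 'cut'))
print(spans_with_class(doc, 'cross'))
print(spans_with_class(doc, 'cut2'))
```

With the padding, cut no longer matches the cut2 span. If the class attribute might be separated by tabs or newlines rather than single spaces, wrapping @class in normalize-space() is a possible refinement.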
