lxml does not match tags with multiple classes correctly

I am trying to parse HTML using lxml:

    import lxml.html

    a = lxml.html.fromstring('<html><body><span class="cut cross">Text of double class</span><span class="cross">Text of single class</span></body></html>')
    s1 = a.xpath('.//span[@class="cross"]')
    s2 = a.xpath('.//span[@class="cut cross"]')
    s3 = a.xpath('.//span[@class="cut"]')

Output:

    s1 => [<Element span at 0x7f0a6807a530>]
    s2 => [<Element span at 0x7f0a6807a590>]
    s3 => []

The first span has the class 'cut', yet s3 is empty. Only when I give both classes, as in s2, does it return the tag.

+4
3 answers

To avoid the cut2 problem mentioned in Scharron's answer, pad the class attribute with a space on each side and search for the class name surrounded by spaces:

 a.xpath('.//span[contains(concat(" ", @class, " "), " cut ")]') 
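A slightly more defensive variant of the same idea, as a rough sketch: normalize-space() collapses tabs, newlines and repeated spaces in the attribute before padding, and the class name is passed as an XPath variable. The helper name spans_with_class and the extra cut2 span are made up for illustration.

```python
import lxml.html

doc = lxml.html.fromstring(
    '<html><body>'
    '<span class="cut cross">double</span>'
    '<span class="cross">single</span>'
    '<span class="cut2">lookalike</span>'
    '</body></html>'
)


def spans_with_class(root, name):
    """Return the <span> descendants whose class list contains `name`.

    Hypothetical helper, not part of lxml.
    """
    # $name is an XPath variable; lxml binds it from the keyword argument.
    expr = (
        './/span[contains(concat(" ", normalize-space(@class), " "), '
        'concat(" ", $name, " "))]'
    )
    return root.xpath(expr, name=name)


cut_texts = [s.text for s in spans_with_class(doc, "cut")]
cross_texts = [s.text for s in spans_with_class(doc, "cross")]
print(cut_texts)    # only the span that really has class "cut"
print(cross_texts)  # both spans that have class "cross"
```

Passing the name as a variable also avoids building the XPath string by hand when the class name comes from user input.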
+1

XPath's = operator compares the whole attribute string, so @class="cut" matches only when the attribute is exactly "cut". If you want to find elements that have one particular class among several, you can use the contains function:

 a.xpath('.//span[contains(@class, "cut")]') 

However, this is a plain substring test, so it also matches classes that merely contain the text, such as cut2.
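To make the false positives concrete, here is a small sketch; the cut2 and shortcut spans are invented test data, not part of the question.

```python
import lxml.html

doc = lxml.html.fromstring(
    '<div>'
    '<span class="cut cross">wanted</span>'
    '<span class="cut2">not wanted</span>'
    '<span class="shortcut">not wanted either</span>'
    '</div>'
)

# contains() only checks that "cut" occurs somewhere in the attribute string.
loose = [s.get("class") for s in doc.xpath('.//span[contains(@class, "cut")]')]
print(loose)
```

All three spans come back, because each class attribute contains the substring "cut".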

Alternatively, cssselect is a library that translates CSS selectors into XPath, and pyquery is a wrapper around lxml that mimics the jQuery API in Python.

+7

I am fairly sure that XPath queries do not respect the CSS data model (i.e. classes being space-separated values within a single class attribute); XPath treats class as a plain string. To do what you want, you should look at using CSS selectors (e.g. via cssselect).

+2
