Getting non-contiguous text using lxml / ElementTree

Suppose I have the following HTML, from which I need to select "text2" using lxml / ElementTree:

<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div> 

If I already have the div element as mydiv, then mydiv.text returns only "text1".

Using itertext() seems problematic, or cumbersome at best, since it walks the entire tree under the div.

Is there a simple / elegant way to extract a piece of text other than the first one from an element?

+4
4 answers

Well, lxml.etree provides full XPath support, which lets you address text nodes directly:

    >>> import lxml.etree
    >>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
    >>> div = lxml.etree.fromstring(fragment)
    >>> div.xpath('./text()')
    ['text1', 'text2', 'text3']
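To pick out only the second piece, a positional predicate on text() should work. This is a minimal sketch, assuming lxml is installed; the variable names are illustrative only.

```python
import lxml.etree

fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = lxml.etree.fromstring(fragment)

# XPath positions are 1-based, so text()[2] is the second direct text node.
second = div.xpath('./text()[2]')
print(second)
```

The result is still a list, so take second[0] if you need the bare string.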
+12

Such text is stored in the tail attributes of your element's children. If your element is elem, then:

 elem[0].tail 

gives you the tail of the first child element, which in your case is the "text2" you are looking for.
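The same idea, sketched with the standard library's xml.etree.ElementTree on the fragment from the question (elem and the other variable names are just for illustration):

```python
import xml.etree.ElementTree as ET

fragment = "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
elem = ET.fromstring(fragment)

first_text = elem.text        # text before the first child
second_text = elem[0].tail    # text right after the first <span>
third_text = elem[1].tail     # text right after the second <span>

print(first_text, second_text, third_text)
```

Note that tail is None when nothing follows a child, so check for that before using the value.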

+6

As llasram said, any text not in the text attribute will be in the tail attributes of the child nodes.

As an example, here is a simple way to extract all the text fragments (the first one and the rest) that sit directly in a node:

    import lxml.html  # ...or lxml.etree as appropriate

    html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
    div = lxml.html.fromstring(html)
    texts = [div.text] + [child.tail for child in div]

    # Result: texts == ['text1', 'text2', 'text3']
    # ...and you are guaranteed that div[x].tail == texts[x+1]
    # (which can be useful if you need to access or modify the DOM)

If you would rather sacrifice this relation so that texts does not pick up None or empty strings from missing tails, you can use this instead:

 texts = [div.text] + [child.tail for child in div if child.tail] 

I have not tested this with the plain old stdlib ElementTree, but it should work with that too. (Something that only occurred to me when I saw Shane Holloway's lxml-specific solution.) I just prefer lxml because it has better support for HTML idioms, and I usually already have it installed for lxml.html.clean.
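For what it's worth, here is a minimal sketch of the same approach using only the stdlib xml.etree.ElementTree. The helper name direct_texts is made up for this example, and it filters out None and empty pieces, including an empty leading text:

```python
import xml.etree.ElementTree as ET

def direct_texts(element):
    """Return the element's own text pieces, skipping None and empty ones."""
    pieces = [element.text] + [child.tail for child in element]
    return [piece for piece in pieces if piece]

div = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)
texts = direct_texts(div)
print(texts)
```

Because of the filtering, the div[x].tail == texts[x+1] relation no longer holds in general.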

+4

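Use node.text_content() (available on lxml.html elements) to get all the text below the node as a single string. Note that this includes the text of the child elements as well.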
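text_content() is specific to lxml.html. As a rough stdlib stand-in, joining itertext() should give a comparable result for this fragment. This is only a sketch of the behaviour, not lxml's implementation:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# itertext() walks the whole subtree in document order,
# so the text inside the <span> children is included too.
all_text = "".join(div.itertext())
print(all_text)
```

That makes it handy for grabbing everything at once, but not for isolating "text2" alone.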

+1
