Getting non-contiguous text using lxml / ElementTree

Question

Suppose I have the following HTML, from which I need to select "text2" using lxml / ElementTree:

<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>

If I already have the div element as, say, mydiv, then mydiv.text returns only "text1".

Using itertext() seems problematic, or cumbersome at best, as it walks the entire tree under the div.

Is there a simple / elegant way to extract a piece of text other than the first one from an element?

+4

python html-parsing lxml elementtree

Gj. Sep 10 '10 at 10:51

4 answers

Such text will be in the tail attributes of the children of your element. If your element is in elem, then:

 elem[0].tail

would give you the tail of the first child element, which in your case is the "text2" you are looking for.
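As a minimal sketch of the text/tail layout described above, the snippet below parses the question's fragment with the standard library's xml.etree.ElementTree (an assumption on my part; the same attribute names exist on lxml elements). The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

fragment = "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
elem = ET.fromstring(fragment)

first_text = elem.text       # text before the first child element
second_text = elem[0].tail   # text directly after the first <span>
third_text = elem[1].tail    # text directly after the second <span>

print(first_text, second_text, third_text)
```

Each piece of text that follows a child is stored on that child, not on the parent, which is why mydiv.text alone only ever sees the first piece.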

+6

llasram Sep 10 '10 at 10:58

As llasram said, any text not in the text attribute will be in the tail attributes of the child nodes.

As an example, here is the simplest way to extract all text fragments (the first and the others) of a node:

 html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
 import lxml.html  # ...or lxml.etree as appropriate
 div = lxml.html.fromstring(html)
 texts = [div.text] + [child.tail for child in div]
 # Result: texts == ['text1', 'text2', 'text3']
 # ...and you are guaranteed that div[x].tail == texts[x+1]
 # (which can be useful if you need to access or modify the DOM)

If you prefer to sacrifice this relation so that texts cannot contain None or empty entries from children without a tail, you can use this instead:

 texts = [div.text] + [child.tail for child in div if child.tail]

I have not tested this with the plain old stdlib ElementTree, but it should work with that too. (That only occurred to me once I saw Shane Holloway's lxml-specific solution.) I just prefer lxml because it handles HTML's idiosyncrasies best, and I usually already have it installed for lxml.html.clean.
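For what it's worth, here is a sketch of the same idea against the stdlib xml.etree.ElementTree, which the answer says was untested. The helper name direct_texts is hypothetical, and unlike the one-liner above it also filters div.text, which is None when the element starts with a child.

```python
import xml.etree.ElementTree as ET

def direct_texts(elem):
    """Return the element's own text fragments, skipping None and empty strings."""
    candidates = [elem.text] + [child.tail for child in elem]
    return [t for t in candidates if t]

div = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)
texts = direct_texts(div)

# An element with no text of its own: .text and every .tail are None.
bare = ET.fromstring("<div><span>a</span><span>b</span></div>")
bare_texts = direct_texts(bare)

print(texts)
print(bare_texts)
```

Note that filtering breaks the div[x].tail == texts[x+1] correspondence, exactly as described above, so this form suits read-only extraction.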

+4

ssokolow Sep 19 '10 at 19:37

Use node.text_content() to get all the text below the node as a single string.
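To show what "all the text below the node" means for the question's fragment, here is a small sketch. text_content() exists only on lxml.html elements, so as a stand-in (my assumption, not this answer's method) it joins the stdlib's itertext(), which yields the same pieces for this input.

```python
import xml.etree.ElementTree as ET

fragment = "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
div = ET.fromstring(fragment)

# itertext() walks the whole subtree in document order,
# so the text inside the <span> children is included too.
all_text = "".join(div.itertext())

print(all_text)
```

Because the child text is mixed in, this approach cannot isolate "text2" by itself; it fits cases where the full flattened text is wanted.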

+1

spiralx Oct 30 '12 at 7:39

Shane Holloway · Accepted Answer · 2010-09-23T21:45:30+0000

Well, lxml.etree provides full XPath support, which lets you address text nodes directly:

 >>> import lxml.etree
 >>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
 >>> div = lxml.etree.fromstring(fragment)
 >>> div.xpath('./text()')
 ['text1', 'text2', 'text3']
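Building on the XPath idea, a positional predicate can pick out just "text2". This is a sketch that assumes lxml is installed; the two expressions are my own illustrations of standard XPath 1.0 syntax, not part of the original answer.

```python
import lxml.etree

fragment = "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
div = lxml.etree.fromstring(fragment)

# Second text node that is a direct child of the div.
second = div.xpath("./text()[2]")

# Alternatively: the first text node following the first <span>.
after_first_span = div.xpath("./span[1]/following-sibling::text()[1]")

print(second, after_first_span)
```

Both calls return a list, so index it (or check that it is non-empty) before using the value.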
