As llasram said, any text that is not in a node's text attribute will be in the tail attributes of its child nodes.
As an example, here is the simplest way to extract all of the text fragments (the first one and the rest) that sit directly inside a node:
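import lxml.html  # or lxml.etree, as appropriate
html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = lxml.html.fromstring(html)
texts = [div.text] + [child.tail for child in div]  # ['text1', 'text2', 'text3']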
With that form, texts[i + 1] always corresponds to div[i].tail. If you prefer to sacrifice this relation so that texts never contains None or empty entries, you can use this instead:
texts = [div.text] + [child.tail for child in div if child.tail]
I have not tested this with the plain old stdlib ElementTree, but it should work with that too. (That only occurred to me once I saw Shane Holloway's lxml-specific solution.) I just prefer lxml because it has the best support for HTML's idiosyncrasies, and I usually already have it installed for lxml.html.clean.
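For what it is worth, here is a minimal sketch of how the two list comprehensions might look with the stdlib xml.etree.ElementTree. It assumes the input is well-formed XML (the stdlib parser is not an HTML parser), and the names markup, texts_aligned and texts_compact are made up for this illustration. The second span deliberately has no trailing text, so the difference between the two variants shows up.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed XML input. The second <span> has no trailing text,
# so its .tail is None.
markup = '<div>text1<span>childtext1</span>text2<span>childtext2</span></div>'
div = ET.fromstring(markup)

# Variant 1: one entry per child, so texts_aligned[i + 1] is div[i].tail.
texts_aligned = [div.text] + [child.tail for child in div]

# Variant 2: skips children whose tail is None or empty, losing that alignment.
texts_compact = [div.text] + [child.tail for child in div if child.tail]

print(texts_aligned)  # ['text1', 'text2', None]
print(texts_compact)  # ['text1', 'text2']
```

The first variant is the one to reach for if you later need to map a fragment back to the child it follows; the second is more convenient if you only want the text.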