Extract text using lxml.html

Question

I have an HTML file:

<html> <p>somestr <sup>1</sup> anotherstr </p> </html>

I would like to extract the text as:

somestr ¹ anotherstr

but I can’t figure out how to do this. I wrote a to_sup() function that converts numeric strings to superscript, so the closest I get is something like:

    for i in doc.xpath('.//p/text()|.//sup/text()'):
        if i.tag == 'sup':
            print to_sup(i),
        else:
            print i,

but _ElementStringResult doesn't seem to have a method to get the tag name, so I'm a bit lost. Any ideas on how to solve this?

+6

python lxml

root Dec 17 '12 at 10:38

2 answers

First solution: concatenate the text without a separator (see also "python [lxml] - clearing html tags"):

    import lxml.html

    document = lxml.html.document_fromstring(html_string)
    # internally does: etree.XPath("string()")(document)
    print(document.text_content())

This second one is what helped me, because it joins the text nodes with a separator, which is what I needed:

    from lxml import etree

    # join every text node in the document with a newline
    print("\n".join(etree.XPath("//text()")(document)))
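To see how the two snippets above differ on the HTML from the question, here is a small sketch. It also shows that the strings returned by //text() are lxml "smart strings": they carry getparent(), is_text and is_tail, which is one way to recover the tag name the question asks about. The variable names (joined, pieces, sup_pieces) are only illustrative.

```python
import lxml.html

html_string = "<html> <p>somestr <sup>1</sup> anotherstr </p> </html>"
document = lxml.html.document_fromstring(html_string)

# text_content() concatenates every text node, with no separator.
joined = document.text_content()

# //text() returns the individual text nodes in document order.
pieces = document.xpath("//text()")

# Each result remembers the element it belongs to. is_text is True for
# an element's own text; tail text (the part after </sup>) also reports
# <sup> as its parent, so it is filtered out with is_text here.
sup_pieces = [t for t in pieces if t.is_text and t.getparent().tag == "sup"]

print(repr(joined))
print(sup_pieces)
```

Joining pieces with an empty string gives the same result as text_content(); joining with "\n" gives the separated output from the second snippet.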

+7

Robert Lujo May 29 '14 at 8:48

Fred foo · Accepted Answer · 2012-12-17T10:43:27+0000

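Just don't call text() on the sup nodes in the XPath expression. That way the sup entries come back as elements, which have a .text attribute, while the p text comes back as plain strings, which don't: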

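    for x in doc.xpath("//p/text()|//sup"):
        try:
            # <sup> element: convert its text to superscript
            print(to_sup(x.text))
        except AttributeError:
            # plain text node from <p>: it has no .text attribute
            print(x)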
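For anyone who wants to run this end to end, here is a minimal sketch. The to_sup() below is a hypothetical stand-in, since the asker's helper was never shown, and extract() is a name made up for illustration. It uses an isinstance check instead of try/except, which is a stylistic swap rather than a change to the idea.

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(); maps ASCII digits
# to their Unicode superscript forms.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(s):
    return s.translate(SUPERSCRIPT_DIGITS)


def extract(html_string):
    doc = lxml.html.fromstring(html_string)
    parts = []
    # Text nodes of <p> come back as strings; <sup> comes back as an element.
    for x in doc.xpath("//p/text()|//sup"):
        if isinstance(x, str):
            parts.append(x)
        else:
            # x.text is None for an empty <sup></sup>
            parts.append(to_sup(x.text or ""))
    return "".join(parts)


print(extract("<html> <p>somestr <sup>1</sup> anotherstr </p> </html>").strip())
```

Collecting the pieces and joining them once keeps the original spacing from the HTML, rather than having print insert its own separators.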
