Extract text using lxml.html

I have an HTML file:

<html> <p>somestr <sup>1</sup> anotherstr </p> </html> 

I would like to extract the text as:

somestr 1 anotherstr

but I can’t figure out how to do this. I wrote a to_sup() function that converts numeric strings to superscript, so the closest I get is something like:

    for i in doc.xpath('.//p/text()|.//sup/text()'):
        if i.tag == 'sup':
            print to_sup(i),
        else:
            print i,

but _ElementStringResult doesn't seem to have a method to get the tag name, so I'm a bit lost. Any ideas how to solve this?

2 answers

Just don't call text() on sup nodes in XPath.

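    for x in doc.xpath("//p/text()|//sup"):
        try:
            # <sup> comes back as an element, so it has a .text attribute
            print(to_sup(x.text))
        except AttributeError:
            # plain text nodes are strings and have no .text
            print(x)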
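For reference, here is a self-contained sketch of that idea run against the HTML from the question. The `to_sup()` below is only a hypothetical stand-in for the asker's function (it maps ASCII digits to Unicode superscripts), and the helper names `extract_by_element` and `extract_by_parent` are made up for illustration. The second helper keeps the asker's original XPath and relies on the `getparent()` method and `is_text` flag that lxml attaches to the strings returned by `text()`.

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): digits -> Unicode superscripts.
_SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def to_sup(s):
    return s.translate(_SUPERSCRIPTS)


def extract_by_element(html_string):
    """Select <sup> as an element and everything else as text nodes."""
    doc = lxml.html.fromstring(html_string)
    parts = []
    for node in doc.xpath("//p/text()|//sup"):
        if isinstance(node, str):
            # text node of <p> (including the text that follows <sup>)
            parts.append(node)
        else:
            # <sup> element: convert its text
            parts.append(to_sup(node.text or ""))
    return "".join(parts).strip()


def extract_by_parent(html_string):
    """Keep the original XPath and ask each string which element it came from."""
    doc = lxml.html.fromstring(html_string)
    parts = []
    for text in doc.xpath(".//p/text()|.//sup/text()"):
        # is_text is True for an element's own text, False for tail text
        inside_sup = text.is_text and text.getparent().tag == "sup"
        parts.append(to_sup(text) if inside_sup else text)
    return "".join(parts).strip()


html_string = "<html> <p>somestr <sup>1</sup> anotherstr </p> </html>"
result = extract_by_element(html_string)
print(result)
```

Checking the type with `isinstance` instead of catching `AttributeError` keeps an unrelated error inside `to_sup()` from being silently swallowed.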

First solution (concatenates the text without a separator; see also "python [lxml] - clearing html tags"):

    import lxml.html

    document = lxml.html.document_fromstring(html_string)
    # internally does: etree.XPath("string()")(document)
    print(document.text_content())

This one helped me, because it joins the text nodes with the separator I needed:

    from lxml import etree

    # //text() returns every text node as a separate string
    print("\n".join(etree.XPath("//text()")(document)))
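To compare the two approaches on the HTML from the question, a minimal sketch follows. The variable names are illustrative, and the whitespace-collapsing step at the end is an addition of mine rather than part of either snippet above; it is one way to get the single-line form the question asks for.

```python
import lxml.html

html_string = "<html> <p>somestr <sup>1</sup> anotherstr </p> </html>"
document = lxml.html.document_fromstring(html_string)

# Approach 1: text_content() concatenates all text nodes with no separator.
joined = document.text_content()

# Approach 2: //text() yields the individual text nodes, so the caller
# chooses the separator (or post-processes each piece).
pieces = document.xpath("//text()")

# Collapse runs of whitespace so the result fits on one line.
normalized = " ".join(joined.split())
print(normalized)
```

Neither approach knows about `<sup>`, so the superscript conversion from the question still needs a per-node check.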