How to get full XML or HTML content of an element using ElementTree?

That is, all of its text and subelements, without the tag of the element itself?

Having

<p>blah <b>bleh</b> blih</p> 

I want

 blah <b>bleh</b> blih 

element.text returns "blah " and etree.tostring(element) returns:

 <p>blah <b>bleh</b> blih</p> 
+6
python xml api elementtree
6 answers

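This is the solution I ended up using:

    import xml.etree.ElementTree as etree

    def element_to_string(element):
        # Text before the first child; None when there is none.
        s = element.text or ""
        for sub_element in element:
            # tostring() includes the child's tail; encoding="unicode"
            # makes it return str instead of bytes on Python 3.
            s += etree.tostring(sub_element, encoding="unicode")
        # element.tail is left out: it is the text after the element's
        # own closing tag, not part of its content, and may be None.
        return s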
+5

ElementTree works fine; you just have to assemble the answer yourself. Something like this:

    # Iterating over t replaces getchildren(), which was removed in Python 3.9.
    "".join([t.text or ""] + [xml.tostring(e, encoding="unicode") for e in t])

Thanks to JV and PEZ for pointing out the errors.


Edit:

    >>> import xml.etree.ElementTree as xml
    >>> s = '<p>blah <b>bleh</b> blih</p>\n'
    >>> t = xml.fromstring(s)
    >>> "".join([t.text] + [xml.tostring(e, encoding="unicode") for e in t])
    'blah <b>bleh</b> blih'
    >>>

The tail needs no separate handling: tostring() already includes each child's tail.
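
On Python 3, where tostring() returns bytes unless encoding="unicode" is passed and getchildren() is gone (removed in 3.9), the same idea might look like the sketch below. The name inner_xml is made up for illustration, and escaping the leading text is an added assumption: element.text comes back unescaped, so a literal & or < would otherwise produce invalid markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_xml(element):
    """Return the text and serialized children of element, without its own tag."""
    # Leading text is stored unescaped, so re-escape it for the output.
    parts = [escape(element.text or "")]
    # tostring() on each child already includes that child's tail text.
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


p = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
print(inner_xml(p))  # blah <b>bleh</b> blih
```

The tail of the element itself is deliberately not appended, since it is the text that follows the closing tag.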

+11

These are good answers to the OP's question, especially if the question is limited to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.

To mimic the DOM's getTextContent(), you need a (very) simple recursive function.

To get only bare text:

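    def get_deep_text(element):
        # Text before the first child, then each child's text recursively.
        text = element.text or ''
        for subelement in element:
            text += get_deep_text(subelement)
        # Text that follows this element's closing tag.
        text += element.tail or ''
        return text

    print(get_deep_text(element_of_interest))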
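
If a hand-written recursion is not required, the standard library's Element.itertext() yields the same pieces of text in document order. This is a minimal sketch assuming plain xml.etree.ElementTree; text_content is a made-up name, and unlike get_deep_text it does not include the tail of the starting element.

```python
import xml.etree.ElementTree as ET


def text_content(element):
    """Concatenate all text inside element, in document order."""
    # itertext() yields element.text, then each descendant's text and tail.
    return "".join(element.itertext())


p = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
print(text_content(p))  # blah bleh blih
```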

To get detailed information about the boundaries between the source text:

    import itertools

    def get_deep_text_w_boundaries(element, depth=0, counter=None):
        # A shared counter numbers the elements in document order. It replaces
        # the original attribute set on the root element, which the
        # C-accelerated Element class in CPython may reject.
        if counter is None:
            counter = itertools.count(1)
        element_no = next(counter)
        indent = depth * ' '
        print('%s(el %d - attribs: %s)' % (indent, element_no, element.attrib))
        print('%s(el %d - text: |%s|)' % (indent, element_no, element.text or ''))
        for subelement in element:
            get_deep_text_w_boundaries(subelement, depth + 1, counter)
        print('%s(el %d - tail: |%s|)' % (indent, element_no, element.tail or ''))

    get_deep_text_w_boundaries(root_el_of_interest)

Example output from a single paragraph in a LibreOffice Writer document (.fodt file):

    (el 1 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'Standard'})
    (el 1 - text: |Ci-après individuellement la "|)
    (el 2 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
    (el 2 - text: |Partie|)
    (el 2 - tail: |" et ensemble les "|)
    (el 3 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
    (el 3 - text: |Parties|)
    (el 3 - tail: |", |)
    (el 1 - tail: | |)

One point about messiness is that there is no hard and fast rule about when a change of text style indicates a word boundary and when it does not: a superscript immediately after a word (with no space) means a single word in every case I can imagine. On the other hand, you sometimes find a document in which the first letter of a word is bold for some reason, or uses a different style to render it as upper case rather than simply using the normal upper-case character.

And, of course, the less English-centric the text is, the more complicated this discussion becomes!

+3

I doubt ElementTree is the right tool for this. But, assuming you have good reasons for using it, you could try stripping the root tag from the serialized fragment:

    import re
    import xml.etree.ElementTree as ElementTree

    # Remove the opening tag at the start and the closing tag at the end.
    re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (element.tag, element.tag), '',
           ElementTree.tostring(element, encoding='unicode'))
+1

Most of the answers here rely on ElementTree's XML parsing; even PEZ's regex-based answer still partially depends on ElementTree.

All of that is fine and suitable for most use cases but, for completeness, it is worth noting that ElementTree.tostring(...) gives you a fragment that is equivalent to the original payload, not always identical to it. If for some very rare reason you need to extract the content exactly as is, you need a pure regex-based solution. This example is how I use a regex-based solution.
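
To illustrate the difference: ElementTree re-serializes the parsed tree, so details such as attribute quoting and the spelling of empty elements can change. The sketch below compares that with a naive regex applied to the raw string. raw_inner is a hypothetical helper, not the code this answer refers to, and it assumes the input contains a single, non-nested element with the given tag.

```python
import re
import xml.etree.ElementTree as ET

raw = "<p>blah <b class='x'>bleh</b><br></br> blih</p>"

# Round-tripping through the parser normalizes the markup.
roundtrip = ET.tostring(ET.fromstring(raw), encoding="unicode")
print(roundtrip)  # <p>blah <b class="x">bleh</b><br /> blih</p>


def raw_inner(source, tag):
    """Return the text between the first <tag ...> and the last </tag>, untouched."""
    name = re.escape(tag)
    pattern = r"<%s\b[^>]*>(.*)</%s\s*>" % (name, name)
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1) if match else None


print(raw_inner(raw, "p"))  # blah <b class='x'>bleh</b><br></br> blih
```

The regex keeps the payload byte for byte, at the cost of breaking on nested or repeated elements with the same tag.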

0

I don't know whether an external library is an option, but in any case, provided there is only one <p> with this text on the page, the jQuery solution would be:

 alert($('p').html()); // returns blah <b>bleh</b> blih 
-3
