These are all good answers that address the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.
To mimic the DOM's getTextContent(), you need a (very) simple recursive mechanism.
To get just the bare text:
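
    def get_deep_text(element):
        # Text that appears before the first child element.
        text = element.text or ''
        # Recurse into each child; its own text and tail are handled there.
        for subelement in element:
            text += get_deep_text(subelement)
        # Text that follows this element's closing tag.
        # Note: this includes the tail of the top-level element as well.
        text += element.tail or ''
        return text

    print(get_deep_text(element_of_interest))
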
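As a quick sanity check, here is a minimal sketch assuming the standard library's `xml.etree.ElementTree`; the `<doc>`/`<p>`/`<span>` markup is made up purely for illustration. It repeats the recursion above and compares the result with `Element.itertext()`, which collects the same inner text but leaves out the tail of the element you call it on.

```python
import xml.etree.ElementTree as ET

def get_deep_text(element):
    # Same recursion as above: own text, each child's deep text, then own tail.
    text = element.text or ''
    for subelement in element:
        text += get_deep_text(subelement)
    text += element.tail or ''
    return text

# Hypothetical sample markup, for illustration only.
doc = ET.fromstring(
    '<doc><p>one <span>two <b>three</b> four</span> five</p> six</doc>'
)
para = doc.find('p')

deep = get_deep_text(para)          # includes the tail of <p> itself
inner = ''.join(para.itertext())    # stops at the closing </p>

print(repr(deep))   # 'one two three four five six'
print(repr(inner))  # 'one two three four five'
```

If the trailing text after the element of interest is not wanted, `''.join(element.itertext())` may be all that is needed.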
To get full details about the boundaries between the pieces of raw text:

    import itertools

    def get_deep_text_w_boundaries(element, depth=0, counter=None):
        # One shared counter numbers the elements in document order
        # across the whole recursion.
        if counter is None:
            counter = itertools.count(1)
        element_no = next(counter)
        indent = depth * ' '
        # Attributes and leading text are printed before descending.
        text1 = '%s(el %d - attribs: %s)\n' % (indent, element_no, element.attrib)
        text1 += '%s(el %d - text: |%s|)' % (indent, element_no, element.text or '')
        print(text1)
        for subelement in element:
            get_deep_text_w_boundaries(subelement, depth + 1, counter)
        # The tail is printed once all children have been visited.
        text2 = '%s(el %d - tail: |%s|)' % (indent, element_no, element.tail or '')
        print(text2)

    get_deep_text_w_boundaries(root_el_of_interest)

Example output from a single paragraph in a LibreOffice Writer document (.fodt file):

    (el 1 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'Standard'})
    (el 1 - text: |Ci-après individuellement la "|)
     (el 2 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
     (el 2 - text: |Partie|)
     (el 2 - tail: |" et ensemble les "|)
     (el 3 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
     (el 3 - text: |Parties|)
     (el 3 - tail: |", |)
    (el 1 - tail: | |)

Part of the messiness is that there is no hard and fast rule about when a change of text style marks a word boundary and when it does not. A superscript immediately after a word (with no whitespace in between) is a separate word in every use case I can imagine. OTOH you will sometimes find, for example, a document in which the first letter of a word is bold for some reason, or is given a different style that renders it as upper case instead of simply using the normal UC character.
And of course, the less English-centric the discussion gets, the greater the subtleties and complications!
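To make the word-boundary problem concrete, here is a small sketch, again assuming the standard library's `xml.etree.ElementTree`. The two snippets and the helper `text_segments` are hypothetical, invented for illustration. It splits each paragraph into its raw text segments and shows that neither joining rule (concatenate directly, or put a space at every element boundary) gives the right words for both inputs.

```python
import xml.etree.ElementTree as ET

def text_segments(element):
    """Return the non-empty text/tail pieces inside element, in document order."""
    # itertext() yields every text and tail string inside the element
    # (but not the element's own tail).
    return [piece for piece in element.itertext() if piece]

# Hypothetical inputs: a superscript footnote marker, and a styled initial letter.
footnote = ET.fromstring('<p>see the appendix<sup>3</sup> for details</p>')
initial = ET.fromstring('<p><b>O</b>nce upon a time</p>')

for para in (footnote, initial):
    segments = text_segments(para)
    glued = ''.join(segments)                       # no boundary at style changes
    spaced = ' '.join(' '.join(segments).split())   # boundary at every style change
    print(segments)
    print('  glued: ', repr(glued))
    print('  spaced:', repr(spaced))
```

Gluing yields `appendix3` for the footnote, while spacing yields `O nce` for the styled initial, so the choice has to depend on what the style actually means in the document at hand.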