Python element tree - extract text from an element, stripping tags

Question

With ElementTree in Python, how can I extract all the text from a node by removing any tags in that element and keeping only the text?

For example, let's say that I have the following:

<tag>
  Some <a>example</a> text
</tag>

I want to return Some example text. How can I do it? Until now, the approaches that I have taken have had rather disastrous consequences.

+4

python xml-parsing elementtree

Trent bing Oct 14 '13 at 21:53

2 answers

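To get only the text, without any of the intermediate tags, you have to recursively concatenate all the text and tail attributes in the correct order.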
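As a small illustration of where ElementTree keeps each piece of text, the sketch below parses the markup from the question and prints the text and tail attributes involved. The variable names root and child are only illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<tag>\n  Some <a>example</a> text\n</tag>")
child = root[0]  # the <a> element

# Text before the first child is stored on the parent's .text
print(repr(root.text))   # '\n  Some '
# Text inside <a> is stored on the child's .text
print(repr(child.text))  # 'example'
# Text after </a> is stored on the child's .tail, not on the parent
print(repr(child.tail))  # ' text\n'
# Nothing follows </tag>, so the root has no tail
print(repr(root.tail))   # None
```

Concatenating these in document order (root.text, child.text, child.tail) gives the full text of the element.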

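However, recent enough versions (including the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current releases of both ElementTree and lxml on PyPI) can do this for you with tostring: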

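>>> from xml.etree import ElementTree
>>> s = '''<tag>
...   Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> # Pass the parsed element t, not the source string s.
>>> # encoding='unicode' returns str on Python 3; omit it on Python 2.7.
>>> ElementTree.tostring(t, method='text', encoding='unicode')
'\n  Some example text\n'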

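If you also want to strip the surrounding whitespace, you have to do that yourself. In your simple case, that's easy: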

>>> ElementTree.tostring(t, method='text', encoding='unicode').strip()
'Some example text'

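In more complex cases, where you want to strip whitespace inside intermediate tags, you will probably have to fall back on recursively processing the text and tail attributes. That isn't too hard; you just have to remember that either attribute may be None. For example, here's a skeleton you can hang your own code on: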

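def textify(t):
    """Return all text in t and its descendants, followed by t's tail."""
    s = []
    if t.text:
        s.append(t.text)
    # Iterate the element directly; getchildren() was removed in Python 3.9.
    for child in t:
        # textify() returns a str, so append it rather than extend with its characters.
        s.append(textify(child))
    if t.tail:
        s.append(t.tail)
    return ''.join(s)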

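This version only works when text and tail are guaranteed to be str or None. For trees that you build manually, that isn't guaranteed.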
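For the whitespace case in particular, one possible approach is to join every text fragment and then collapse runs of whitespace. This is only a sketch: normalized_text is a hypothetical helper name, and it assumes that collapsing all whitespace to single spaces is acceptable for your data.

```python
import xml.etree.ElementTree as ET

def normalized_text(elem):
    """Join all text under elem and collapse whitespace runs to single spaces."""
    raw = "".join(elem.itertext())
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace.
    return " ".join(raw.split())

root = ET.fromstring("<tag>\n  Some <a>\n example </a> text\n</tag>")
print(normalized_text(root))  # Some example text
```

Note that this does not insert spaces where the markup has none, so `foo<b>bar</b>` still comes out as `foobar`.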

+2

abarnert Oct 14 '13 at 21:59

Benjamin Toueg · Accepted Answer · 2013-10-14T22:07:32+0000

If you are running Python 3.2+, you can use itertext.

itertext creates a text iterator that loops over this element and all its subelements in document order, yielding all the inner text:

>>> import xml.etree.ElementTree as ET
>>> xml = '<tag>Some <a>example</a> text</tag>'
>>> tree = ET.fromstring(xml)
>>> print(''.join(tree.itertext()))
Some example text
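To see what itertext actually yields before the fragments are joined, the following sketch lists them. It also shows that, when called on a subelement, itertext covers only that element's own content and not its tail. The names fragments and child_fragments are illustrative.

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring("<tag>Some <a>example</a> text</tag>")

# One fragment per text or tail attribute, in document order
fragments = list(tree.itertext())
print(fragments)  # ['Some ', 'example', ' text']

# Called on the child, itertext() covers only that element's content;
# the child's tail (' text') is not included.
child_fragments = list(tree[0].itertext())
print(child_fragments)  # ['example']
```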

On older versions of Python, such as 2.7, you can use your own implementation of itertext:

>>> import xml.etree.ElementTree as ET
>>> xml = '<tag>Some <a>example</a> text</tag>'
>>> tree = ET.fromstring(xml)
>>> def itertext(self):
...     tag = self.tag
...     if not isinstance(tag, str) and tag is not None:
...         return
...     if self.text:
...         yield self.text
...     for e in self:
...         for s in itertext(e):
...             yield s
...         if e.tail:
...             yield e.tail
... 
>>> print(''.join(itertext(tree)))
Some example text
