Python strip XML tags from document

Question


I am trying to remove XML tags from a document using Python, a language I am just starting out in. Here is my first attempt using regex, although I was really hoping for a better idea.

    mfile = file("somefile.xml","w")
    for line in mfile:
        re.sub('<./>',"",line) #trying to match elements between < and />

It failed terribly. I would like to know how this should be done with regex.

Secondly, I googled and found: http://code.activestate.com/recipes/440481-strips-xmlhtml-tags-from-string/

which seems to work. But I would like to know if there is an easier way to get rid of all XML tags. Perhaps using ElementTree?

+7

python xml regex

user485498 Oct 10 '12 at 15:57


3 answers

The most reliable way to do this is probably with lxml.

    from lxml import etree
    ...
    tree = etree.parse('somefile.xml')
    notags = etree.tostring(tree, encoding='utf8', method='text')
    print(notags)

This avoids the problems of "parsing" XML with regular expressions, and it should handle escaping and similar details correctly.

+19

Jeremiah Oct 10 '12 at 16:23


An alternative to Jeremiah's answer that does not require the external lxml library:

    import xml.etree.ElementTree as ET
    ...
    tree = ET.fromstring(Text)
    notags = ET.tostring(tree, encoding='utf8', method='text')
    print(notags)

Should work with any Python >= 2.5.
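As a rough illustration of the same idea on an in-memory string, here is a small sketch that collects the text with itertext() instead of tostring(). The sample document and the helper name strip_tags are made up for this example and are not part of the answers above; it assumes Python 3 and well-formed input.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_text):
    """Return the text content of a well-formed XML string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text and tail fragment in document order;
    # the parser has already decoded entities such as &amp; by this point.
    return "".join(root.itertext())


# Made-up sample document, for illustration only.
sample = "<doc><title>Fish &amp; chips</title><p>Hello <b>world</b>!</p></doc>"
print(strip_tags(sample))
```

Note that the text of adjacent elements is joined with no separator, so you may want to join on a space or newline depending on the document.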

+9

gaborous Sep 03 '13 at 11:16


defuz · Accepted Answer · 2012-10-10T15:59:06+0000

Please note that doing this with regular expressions is usually not a good idea. See Jeremiah's answer.

Try the following:

    import re
    text = re.sub('<[^<]+>', "", open("/path/to/file").read())
    with open("/path/to/file", "w") as f:
        f.write(text)
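To make the caveat about regular expressions concrete, here is a minimal sketch that applies the same pattern to two made-up strings. The function name strip_tags_regex and both inputs are hypothetical, chosen only to show where the pattern falls short; it assumes Python 3.

```python
import re

# Same pattern as in the answer above: "<", then one or more non-"<" characters, then ">".
TAG_RE = re.compile(r"<[^<]+>")


def strip_tags_regex(text):
    """Remove anything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub("", text)


# Entities are left undecoded, because the regex knows nothing about XML escaping.
escaped = strip_tags_regex("<p>Fish &amp; chips</p>")

# A CDATA section containing "<" confuses the pattern and leaves markup behind.
cdata = strip_tags_regex("<p><![CDATA[1 < 2]]></p>")

print(repr(escaped))
print(repr(cdata))
```

In the first case the result still contains the raw `&amp;` entity, and in the second a fragment of the CDATA marker survives while part of the actual text is lost. An XML parser handles both cases correctly.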
