Python strip XML tags from document

I am trying to remove XML tags from a document using Python, a language I am just starting out in. Here is my first attempt using regex, which was really just a hope for a better idea.

    mfile = file("somefile.xml","w")
    for line in mfile:
        re.sub('<./>',"",line) #trying to match elements between < and />

It failed terribly. I would like to know how this should be done with regex.

Secondly, I googled and found: http://code.activestate.com/recipes/440481-strips-xmlhtml-tags-from-string/

which seems to work. But I would like to know if there is an easier way to get rid of all XML tags, perhaps using ElementTree?

3 answers

Please note that this is usually not something that should be done with regular expressions. See Jeremiah's answer.

Try the following:

    import re

    text = re.sub('<[^<]+>', "", open("/path/to/file").read())
    with open("/path/to/file", "w") as f:
        f.write(text)
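To see why the pattern in the question matched almost nothing, here is a minimal sketch that runs both patterns on the same string. The sample string and the variable names are made up for illustration. Two other things probably contributed to the failure in the question: `re.sub` returns a new string rather than changing `line` in place, and opening a file in `"w"` mode truncates it before anything is read.

```python
import re

# Hypothetical sample input, chosen only for illustration.
sample = '<note><to>Tove</to><b/></note>'

# Pattern from the question: '.' matches exactly one character, so only
# self-closing tags with a one-character name (such as <b/>) are matched.
question_result = re.sub('<./>', '', sample)

# Pattern from this answer: '<', then one or more characters that are
# not '<', then '>'.
answer_result = re.sub('<[^<]+>', '', sample)

print(question_result)  # <note><to>Tove</to></note>
print(answer_result)    # Tove
```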

The most reliable way to do this is probably with LXML .

    from lxml import etree
    ...
    tree = etree.parse('somefile.xml')
    notags = etree.tostring(tree, encoding='utf8', method='text')
    print(notags)

This avoids the problems of "parsing" XML with regular expressions, and it should handle escaping and similar details correctly.
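As a rough illustration of the escaping point, the sketch below compares a regex substitution with a real parser on two small inputs. It assumes the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without third-party packages. The helper names and sample strings are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def strip_with_regex(xml_text):
    # Remove anything that looks like a tag; entities are left untouched.
    return re.sub('<[^<]+>', '', xml_text)


def strip_with_parser(xml_text):
    # Parse the document and join the text of every element.
    return ''.join(ET.fromstring(xml_text).itertext())


escaped = '<p>Fish &amp; chips</p>'  # text containing an entity
literal_gt = '<p>2 > 1</p>'          # a bare '>' is legal in XML text

print(strip_with_regex(escaped))      # Fish &amp; chips
print(strip_with_parser(escaped))     # Fish & chips
print(strip_with_regex(literal_gt))   # ' 1' (the regex swallows '2 >')
print(strip_with_parser(literal_gt))  # 2 > 1
```

The regex leaves `&amp;` undecoded and treats the bare `>` as the end of a tag, while the parser returns the text as a reader would expect it.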


An alternative to Jeremiah's answer that does not require the external lxml library:

    import xml.etree.ElementTree as ET
    ...
    tree = ET.fromstring(Text)
    notags = ET.tostring(tree, encoding='utf8', method='text')
    print(notags)

Should work with any Python >= 2.5
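One detail worth knowing on Python 3: with `encoding='utf8'`, `ET.tostring` returns `bytes`, while `encoding='unicode'` returns a `str`. The sketch below shows both on a made-up input. The sample string and variable names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; &lt; and &gt; are escaped angle brackets.
text = '<note><to>Tove</to> <body>Remember &lt;this&gt;</body></note>'

root = ET.fromstring(text)

# A byte-oriented encoding name yields bytes on Python 3.
as_bytes = ET.tostring(root, encoding='utf8', method='text')

# The special value 'unicode' yields a str instead.
as_str = ET.tostring(root, encoding='unicode', method='text')

print(as_bytes)  # b'Tove Remember <this>'
print(as_str)    # Tove Remember <this>
```

With `method='text'` the entities are decoded and not escaped again, so the output is plain text rather than XML.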

