Avoid unescaped characters in XML using Python

Question

Avoid unescaped characters in XML using Python

I need to escape special characters in an invalid XML file with a length of about 5000 lines. Here is an example of XML that I have to deal with:

<root> <element> <name>name & surname</name> <mail> name@name.org </mail> </element> </root>

Here the problem is the & symbol in the name. How would you avoid special characters like this with the Python library? I have not found a way to do this with BeautifulSoup .

+7

python xml special-characters lxml beautifulsoup

Jérôme pigeot Feb 11 '11 at 17:38

source share

4 answers

You probably just want to make some simple regular expressions in HTML before throwing them into BeautifulSoup.

Even simpler, if there are no SGML objects ( &...; ) in the code, html=html.replace('&','&') will do the trick.

Otherwise, try the following:

 x ="<html><h1>Fish & Chips & Gravy</h1><p>Fish &amp; Chips &#x0026; Gravy</p>" import re q=re.sub(r'&([^a-zA-Z#])',r'&amp;\1',x) print q

Essentially, the regular expression searches for & , which is not followed by alphanumeric or # characters. It will not handle ampersands at the end of lines, but this is a possible fix.

+2

Dragon Feb 14 '11 at 19:02

source share

 <name>name & surname</name>

XML is not valid. It should be:

 <name>name &amp; surname</name>

All relevant XML tools should create this - you usually don't need to worry. If you create a string with the parameter '&' character, then the XML tool will output an escaped version. If you create the line manually, it is your responsibility to ensure that it escapes. If you are using an XML editor, you should avoid it for you.

If the file was provided to you by someone else, send it and inform that it is not correctly formed. If they no longer exist, you will have to use a simple text editor. It is fragile and dirty, but there is no other way. If the file has ampersands in another place that are used for escaping, then the file is garbage.

See 10 year post here and later here .

0

peter.murray.rust Feb 11 '11 at 17:50

source share

This answer provides XML sanitizer functions, although they do not escape characters without escaping, but simply discard them.

Using bs4 with lxml

A question was asked how to do this with Beautiful Soup. Here is a function that will sanitize a small XML object bytes . It has been tested with the requirements of the beautifulsoup4==4.8.0 and lxml==4.4.0 package. Please note that here lxml is required for bs4 .

 import xml.etree.ElementTree import bs4 def sanitize_xml(content: bytes) -> bytes: # Ref: https://stackoverflow.com/a/57450722/ try: xml.etree.ElementTree.fromstring(content) except xml.etree.ElementTree.ParseError: return bs4.BeautifulSoup(content, features='lxml-xml').encode() return content # already valid XML

Using only lxml

Obviously, using bs4 and lxml does not make much sense when this can only be done with lxml . This lxml==4.4.0 using the disinfection function is essentially derived from jfs answer.

 import lxml.etree def sanitize_xml(content: bytes) -> bytes: # Ref: https://stackoverflow.com/a/57450722/ try: lxml.etree.fromstring(content) except lxml.etree.XMLSyntaxError: root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True)) return lxml.etree.tostring(root) return content # already valid XML

0

Acumenus Aug 11 '19 at 14:18

source share

jfs · Accepted Answer · 2011-02-14T21:25:54+0000

If you don't care about invalid characters in xml, you can use the XML parser recover parameter (see Parse broken XML with lxml.etree.iterparse )

 from lxml import etree parser = etree.XMLParser(recover=True) # recover from bad characters. root = etree.fromstring(broken_xml, parser=parser) print etree.tostring(root)

Exit

 <root> <element> <name>name surname</name> <mail> name@name.org </mail> </element> </root>

Avoid unescaped characters in XML using Python

Exit

Using bs4 with lxml

Using only lxml

More articles: