Fix invalid XML with ampersands in Python

Question


I use Python to process an XML file that I get from another system. That system generates invalid XML: basically, it does not escape ampersands in the XML.
So, for example, I have lines like this:

<IceCream>Ben&Jerry</IceCream>

Of course, parsing this with SAX or DOM produces an invalid token error.
For more background: this is a large file (2 MB), quite flat, and it contains a lot of data in CDATA sections.

What I tried:

Writing a regex to replace only unescaped &, without re-escaping &gt; and such: &(?!\w{2,4};). It worked, but it also escaped the ampersands inside CDATA, which caused errors in the destination system. I cannot unescape everything in CDATA afterwards, because some of those ampersands must stay escaped.
Using Beautiful (Stone) Soup. No luck there either. Instead of escaping the bare ampersands, it created an entity (i.e. &Jerry;). Not good.
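
To illustrate the problem with the first attempt, here is a minimal sketch. The sample string is made up for illustration and is not taken from the real file; only the pattern comes from the attempt above.

```python
import re

# Pattern from the first attempt: an ampersand that is NOT followed by
# 2-4 word characters and a semicolon (i.e. not something like &gt; or &amp;).
BARE_AMP = re.compile(r"&(?!\w{2,4};)")

# Hypothetical sample: one bare ampersand in normal text, one inside CDATA.
sample = "<IceCream>Ben&Jerry</IceCream><Note><![CDATA[Tom&Huck &gt; x]]></Note>"

fixed = BARE_AMP.sub("&amp;", sample)
print(fixed)
# The ampersand in Ben&Jerry is escaped as intended, but the one inside
# the CDATA section (Tom&Huck) is rewritten as well, which is the unwanted part.
```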

The next step would be to write my own parser using a state machine. Please save me from going down that road.
The structure is not complex (very flat, no more than 4 levels deep), so perhaps a regular expression could match only the areas that are not inside CDATA.


+5

python dom xml regex sax

yulkes 22 '11 15:03


Eric Pruitt · Accepted Answer · 2011-05-22T15:31:39+0000

Try Python tidylib:

>>> import tidylib
>>> print(tidylib.tidy_document("<IceCream>Ben&Jerry</IceCream>", {"input_xml": True})[0])
<IceCream>Ben&amp;Jerry</IceCream>
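
If installing tidylib is not an option, the idea raised in the question (a regular expression that leaves CDATA alone) can be sketched with the standard library only. This is a rough sketch under assumptions, not a tested fix for the real file: the function name `escape_bare_ampersands` and the sample document are hypothetical, the lookahead is the one from the question, and it assumes CDATA sections are well formed and never nested.

```python
import re
import xml.etree.ElementTree as ET

# Either a complete CDATA section (captured in group 1) or a bare ampersand.
# The CDATA alternative comes first, so ampersands inside a CDATA section
# are consumed as part of that match and never reach the second alternative.
CDATA_OR_AMP = re.compile(
    r"(<!\[CDATA\[.*?\]\]>)"   # group 1: whole CDATA section, kept as-is
    r"|&(?!\w{2,4};)",         # bare ampersand (lookahead taken from the question)
    re.DOTALL,
)

def escape_bare_ampersands(text):
    """Escape bare ampersands outside CDATA sections (hypothetical helper)."""
    def repl(match):
        # Return CDATA sections unchanged; anything else is a bare ampersand.
        if match.group(1) is not None:
            return match.group(1)
        return "&amp;"
    return CDATA_OR_AMP.sub(repl, text)

# Hypothetical sample document, not from the real file.
doc = "<Menu><IceCream>Ben&Jerry</IceCream><Note><![CDATA[Tom&Huck]]></Note></Menu>"
fixed = escape_bare_ampersands(doc)

root = ET.fromstring(fixed)          # parses without an invalid token error
print(root.find("IceCream").text)    # Ben&Jerry
print(root.find("Note").text)        # Tom&Huck
```

Because the whole text is handled in a single pass, a 2 MB file should not be a problem, but the lookahead still treats anything shaped like `&name;` as already escaped, so the `&Jerry;` style of input would pass through unchanged.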
