Fix invalid XML with ampersands in Python

I use Python to manage the XML file that I get from another system. This system generates invalid XML. Basically, he does not shy away from some in XML.
So, for example, I have a few lines:

<IceCream>Ben&Jerry</IceCream>


Of course, when parsing SAX or DOM, it produces an invalid token error.
For a more general background, this is a very large file (2 MB), quite flat and contains a lot of data in CDATA.

What I tried:

  • Writing a Regex to replace only unesacped &, without reesacaping & gt; and such: &(?!\w{2,4};). He fixed it, but he slipped away from the ampersands in CDATA, which caused errors in the destination system. I cannot undo everything in CDATA afterwards, because some of them must remain shielded.
  • Using Beautiful (Stone) Soup . Also out of luck. Instead of running away from free ampersands, he created an object (i.e. &Jerry;). Not good.

The next step will be to write your own parser using a state machine. Save me from walking this road.
This is not a complex structure (very flat, no more than 4 layers), so perhaps a regular expression can catch areas that are not in CDATA.

.

+5
1

Python tidylib:

>>> import tidylib
>>> print tidylib.tidy_document("<IceCream>Ben&Jerry</IceCream>", {"input_xml": True})[0]
<IceCream>Ben&amp;Jerry</IceCream>

. , .

+3

All Articles