I'm using Python to handle an XML file that I get from another system. This system generates invalid XML: mainly, it does not escape ampersands.
So, for example, I have a few lines:
<IceCream>Ben&Jerry</IceCream>
Of course, when parsing with SAX or DOM, this produces an invalid token error.
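For reference, the failure can be reproduced with a minimal sketch like the one below. It assumes the standard library's `xml.dom.minidom` as the DOM parser (the post only says "SAX or DOM"), and the `snippet` variable is just the example line above.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# The example line from above, with a bare ampersand.
snippet = "<IceCream>Ben&Jerry</IceCream>"

try:
    xml.dom.minidom.parseString(snippet)
    parsed = True
except ExpatError as exc:
    # The underlying expat parser rejects the unescaped '&'.
    parsed = False
    print("parse failed:", exc)
```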
For more general background: this is a fairly large file (2 MB), quite flat, and it contains a lot of data in CDATA sections.
What I tried:
- Writing a regex to replace only unescaped &, without re-escaping &gt; and the like:
  &(?!\w{2,4};)
  It fixed the problem, but it also escaped the ampersands inside CDATA, which caused errors in the destination system. I cannot unescape everything in CDATA afterwards, because some of them must remain escaped.
- Using Beautiful(Stone)Soup. Also no luck: instead of escaping free ampersands, it created an entity (i.e. &Jerry;). Not good.
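The regex attempt described in the first item can be sketched roughly as follows. The function name `escape_all` and the sample strings are made up for illustration; only the pattern itself comes from the post.

```python
import re

# Pattern from the post: an '&' that is not followed by
# 2 to 4 word characters and a semicolon (e.g. "&gt;").
UNESCAPED_AMP = re.compile(r"&(?!\w{2,4};)")

def escape_all(text):
    """Escape every bare '&' in text, with no awareness of CDATA."""
    return UNESCAPED_AMP.sub("&amp;", text)

# A bare ampersand in element text is escaped, as intended.
print(escape_all("<IceCream>Ben&Jerry</IceCream>"))
# An existing entity such as &gt; is left alone.
print(escape_all("<a>1 &gt; 0</a>"))
# The problem: the ampersand inside CDATA is escaped too.
print(escape_all("<a><![CDATA[Ben&Jerry]]></a>"))
```

The last call shows the side effect: the content of the CDATA section is rewritten even though it was already valid as it stood.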
The next step would be to write my own parser using a state machine. Please save me from going down this road.
The structure is not complex (very flat, no more than 4 levels deep), so perhaps a regular expression could match only the regions that are not inside CDATA.
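One way this idea might look, as a rough sketch rather than a tested solution: split the text on CDATA sections and apply the ampersand regex only to the pieces in between. The function name `escape_outside_cdata` is hypothetical, and the sketch assumes CDATA sections are well-formed (every `<![CDATA[` has a closing `]]>`) and that the whole 2 MB file fits in memory as one string.

```python
import re

# Capturing group so that re.split keeps the CDATA sections in the result.
CDATA_SECTION = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)
# Same pattern as in the post: '&' not followed by something like "gt;".
UNESCAPED_AMP = re.compile(r"&(?!\w{2,4};)")

def escape_outside_cdata(text):
    """Escape bare ampersands, leaving CDATA sections untouched."""
    parts = CDATA_SECTION.split(text)
    # With one capturing group, even indices hold the text between
    # CDATA sections and odd indices hold the CDATA sections themselves.
    for i in range(0, len(parts), 2):
        parts[i] = UNESCAPED_AMP.sub("&amp;", parts[i])
    return "".join(parts)

sample = "<a>Ben&Jerry</a><b><![CDATA[Tom&Jerry &gt;]]></b>"
print(escape_outside_cdata(sample))
```

Note that the pattern only recognizes named entities of 2 to 4 characters, so a numeric reference such as `&#38;` would still be re-escaped; widening the lookahead would be needed if the source system emits those.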