Sanitizing Bad XML in Java

I use a third-party library that returns "XML" which is not actually valid: it contains illegal characters as well as undeclared entities. I need to parse this XML with a Java XML parser, but the parser chokes on it.

Is there a general way to clean up this XML so that it becomes valid?

+7
java xml
4 answers

I think your options are:

The first two are heavier, given that they are designed to parse poorly formed HTML. If you know that the problems are limited to encoding and entities, and the document is otherwise well-formed, I suggest rolling your own:

  • standardize the encoding to UTF-8
  • apply a standard encoder to the text between the > and < characters (the text nodes).
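
A minimal sketch of those two steps, assuming you know which charset the library really emits. The class and method names (XmlSanitizer, toUtf8, escapeStrayAmpersands) are hypothetical, and instead of a full text-node encoder it only escapes ampersands that do not start a predefined entity or a numeric character reference:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from any library.
final class XmlSanitizer {
    // Matches '&' that does NOT start one of the five predefined entities
    // or a decimal/hex character reference.
    private static final Pattern STRAY_AMPERSAND =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    private XmlSanitizer() {}

    /** Decodes the raw bytes with the charset the source really uses, then re-encodes as UTF-8. */
    static byte[] toUtf8(byte[] raw, Charset actualCharset) {
        return new String(raw, actualCharset).getBytes(StandardCharsets.UTF_8);
    }

    /** Escapes every '&' that a parser without a DTD would not understand. */
    static String escapeStrayAmpersands(String xml) {
        return STRAY_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }
}
```

Note the trade-off: an undeclared reference such as &nbsp; becomes the literal text "&nbsp;" in the parsed document rather than a non-breaking space. If the input carries an XML declaration naming a different encoding, parsing from a String or Reader sidesteps the mismatch.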
+6

It sounds like you need to find out whether there is a way to automatically clean the data before passing it to the parser. In what way are the characters invalid: are they invalid in the declared character set, or are they unescaped XML metacharacters such as '<'?

For undeclared entities, I once solved this by setting up a SAX parser with an error handler that basically ignored these errors. That may help you too. See the ErrorHandler API.
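
A rough sketch of such a handler, using only the JDK's SAX API. LenientHandler is a hypothetical name, and validation is switched on here only so that there are recoverable errors to demonstrate. Which problems reach error() rather than fatalError() depends on the parser and the document; an undeclared entity in a document with no DTD at all may well be reported as fatal, in which case a handler alone is not enough:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records warnings and recoverable errors instead of aborting.
class LenientHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();
    int elements = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++; // stand-in for the real content handling
    }

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable error: note it and let the parser carry on.
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        throw e; // the parser cannot reliably continue past a fatal error
    }

    static LenientHandler parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // assumption: only here to provoke recoverable errors
        SAXParser parser = factory.newSAXParser();
        LenientHandler handler = new LenientHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }
}
```

Calling LenientHandler.parse("<root><child/></root>") should finish normally, with the validation complaints collected in problems rather than thrown.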

+3
+1

For illegal characters, I would recommend implementing a filtering Reader; just replace them (provided they are control characters) with a space or something similar.
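
One possible shape for such a filter, built on java.io.FilterReader. The class name XmlCharFilterReader is hypothetical, and the sketch assumes any surrogate characters arrive in valid pairs, so it lets them through unchecked:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical class name; replaces characters that XML 1.0 forbids with a space.
class XmlCharFilterReader extends FilterReader {
    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 allows tab, LF, CR and everything from 0x20 up to 0xFFFD.
    // Surrogates fall inside that range and are assumed to be correctly paired.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return (c == -1 || isAllowed((char) c)) ? c : ' ';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (!isAllowed(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }
}
```

Wrap the library's output in it before handing it to the parser, for example new InputSource(new XmlCharFilterReader(reader)).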

Undeclared entities are more complex; some XML parsers allow you to define an alternative DTD to use (Woodstox, at least). If so, you can supply a DTD that declares the entities you need.
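
The sketch below does not use Woodstox's DTD override. It swaps in a simpler variant of the same idea that works with the JDK parser alone: prepend a DOCTYPE whose internal subset declares the missing entities. It assumes the input has no XML declaration or DOCTYPE of its own; the class name, the root name "root", and the two entities are illustrative:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: declares the missing entities in an internal DTD subset
// that is prepended to the input before parsing.
class EntityDeclaringParser {
    // Assumption: these are the entities the third-party output uses.
    static final String DOCTYPE =
            "<!DOCTYPE root [\n"
          + "  <!ENTITY nbsp \"&#160;\">\n"
          + "  <!ENTITY copy \"&#169;\">\n"
          + "]>\n";

    static Document parse(String xml) throws Exception {
        // Non-validating parser: the DOCTYPE is read for its entity
        // declarations, but the document is not checked against it.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(DOCTYPE + xml)));
    }
}
```

With this in place, EntityDeclaringParser.parse("<root>a&nbsp;b</root>") should yield a document whose text contains a real non-breaking space. If DOCTYPE processing has been disabled on the factory for security reasons, this approach will not work as written.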

0
