So, in my current project, I'm using the JAXB RI with the default Java parser from the Sun JRE (which I believe is Xerces) to unmarshal arbitrary XML.
First, I use XJC to compile an XSD of the following form:
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="foobar">
        ...
    </xs:element>
</xs:schema>
In the "good case" everything works as designed. That is, if I am passed XML that conforms to this schema, JAXB correctly unmarshals it into an object tree.
The problem occurs when I am passed XML with an external DTD reference, for example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foobar SYSTEM "http://blahblahblah/foobar.dtd">
<foobar></foobar>
When unmarshalling something like this, the SAX parser tries to load the remote entity ("http://somehost/foobar.dtd"), even though this snippet clearly does not conform to the schema that I compiled earlier with XJC.
To get around this behavior, since I know that any conforming XML (according to the compiled XSD) will never require loading a remote entity, I have to define my own EntityResolver that short-circuits the loading of all remote entities. So instead of doing something like:
MyClass foo = (MyClass) myJAXBContext.createUnmarshaller().unmarshal(myReader);
I am forced to do this:
XMLReader myXMLReader = mySAXParser.getXMLReader();
myXMLReader.setEntityResolver(myCustomEntityResolver);
SAXSource mySAXSource = new SAXSource(myXMLReader, new InputSource(myReader));
MyClass foo = (MyClass) myJAXBContext.createUnmarshaller().unmarshal(mySAXSource);
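For reference, here is a minimal sketch of the kind of resolver I mean. It is not my exact code: the class name `NoRemoteEntities` and the helpers `newReader` and `rootElementOf` are placeholders made up for this post, and JAXB is left out so that the snippet runs on a plain JDK. The idea is simply to answer every external entity request, including the external DTD subset, with an empty stream so that the parser never opens a connection.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; all names here are illustrative only.
final class NoRemoteEntities {

    // Resolves every external entity (including an external DTD) to an
    // empty character stream instead of fetching the system identifier.
    static final EntityResolver EMPTY_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    // Builds a namespace-aware XMLReader with the resolver installed.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_RESOLVER);
        return reader;
    }

    // Parses the XML with the resolver in place and returns the local
    // name of the root element, to show that parsing completes without
    // the DTD ever being fetched.
    static String rootElementOf(String xml) {
        final String[] root = new String[1];
        try {
            XMLReader reader = newReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = localName;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return root[0];
    }
}
```

A reader built this way can be wrapped in a `SAXSource` and handed to the unmarshaller exactly as in the snippet above.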
So my main question is:
When unmarshalling with JAXB, should the loading of remote entities by the SAX parser be short-circuited automatically when the XML in question can be recognized as invalid without loading those remote entities?
Also, doesn't this seem like a security issue? Given that JAX-WS relies on JAXB under the hood, it seems that I could pass specially crafted XML to any JAX-WS web service and force the WS host to load arbitrary URLs.
I'm a relative newbie to this, so I am probably missing something. Please let me know if so!