How to parse an XML file containing a BOM?

I want to parse an XML file from a URL using JDOM. But when I try:

 SAXBuilder builder = new SAXBuilder();
 builder.build(aUrl);

I get this exception:

 Invalid byte 1 of 1-byte UTF-8 sequence. 

I thought this might be a BOM (byte order mark) issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading the URL with aUrl.openStream() and stripping the BOM with the Commons IO BOMInputStream. To my surprise, it did not detect any BOM. I then tried reading from the stream, writing to a local file, and parsing that file. I set the encoding of both the InputStreamReader and the OutputStreamWriter to UTF-8, but when I opened the file it contained garbage characters.

I thought the problem was with the encoding of the original URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process described above, everything works fine.

I would appreciate any help identifying the possible cause of this problem.

2 answers

This HTTP server sends the content GZIPped (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream, which will decompress it for you. For example:

 builder.build(new GZIPInputStream(aUrl.openStream())); 

Edited to add, based on a comment: if you don't know in advance whether the URL's content will be GZIPped, you can write something like this:

 private InputStream openStream(final URL url) throws IOException {
     final URLConnection cxn = url.openConnection();
     final String contentEncoding = cxn.getContentEncoding();
     if (contentEncoding == null)
         return cxn.getInputStream();
     else if (contentEncoding.equalsIgnoreCase("gzip")
             || contentEncoding.equalsIgnoreCase("x-gzip"))
         return new GZIPInputStream(cxn.getInputStream());
     else
         throw new IOException("Unexpected content-encoding: " + contentEncoding);
 }

(warning: not tested) and then use:

 builder.build(openStream(aUrl));

This is basically equivalent to the first example (aUrl.openStream() is explicitly documented as shorthand for aUrl.openConnection().getInputStream()), except that it checks the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection .
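
To try the Content-Encoding dispatch without a network connection, here is a minimal sketch that works on an in-memory GZIPped body. The class name ContentEncodingDemo and the helpers decode, gzip and readAll are hypothetical names chosen for this illustration, and treating "identity" like a missing header is an added assumption. It also prints the first two bytes of the compressed body, which are the gzip magic number rather than XML text; that is probably why the parser rejected the raw stream as invalid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ContentEncodingDemo {

    // Hypothetical helper: wrap the raw body according to the Content-Encoding value.
    static InputStream decode(String contentEncoding, InputStream raw) throws IOException {
        if (contentEncoding == null || contentEncoding.equalsIgnoreCase("identity")) {
            return raw; // body is not compressed
        }
        if (contentEncoding.equalsIgnoreCase("gzip")
                || contentEncoding.equalsIgnoreCase("x-gzip")) {
            return new GZIPInputStream(raw);
        }
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
    }

    // Compress bytes in memory to stand in for a GZIPped HTTP response body.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    // Read a stream to the end and decode it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("<root>hello</root>".getBytes(StandardCharsets.UTF_8));
        // The gzip magic number: 1f 8b
        System.out.printf("first bytes: %02x %02x%n", body[0] & 0xff, body[1] & 0xff);
        // Decompressed, the body is ordinary XML again.
        System.out.println(readAll(decode("gzip", new ByteArrayInputStream(body))));
    }
}
```

With a real URL, the header value would come from cxn.getContentEncoding() and the raw stream from cxn.getInputStream(), as in the openStream method above.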


Perhaps you can avoid handling encoded responses altogether by sending an empty Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html : "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be what is happening here.
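
A minimal sketch of setting that header with java.net.URLConnection follows. The class name IdentityRequestDemo, the helper prepare, and the example.com URL are placeholders for illustration. The sketch sends the value "identity" rather than an empty value; that is an assumption on my part, chosen because it states the intent explicitly, and the server is still free to ignore it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

class IdentityRequestDemo {

    // Hypothetical helper: prepare (but do not yet open) a connection
    // that asks the server for an uncompressed body.
    static URLConnection prepare(URL url) throws IOException {
        URLConnection cxn = url.openConnection(); // no network traffic happens here
        cxn.setRequestProperty("Accept-Encoding", "identity");
        return cxn;
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://example.com/feed.xml").toURL(); // placeholder URL
        URLConnection cxn = prepare(url);
        // Shows the header that will be sent once the connection is opened.
        System.out.println(cxn.getRequestProperty("Accept-Encoding"));
    }
}
```

Since a server may still reply with a compressed body, it seems safer to keep checking the Content-Encoding of the response before handing cxn.getInputStream() to the parser.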
