XML document is read as Latin-1, but is half-converted to UTF-8 on output

I'm banging my head against a brick wall over a bizarre problem that I know will have an obvious answer, but I can't see it for the life of me. It all has to do with encoding. Before the code, a simple description: I want to take an XML document that is encoded in Latin-1 (ISO-8859-1) and send it completely unchanged over an HttpURLConnection. I have a small test class and raw XML that show my problem. The XML file contains the Latin-1 character 0xa2 (the cent sign), which on its own is not a valid UTF-8 byte sequence. I use it deliberately as my test case. The XML declaration says ISO-8859-1. I can read the file without any problems, but when I convert the org.w3c.dom.Document to a byte[] array to send over the HttpURLConnection, the 0xa2 character is converted to its UTF-8 encoding (0xc2 0xa2) while the declaration remains ISO-8859-1. In other words, anyone reading the output as Latin-1 sees two characters, which is completely wrong.
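The byte-level difference described above can be reproduced in isolation. This is a minimal sketch (the class and method names are made up for illustration and are not from the original post) that encodes the cent sign U+00A2 in both charsets and prints the resulting bytes as hex:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the original post.
class CentSignBytes {

    // Formats a byte array as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the text in the given charset and returns the hex dump.
    static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String cent = "\u00A2"; // the cent sign
        // One byte in Latin-1: a2
        System.out.println("ISO-8859-1: " + encode(cent, StandardCharsets.ISO_8859_1));
        // Two bytes in UTF-8: c2 a2
        System.out.println("UTF-8:      " + encode(cent, StandardCharsets.UTF_8));
    }
}
```

So a serializer that writes UTF-8 while the declaration still says ISO-8859-1 produces exactly the 0xc2 0xa2 pair seen in the output file.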

Code that does this:

 // Parse the input file into a DOM document
 FileInputStream input = new FileInputStream( "input-file" );
 DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 factory.setNamespaceAware( true );
 DocumentBuilder builder = factory.newDocumentBuilder();
 Document document = builder.parse( input );

 // Set up the DOM as the source and an in-memory buffer as the result
 Source source = new DOMSource( document );
 ByteArrayOutputStream baos = new ByteArrayOutputStream();
 Result result = new StreamResult( baos );

 // Serialize the DOM with an identity transformer
 Transformer transformer = TransformerFactory.newInstance().newTransformer();
 transformer.transform( source, result );

 // Dump the serialized bytes to a file for inspection
 byte[] bytes = baos.toByteArray();
 FileOutputStream fos = new FileOutputStream( "output-file" );
 fos.write( bytes );

I'm just writing it to a file for now, while I figure out what on earth is transforming this character. The input file has 0xa2; the output file contains 0xc2 0xa2. One way to fix this is to put this line in the second-to-last block, before the call to transform():

 transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1"); 

However, not all the XML documents I will be working with are Latin-1; most will in fact be UTF-8 when they come in. Surely I shouldn't have to work out what the encoding is so that I can feed it to the transformer? I mean, surely it should work this out for itself, and I'm just doing something else wrong?

It occurred to me that I could just query the document to find out its encoding, so one extra line might do the trick:

 transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding()); 

However, I then decided this was not the answer, since document.getInputEncoding() returns a different string when I run it in a terminal on a Linux box than when I run it in Eclipse on my Mac.

Any clues would be appreciated. I fully accept that I'm missing something obvious.

3 answers

Yes, by default XML documents are written as UTF-8, so you need to tell the Transformer explicitly to use a different encoding. Your last edit is the "trick" for doing this, so that the output always matches the encoding of the XML input:

 transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding()); 

The only question is: do you really need to preserve the input encoding?
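A minimal sketch of that suggestion, assuming the document carries an XML declaration: getXmlEncoding() returns null when no encoding is declared, so the sketch falls back to UTF-8 in that case. The class name, the helper methods, and the sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical class for illustration; not part of the original thread.
class SameEncodingRoundTrip {

    // Parses the XML bytes and serializes the DOM again, using the encoding
    // named in the XML declaration. Falls back to UTF-8 if none is declared.
    static byte[] roundTrip(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new ByteArrayInputStream(xml));

        String encoding = document.getXmlEncoding(); // null if not declared
        if (encoding == null) {
            encoding = "UTF-8";
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toByteArray();
    }

    // Returns true if the array contains the given unsigned byte value.
    static boolean contains(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xff) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Sample Latin-1 document containing the cent sign as the byte 0xa2.
        byte[] input = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>5\u00A2</price>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] output = roundTrip(input);
        System.out.println("contains 0xc2: " + contains(output, 0xc2));
        System.out.println("contains 0xa2: " + contains(output, 0xa2));
    }
}
```

Note that this preserves the encoding, not the exact bytes: the serializer may still change whitespace, attribute quoting, or the declaration itself.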


Why not just open it with a regular FileInputStream and stream the bytes from that directly to the output stream? Why load it into an in-memory DOM if you are just sending it byte for byte over an HttpURLConnection?
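A rough sketch of that byte-for-byte approach, with no XML parsing involved. The class and method names are hypothetical, and the in-memory streams in main() only stand in for the real file and connection streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical class for illustration; not part of the original thread.
class RawCopy {

    // Copies every byte from in to out unchanged and returns the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data: "<a>", the Latin-1 cent sign byte, "</a>".
        byte[] data = {0x3c, 0x61, 0x3e, (byte) 0xa2, 0x3c, 0x2f, 0x61, 0x3e};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println("copied " + copied + " bytes");
    }
}
```

In the real program, the input would presumably be the FileInputStream for the XML file and the output would be the stream returned by HttpURLConnection.getOutputStream() after calling setDoOutput(true). Since nothing decodes the bytes, the encoding never comes into play.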

Edit: according to the Javadoc for Document, you should probably use document.getXmlEncoding() to get the encoding that matches the one declared in the XML prolog.


This may be useful: too long for a comment, but not really an answer. From the spec:

The encoding attribute specifies the preferred encoding to use for outputting the result tree. XSLT processors are required to respect values of UTF-8 and UTF-16. For other values, if the XSLT processor does not support the specified encoding it may signal an error; if it does not signal an error it should use UTF-8 or UTF-16 instead.

You can test with encoding="junk" to see what your processor does.
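One way to run that experiment is sketched below. The outcome depends on the XSLT processor in use (it may throw, warn, or silently fall back), so the sketch only reports what happened and makes no claim about the result. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class for illustration; not part of the original thread.
class JunkEncodingProbe {

    // Serializes a tiny document with the requested output encoding and
    // describes what happened: the bytes written, or the exception raised.
    static String probe(String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            root.setTextContent("\u00A2"); // the cent sign
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));

            StringBuilder hex = new StringBuilder();
            for (byte b : out.toByteArray()) {
                hex.append(String.format("%02x ", b & 0xff));
            }
            return "wrote " + out.size() + " bytes: " + hex.toString().trim();
        } catch (Exception e) {
            return "error: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("junk"));
        System.out.println(probe("ISO-8859-1"));
    }
}
```

Comparing the hex dumps for "junk" and "ISO-8859-1" shows whether the processor fell back to UTF-8 (c2 a2 for the cent sign) or did something else.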

Valid values for Java are described here. See also the IANA encodings.
