I've been banging my head against a brick wall over a bizarre problem. I'm sure there is an obvious answer, but I can't see it for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take an XML document that is encoded in Latin1 (ISO-8859-1) and send it, completely unchanged, over an HttpURLConnection.

I have a small test class and a raw XML file that show my problem. The XML file contains the Latin1 character 0xa2 (the cent sign), which on its own is an invalid byte sequence in UTF-8. I am using it as my test case on purpose. The XML declaration says ISO-8859-1. I can read the file without any problems, but when I convert the org.w3c.dom.Document to a byte[] array to send over the HttpURLConnection, the 0xa2 character is converted to its UTF-8 encoding (0xc2 0xa2) while the declaration remains ISO-8859-1. In other words, the one character becomes two bytes, which is completely wrong for a document that still claims to be ISO-8859-1.
Here is my test code, which reads the file and serializes the Document:
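To make the byte-level difference concrete, here is a minimal sketch that uses only the standard java.nio.charset API. The class name CentSignBytes and its helper hex are made up for illustration and are not part of my test class.

```java
import java.nio.charset.StandardCharsets;

final class CentSignBytes {

    /** Formats bytes as space-separated lowercase hex, e.g. "c2 a2". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cent = "\u00a2"; // the cent sign, U+00A2

        // One byte in Latin1, two bytes in UTF-8.
        System.out.println(hex(cent.getBytes(StandardCharsets.ISO_8859_1))); // a2
        System.out.println(hex(cent.getBytes(StandardCharsets.UTF_8)));      // c2 a2

        // A reader that trusts the ISO-8859-1 declaration sees two characters.
        byte[] utf8 = cent.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

This is why the mismatch matters: the bytes are valid UTF-8, but the declaration tells the receiver to decode them as Latin1.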
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );

Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();

FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
For the moment I am just writing the bytes to a file while I figure out why on earth this character is being transformed. The input file has 0xa2; the output file contains 0xc2 0xa2. One way to fix this is to add the following line just before the call to transformer.transform():
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all the XML documents I will be dealing with are Latin1; most will in fact be UTF-8 when they come in. Surely I shouldn't have to work out what the encoding is just so that I can feed it to the transformer? I mean, surely this should be handled for me, and I am just doing something else wrong?
It occurred to me that I could simply ask the document for its encoding, so that one extra line would do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then decided that this was not the answer, since document.getInputEncoding() returns a different string when I run it in a terminal on a Linux box than when I run it in Eclipse on my Mac.
Any hints would be appreciated. I fully accept that I am probably missing something obvious.
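A possible variation, sketched under assumptions rather than verified across parsers: the DOM Level 3 Document interface also has getXmlEncoding(), which reports the encoding named in the XML declaration rather than the one detected at parse time. The sketch below prefers that value, falls back to getInputEncoding(), and finally to UTF-8. The class EncodingPreservingSerializer and the helper chooseEncoding are hypothetical names, and checked exceptions are wrapped only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

final class EncodingPreservingSerializer {

    /** Parses raw XML bytes, letting the parser read the declared encoding. */
    static Document parse(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Hypothetical helper: declared encoding, then detected encoding, then UTF-8. */
    static String chooseEncoding(Document document) {
        String encoding = document.getXmlEncoding();
        if (encoding == null) {
            encoding = document.getInputEncoding();
        }
        return encoding != null ? encoding : "UTF-8";
    }

    /** Serializes the document using the encoding chosen above. */
    static byte[] serialize(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, chooseEncoding(document));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the Latin1 input file containing byte 0xa2.
        byte[] input = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00a2</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] output = serialize(parse(input));
        for (byte b : output) {
            System.out.printf("%02x ", b & 0xff); // dump the bytes for inspection
        }
        System.out.println();
    }
}
```

Whether getXmlEncoding() behaves the same on both of my environments is exactly what would need checking. It returns null when the document has no encoding declaration, hence the fallbacks.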