Converting a String to byte[] returns the wrong value (encoding?)

I read a byte[] from a file and converted it to a String:

    byte[] bytesFromFile = Files.readAllBytes(...);
    String stringFromFile = new String(bytesFromFile, "UTF-8");

I want to compare this with another byte[] that I get from the web service:

    String stringFromWebService = webService.getMyByteString();
    byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");

So I read a byte[] from the file and converted it to a String, and I got a String from my web service and converted it to a byte[]. Then I run the following tests:

    // works!
    org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);

    // fails!
    org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);

Why does the second statement fail?

+5

3 answers

Other answers have covered the likely root cause: the file is probably not actually UTF-8 encoded, which produces the described symptoms.

However, I think the most interesting aspect of this is not that the byte[] assertion fails, but that the assertion on the String values passes. I am not 100% sure why this is so, but a trawl through the source code suggests an answer: new String(bytes, "UTF-8") goes through StringCoding.decode(), which runs the charset decoder with substitution enabled, so any malformed byte sequence is silently replaced with the replacement character U+FFFD instead of throwing. Presumably the String from the web service contains those same replacement characters, so the two Strings compare equal; but re-encoding U+FFFD produces the bytes EF BF BD rather than the original invalid bytes, so the two arrays differ.

This hypothesis is supported by the fact that the catch for CharacterCodingException in StringCoding.decode() says:

    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
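A minimal, self-contained sketch (not part of the original answer) demonstrating that substitution, and why it leaves the Strings equal while the byte arrays differ:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ReplacementDemo {
        public static void main(String[] args) {
            // 0xC3 followed by 0x28 is a malformed UTF-8 sequence
            byte[] invalid = { (byte) 0xC3, (byte) 0x28 };

            // The String constructor silently substitutes U+FFFD for the
            // malformed sequence instead of throwing
            String decoded = new String(invalid, StandardCharsets.UTF_8);

            // Re-encoding yields the UTF-8 bytes of U+FFFD (EF BF BD),
            // not the original malformed bytes
            byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(invalid));   // [-61, 40]
            System.out.println(Arrays.toString(reencoded)); // [-17, -65, -67, 40]
        }
    }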
+1

I don't understand it completely yet, but here is what I have found so far:

The problem is that the data contains some bytes that are not valid UTF-8, which I verified with the following check:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;

    // returns false for my data!
    public static boolean isValidUTF8(byte[] input) {
        CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
        try {
            cs.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I haven't figured out yet) is why my conversion (new String(bytesFromFile, "UTF-8")) does not raise any exception (the way my isValidUTF8 method does), even though the data is not valid UTF-8.
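One likely explanation: the String(byte[], ...) constructor is documented to always replace malformed input, while a decoder obtained from newDecoder() defaults to CodingErrorAction.REPORT and therefore throws. A minimal sketch of a strict decoder (decodeStrictUtf8 is an invented helper name):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StrictDecode {
        // Throws CharacterCodingException on malformed input instead of
        // substituting U+FFFD, unlike new String(bytes, "UTF-8")
        public static String decodeStrictUtf8(byte[] input) throws CharacterCodingException {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // already the default
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // already the default
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        }
    }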

However, I think I will also go ahead and encode my byte[] as a Base64 string, since I don't want any more encoding problems.
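For what it's worth, a minimal sketch of such a Base64 round trip using java.util.Base64 (Java 8+); the round trip is lossless because Base64 maps bytes to pure ASCII and back with no charset ambiguity:

    import java.util.Arrays;
    import java.util.Base64;

    public class Base64RoundTrip {
        public static void main(String[] args) {
            byte[] original = { (byte) 0xC3, (byte) 0x28 }; // invalid UTF-8; irrelevant to Base64
            String encoded = Base64.getEncoder().encodeToString(original);
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(Arrays.equals(original, decoded)); // true
        }
    }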

0

The real problem in your code is that you don't know the actual encoding of the file. When you receive a String from the web service, you get a sequence of characters; when you convert that String from characters to bytes, the conversion is performed correctly, because you specify how chars map to bytes with a specific encoding ("UTF-8"). When you read a text file, you face the opposite problem: you have a sequence of bytes that must be converted to characters. To do this correctly, you must know how the characters were converted to bytes in the first place, that is, what the encoding of the file is. If not specified, this is the platform default: on Windows, files are typically encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, but UTF-8 is, I think, the usual default.
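To make this concrete, a minimal sketch (the file name data.txt is hypothetical) showing the platform default and how the same bytes decode differently under different charsets:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CharsetDemo {
        public static void main(String[] args) throws Exception {
            // The platform default charset, used when none is specified
            System.out.println(System.getProperty("file.encoding"));

            byte[] bytes = Files.readAllBytes(Paths.get("data.txt"));
            // ISO-8859-1 maps every possible byte to a character, so it
            // never fails; UTF-8 substitutes U+FFFD for malformed sequences
            String asUtf8   = new String(bytes, StandardCharsets.UTF_8);
            String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
            System.out.println(asUtf8.equals(asLatin1)); // false unless the file is pure ASCII
        }
    }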

By the way, the web service call performs these encoding operations under the hood: the HTTP exchange carries a header that specifies how the characters are encoded, i.e. how to read the bytes from the socket and then convert them to characters. So when the SOAP web service call returns XML (which can then be mapped to a Java object), all the encoding operations are performed correctly.
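As a purely hypothetical illustration of what such a stack does with that header, a sketch that extracts the declared charset from a Content-Type value (charsetFromContentType is an invented helper, and the UTF-8 fallback is an assumption):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ContentTypeCharset {
        // e.g. "text/xml; charset=ISO-8859-1" -> ISO-8859-1
        static Charset charsetFromContentType(String contentType) {
            Matcher m = Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE)
                               .matcher(contentType);
            return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
        }

        public static void main(String[] args) {
            System.out.println(charsetFromContentType("text/xml; charset=ISO-8859-1"));
        }
    }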

So if you need to read characters from a file, you have to face the encoding problem. You can use Base64, as you stated, but you then lose one of the main advantages of text files: they are human-readable, which makes them easier to debug and develop with.

0
