Converting a String to byte[] returns the wrong value (encoding?)

I read a byte[] from a file and converted it to a String:

    byte[] bytesFromFile = Files.readAllBytes(...);
    String stringFromFile = new String(bytesFromFile, "UTF-8");

I want to compare this with another byte[] that I get from the web service:

    String stringFromWebService = webService.getMyByteString();
    byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");

So I read a byte[] from the file and converted it to a String, and I got a String from my web service and converted it to a byte[]. Then I run the following tests:

    // works!
    org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);

    // fails!
    org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);

Why does the second statement fail?

+5

3 answers

Other answers have covered the likely root cause: the file is probably not actually UTF-8 encoded, which produces the described symptoms.

However, I think the most interesting aspect of this is not that the byte[] assertion fails, but that the assertion on the String values passes. I am not 100% sure why this is so, but a trawl through the source code suggests an answer: new String(bytes, "UTF-8") goes through StringCoding.decode(), which runs the charset decoder with substitution enabled, so any malformed byte sequence is silently replaced with the replacement character U+FFFD instead of throwing. Presumably the String from the web service contains those same replacement characters, so the two Strings compare equal; but re-encoding U+FFFD produces the bytes EF BF BD rather than the original invalid bytes, so the two arrays differ.

This hypothesis is supported by the fact that the catch for CharacterCodingException in StringCoding.decode() says:

    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
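A minimal, self-contained sketch (not part of the original answer) demonstrating that substitution, and why it leaves the Strings equal while the byte arrays differ:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ReplacementDemo {
        public static void main(String[] args) {
            // 0xC3 followed by 0x28 is a malformed UTF-8 sequence
            byte[] invalid = { (byte) 0xC3, (byte) 0x28 };

            // The String constructor silently substitutes U+FFFD for the
            // malformed sequence instead of throwing
            String decoded = new String(invalid, StandardCharsets.UTF_8);

            // Re-encoding yields the UTF-8 bytes of U+FFFD (EF BF BD),
            // not the original malformed bytes
            byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(invalid));   // [-61, 40]
            System.out.println(Arrays.toString(reencoded)); // [-17, -65, -67, 40]
        }
    }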
+1

I don't understand it completely yet, but here is what I have found so far:

The problem is that the data contains some bytes that are not valid UTF-8, which I verified with the following check:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;

    // returns false for my data!
    public static boolean isValidUTF8(byte[] input) {
        CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
        try {
            cs.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I haven't figured out yet) is why my conversion (new String(bytesFromFile, "UTF-8")) does not raise any exception (the way my isValidUTF8 method does), even though the data is not valid UTF-8.
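One likely explanation: the String(byte[], ...) constructor is documented to always replace malformed input, while a decoder obtained from newDecoder() defaults to CodingErrorAction.REPORT and therefore throws. A minimal sketch of a strict decoder (decodeStrictUtf8 is an invented helper name):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StrictDecode {
        // Throws CharacterCodingException on malformed input instead of
        // substituting U+FFFD, unlike new String(bytes, "UTF-8")
        public static String decodeStrictUtf8(byte[] input) throws CharacterCodingException {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // already the default
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // already the default
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        }
    }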

However, I think I will also go ahead and encode my byte[] as a Base64 string, since I don't want any more encoding problems.
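For what it's worth, a minimal sketch of such a Base64 round trip using java.util.Base64 (Java 8+); the round trip is lossless because Base64 maps bytes to pure ASCII and back with no charset ambiguity:

    import java.util.Arrays;
    import java.util.Base64;

    public class Base64RoundTrip {
        public static void main(String[] args) {
            byte[] original = { (byte) 0xC3, (byte) 0x28 }; // invalid UTF-8; irrelevant to Base64
            String encoded = Base64.getEncoder().encodeToString(original);
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(Arrays.equals(original, decoded)); // true
        }
    }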

0

The real problem in your code is that you don't know the actual encoding of the file. When you receive a String from the web service, you get a sequence of characters; when you convert that String from characters to bytes, the conversion is performed correctly, because you specify how chars map to bytes with a specific encoding ("UTF-8"). When you read a text file, you face the opposite problem: you have a sequence of bytes that must be converted to characters. To do this correctly, you must know how the characters were converted to bytes in the first place, that is, what the encoding of the file is. If not specified, this is the platform default: on Windows, files are typically encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, but UTF-8 is, I think, the usual default.
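To make this concrete, a minimal sketch (the file name data.txt is hypothetical) showing the platform default and how the same bytes decode differently under different charsets:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CharsetDemo {
        public static void main(String[] args) throws Exception {
            // The platform default charset, used when none is specified
            System.out.println(System.getProperty("file.encoding"));

            byte[] bytes = Files.readAllBytes(Paths.get("data.txt"));
            // ISO-8859-1 maps every possible byte to a character, so it
            // never fails; UTF-8 substitutes U+FFFD for malformed sequences
            String asUtf8   = new String(bytes, StandardCharsets.UTF_8);
            String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
            System.out.println(asUtf8.equals(asLatin1)); // false unless the file is pure ASCII
        }
    }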

By the way, the web service call performs these encoding operations under the hood: the HTTP exchange carries a header that specifies how the characters are encoded, i.e. how to read the bytes from the socket and then convert them to characters. So when the SOAP web service call returns XML (which can then be mapped to a Java object), all the encoding operations are performed correctly.
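As a purely hypothetical illustration of what such a stack does with that header, a sketch that extracts the declared charset from a Content-Type value (charsetFromContentType is an invented helper, and the UTF-8 fallback is an assumption):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ContentTypeCharset {
        // e.g. "text/xml; charset=ISO-8859-1" -> ISO-8859-1
        static Charset charsetFromContentType(String contentType) {
            Matcher m = Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE)
                               .matcher(contentType);
            return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
        }

        public static void main(String[] args) {
            System.out.println(charsetFromContentType("text/xml; charset=ISO-8859-1"));
        }
    }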

So if you need to read characters from a file, you have to face the encoding problem. You can use Base64, as you stated, but you then lose one of the main advantages of text files: they are human-readable, which makes them easier to debug and develop with.

0
