Java 8 change in UTF-8 decoding

Question

We recently migrated our application to JDK 8 from JDK 7. After this change, we had a problem with the following code fragment.

String output = new String(byteArray, "UTF-8");

The byte array may contain invalid UTF-8 byte sequences. Decoding the same byte array as UTF-8 produces two different strings in Java 7 and Java 8.
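To compare the two results precisely, it can help to print the decoded string as hex code units on each JDK and diff the output. The following is only a sketch: the class name `DecodeDump`, the helper `toHex`, and the sample bytes are assumptions made for illustration, not part of the original application.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDump {

    // Renders each UTF-16 code unit as four hex digits so results can be diffed as text.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'A', a three-byte sequence cut off after two bytes, then 'B'.
        byte[] bytes = { 0x41, (byte) 0xE2, (byte) 0x82, 0x42 };
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // Run this on both JDKs and compare the printed lines.
        System.out.println(System.getProperty("java.version") + ": " + toHex(decoded));
    }
}
```

The code avoids lambdas and streams so that the same source compiles on Java 7 as well.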

According to the answer to this SO post, Java 8 "fixes" an error in Java 7: it now replaces invalid UTF-8 byte sequences with replacement characters in a way that conforms to the UTF-8 specification.

But we would like to stick with the Java 7 decoding behavior.

We tried using CharsetDecoder with CodingErrorAction set to REPLACE, REPORT, and IGNORE in Java 8. However, we were unable to produce the same string as Java 7.
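For reference, this is roughly what such an attempt looks like. It is a minimal sketch, not our actual code: the class name `DecoderActions`, the helper `decode`, and the sample bytes are made up for illustration. It only shows what each action does with a single invalid byte; it does not reproduce the Java 7 result for longer malformed sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecoderActions {

    // Decodes bytes as UTF-8, applying the given action to malformed and unmappable input.
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'a', a lone 0xFF byte (never valid in UTF-8), 'b'.
        byte[] bytes = { 'a', (byte) 0xFF, 'b' };
        CodingErrorAction[] actions = {
            CodingErrorAction.REPLACE, CodingErrorAction.IGNORE, CodingErrorAction.REPORT
        };
        for (CodingErrorAction action : actions) {
            try {
                // REPLACE substitutes U+FFFD, IGNORE drops the bad byte.
                System.out.println(action + " -> length " + decode(bytes, action).length());
            } catch (CharacterCodingException e) {
                // REPORT surfaces the malformed input as an exception.
                System.out.println(action + " -> " + e);
            }
        }
    }
}
```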

Can we do this using a technique of reasonable complexity?

java java-8 utf-8 regression

Jiraiya Jun 01 '15 at 13:59

1 answer

Jiraiya · Accepted Answer · 2015-06-02T10:39:11+0000

From the pointers provided by @Holger, it was clear that we had to write our own CharsetDecoder.

I copied the OpenJDK version of the sun.nio.cs.UTF_8 class, renamed it to CustomUTF_8, and used it to build the string like this:

 String output = new String(bytes, new CustomUTF_8());

I plan to run extensive tests, cross-checking the output produced under Java 7 and Java 8. This is a temporary workaround while I fix the actual problem: the HMAC output is being passed directly to a String without being Base64-encoded first.

  String output = new String(Base64.getEncoder().encode(bytes), Charset.forName("UTF-8"));
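Raw HMAC bytes are arbitrary binary data and generally not valid UTF-8, which is why decoding them depends on how the JDK handles malformed input. Base64 text is plain ASCII, so it survives the trip through a String unchanged. Below is a minimal sketch of that round trip. The class and method names are illustrative, and HmacSHA256 with a dummy key is only an assumption, since the actual algorithm and key handling are not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class HmacBase64 {

    // Encodes arbitrary bytes as Base64 text, which is safe to hold in a String.
    static String toText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recovers the original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) throws Exception {
        // Assumed algorithm and dummy key, for illustration only.
        byte[] key = "demo-key".getBytes(StandardCharsets.UTF_8);
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] raw = mac.doFinal("message".getBytes(StandardCharsets.UTF_8));

        String text = toText(raw);
        // The round trip is lossless regardless of the JDK's UTF-8 error handling.
        System.out.println(Arrays.equals(raw, fromText(text)));
    }
}
```

Note that java.util.Base64 exists only from Java 8 onward, so code that must also run on Java 7 would need a different Base64 implementation.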
