Why does Java's CharsetEncoder define .onMalformedInput() and CharsetDecoder define .onUnmappableCharacter()?

Question

A CharsetDecoder basically decodes a sequence of bytes into a sequence of chars (see Charset#newDecoder()). Conversely, a CharsetEncoder (see Charset#newEncoder()) takes a sequence of chars and encodes it into a sequence of bytes.
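As a minimal sketch of the two directions (the class name RoundTripDemo and its helper methods are made up for illustration), the following encodes a string to bytes and decodes it back using the default error actions:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class RoundTripDemo {
    // chars -> bytes
    static ByteBuffer encode(Charset cs, String text) throws CharacterCodingException {
        return cs.newEncoder().encode(CharBuffer.wrap(text));
    }

    // bytes -> chars
    static String decode(Charset cs, ByteBuffer bytes) throws CharacterCodingException {
        return cs.newDecoder().decode(bytes).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer bytes = encode(StandardCharsets.UTF_8, "\u00F6"); // the letter o with diaeresis
        System.out.println(bytes.remaining());                      // prints 2 (UTF-8 bytes C3 B6)
        System.out.println(decode(StandardCharsets.UTF_8, bytes).equals("\u00F6")); // prints true
    }
}
```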

CharsetDecoder defines .onMalformedInput(), which seems logical (some byte sequences cannot be translated into a valid char sequence); but why .onUnmappableCharacter(), since its input is a byte sequence?

Similarly, CharsetEncoder defines .onUnmappableCharacter(), which is again logical (for example, if your encoding is ASCII, you cannot encode ö); but why does it also define .onMalformedInput(), since its input is a sequence of chars?
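The "logical" encoder case can be sketched as follows. UnmappableEncodeDemo and tryEncode are names invented for this illustration; the encoder is configured to REPORT both kinds of error so that the exception type shows which one occurred:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical demo class, not part of the original question.
class UnmappableEncodeDemo {
    // Returns a short label describing how a strict encoder reacted to the text.
    static String tryEncode(Charset cs, String text) {
        CharsetEncoder encoder = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return "ok";
        } catch (UnmappableCharacterException e) {
            return "unmappable";   // valid character, but the charset has no bytes for it
        } catch (MalformedInputException e) {
            return "malformed";    // the char sequence itself is not legal UTF-16
        } catch (CharacterCodingException e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryEncode(StandardCharsets.US_ASCII, "o"));      // prints ok
        System.out.println(tryEncode(StandardCharsets.US_ASCII, "\u00F6")); // prints unmappable
    }
}
```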

This is all the more intriguing because you cannot obtain an encoder from a decoder or vice versa, and the two classes do not share a common ancestor.

EDIT 1

Indeed, you can trigger .onMalformedInput() on a CharsetEncoder. You just need to provide an illegal char or char sequence. The program below relies on the fact that in UTF-16, a high surrogate must be followed by a low surrogate; here, a two-element char array containing two high surrogates is created instead, and an attempt is made to encode it. NOTE how creating a String from such an ill-formed char sequence does not throw an exception at all:

code:

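    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public static void main(final String... args)
        throws CharacterCodingException
    {
        // Find the first high surrogate among all 16-bit char values
        boolean found = false;
        char c = '.';
        for (int i = 0; i < 65536; i++) {
            if (Character.isHighSurrogate((char) i)) {
                c = (char) i;
                found = true;
                break;
            }
        }
        if (!found)
            throw new IllegalStateException();
        System.out.println("found: " + Integer.toHexString(c));

        // Two high surrogates in a row: not a legal UTF-16 sequence
        final char[] foo = { c, c };
        new String(foo); // <-- DOES NOT THROW AN EXCEPTION!!!

        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT);
        encoder.encode(CharBuffer.wrap(foo));
    }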

Output:

    found: d800
    Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:798)
        at com.github.fge.largetext.LargeText.main(LargeText.java:166)

EDIT 2

But now, how about the opposite case? From @Kairos' answer below, quoting the Javadoc:

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT

Now, what does "cannot be mapped to an equivalent character sequence" mean?

I have played quite a bit with CharsetDecoder in this project and have yet to produce such an error. I know how to reproduce an error where, for example, you only have two bytes of a three-byte UTF-8 sequence, but this triggers a MalformedInputException. All you have to do in that case is restart decoding from the last known position of the ByteBuffer.
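The truncated-sequence case can be sketched like this, assuming the euro sign (UTF-8 bytes E2 82 AC) as the three-byte sequence. TruncatedUtf8Demo and both method names are invented for the illustration. The first method treats the input as complete, so the dangling two bytes are rejected; the second passes endOfInput = false, which leaves the incomplete bytes unread so that decoding can resume once the rest arrives:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class TruncatedUtf8Demo {
    private static CharsetDecoder strictDecoder() {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // True if the bytes, taken as the whole input, are rejected as malformed.
    static boolean isMalformedWhenComplete(byte[] bytes) {
        try {
            strictDecoder().decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (MalformedInputException e) {
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes input that arrives in two chunks, resuming after the first one.
    static String decodeInTwoChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = strictDecoder();
        ByteBuffer in = ByteBuffer.allocate(first.length + second.length);
        // UTF-8 never yields more chars than input bytes, so this is large enough
        CharBuffer out = CharBuffer.allocate(first.length + second.length);

        in.put(first);
        in.flip();
        decoder.decode(in, out, false); // more input may follow: trailing partial bytes stay unread
        in.compact();                   // keep the unread bytes, make room for the next chunk

        in.put(second);
        in.flip();
        decoder.decode(in, out, true);  // this is the last chunk
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] head = { (byte) 0xE2, (byte) 0x82 };
        byte[] tail = { (byte) 0xAC };
        System.out.println(isMalformedWhenComplete(head));                  // prints true
        System.out.println(decodeInTwoChunks(head, tail).equals("\u20AC")); // prints true
    }
}
```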

Triggering an UnmappableCharacterException would basically mean that the character encoding itself allows you to generate an illegal char, or an illegal Unicode code point.

Is this even possible?

java character-encoding

fge Apr 05 '14 at 20:53

1 answer

Hypothetical inthe clavicle · Accepted Answer · 2014-04-05T21:16:22+0000

The docs for CharsetEncoder.encode() state that it throws a MalformedInputException:

If the character sequence starting at the input buffer's current position is not a legal sixteen-bit Unicode sequence and the current malformed-input action is CodingErrorAction.REPORT

So you are given the opportunity to supply a CodingErrorAction through onMalformedInput, and that action is carried out whenever the encoder encounters one of these illegal sixteen-bit Unicode sequences.
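To illustrate the choice of action, here is a small sketch that encodes the same ill-formed input (a lone high surrogate between two letters) under each of the three standard actions. EncoderActionsDemo and encodeWith are names made up for this example; on failure the method returns the exception's simple class name instead of text:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class EncoderActionsDemo {
    // Encodes text as UTF-8 with the given action, then decodes the bytes back for display.
    static String encodeWith(CodingErrorAction action, String text) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.UTF_8.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // 'a', a high surrogate with no low surrogate after it, 'b'
        String illFormed = new String(new char[] { 'a', '\uD800', 'b' });
        System.out.println(encodeWith(CodingErrorAction.REPORT, illFormed));  // prints MalformedInputException
        System.out.println(encodeWith(CodingErrorAction.REPLACE, illFormed)); // prints a?b
        System.out.println(encodeWith(CodingErrorAction.IGNORE, illFormed));  // prints ab
    }
}
```

REPLACE substitutes the encoder's replacement bytes (a single '?' by default for the UTF-8 encoder), while IGNORE silently drops the offending char.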

Similarly for CharsetDecoder.decode():

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT
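One way to look for a decoder that actually reports this is a small probe like the one below. It is only a sketch: DecoderProbe and probe are invented names, and the choice of windows-1252 with byte 0x81 is an assumption (some legacy single-byte charsets leave certain byte values unassigned, but whether a given JDK ships that charset and how its decoder classifies such a byte is not guaranteed here). For that reason the output of the first call is deliberately not stated:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical probe class, not part of the original answer.
class DecoderProbe {
    // Returns "unsupported", "ok", "unmappable", "malformed" or "other".
    static String probe(String charsetName, byte[] input) {
        if (!Charset.isSupported(charsetName)) {
            return "unsupported";
        }
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return "ok";
        } catch (UnmappableCharacterException e) {
            return "unmappable";
        } catch (MalformedInputException e) {
            return "malformed";
        } catch (CharacterCodingException e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        // Assumption: 0x81 may be unassigned in windows-1252; result depends on the JDK
        System.out.println(probe("windows-1252", new byte[] { (byte) 0x81 }));
        // Truncated UTF-8 sequence, for comparison
        System.out.println(probe("UTF-8", new byte[] { (byte) 0xE2, (byte) 0x82 })); // prints malformed
    }
}
```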
