A CharsetDecoder basically decodes a byte sequence into a char sequence (see Charset#newDecoder() ). A CharsetEncoder (see Charset#newEncoder() ) does the opposite: it takes a char sequence and encodes it into a byte sequence.
CharsetDecoder defines .onMalformedInput() , which seems logical (some byte sequences simply cannot be translated into a valid char sequence); but why does it also define .onUnmappableCharacter() , since its input is a byte sequence?
Similarly, CharsetEncoder defines .onUnmappableCharacter() , which is also logical here (for instance, if your encoding is ASCII, you cannot encode ö); but why does it also define .onMalformedInput() , since its input is a sequence of chars?
This is all the more intriguing since you cannot obtain an encoder from a decoder or vice versa, and the two classes share no common ancestor...
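For reference, here is a minimal, self-contained sketch (not from the original post; the class name AsciiEncodeDemo is made up) showing how the two error actions are configured, using the ASCII example above: an ASCII encoder set to CodingErrorAction.REPORT rejects 'ö':

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class AsciiEncodeDemo {
    public static void main(final String... args) {
        // an ASCII encoder configured to report, rather than replace or ignore, both error types
        final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // 'ö' (U+00F6) has no mapping in ASCII
            encoder.encode(CharBuffer.wrap("\u00f6"));
        } catch (final CharacterCodingException e) {
            // an UnmappableCharacterException is reported here
            System.out.println(e);
        }
    }
}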
EDIT 1
It turns out that a CharsetEncoder can indeed run into malformed input (the case .onMalformedInput() governs): you just need to feed it an ill-formed char sequence. The program below relies on the fact that in UTF-16, a high surrogate must be followed by a low surrogate; it builds a two-element char array containing two high surrogates in a row and tries to encode it. Note how constructing a String from such an ill-formed char sequence does not complain at all:
code:
// requires java.nio.CharBuffer, java.nio.charset.StandardCharsets
public static void main(final String... args)
    throws CharacterCodingException
{
    // find the first char which is a high surrogate
    boolean found = false;
    char c = '.';
    for (int i = 0; i < 65536; i++) {
        if (Character.isHighSurrogate((char) i)) {
            c = (char) i;
            found = true;
            break;
        }
    }
    if (!found)
        throw new IllegalStateException();
    System.out.println("found: " + Integer.toHexString(c));

    // two high surrogates in a row: an ill-formed UTF-16 sequence,
    // yet the String constructor accepts it without complaint
    final char[] foo = { c, c };
    new String(foo);

    // encoding it, however, fails (see output below);
    // these closing lines were missing from the original snippet, and the
    // charset here is an assumption: any encoder reports the bad surrogate pair
    StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(foo));
}
Output:
found: d800
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:798)
    at com.github.fge.largetext.LargeText.main(LargeText.java:166)
EDIT 2
But what about the other way around? The javadoc quoted in @Kairos's answer below states:
UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT
Now, what "cannot be matched with an equivalent sequence of characters"?
I have played with CharsetDecoder quite a bit in this project and have never managed to trigger such an error. I do know how to trigger the case where, for example, you only have the first two bytes of a three-byte UTF-8 sequence, but that raises a MalformedInputException (see the sketch just below); all you have to do in that case is resume decoding from the last known good ByteBuffer position.
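To illustrate that scenario (a minimal sketch, not taken from the project; the class name and the choice of '€' as the sample character are mine), feed a UTF-8 decoder only the first two bytes of a three-byte sequence:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class TruncatedUtf8Demo {
    public static void main(final String... args) {
        // '€' (U+20AC) is the three bytes e2 82 ac in UTF-8; keep only the first two
        final ByteBuffer truncated = ByteBuffer.wrap(new byte[] { (byte) 0xe2, (byte) 0x82 });
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(truncated);
        } catch (final CharacterCodingException e) {
            // a MalformedInputException is reported, not an UnmappableCharacterException
            System.out.println(e);
        }
    }
}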
An UnmappableCharacterException from a decoder would basically mean that the charset itself lets you build an illegal char, or an invalid Unicode code point.
Is this even possible?