Java Strings Character Encoding - For French - Dutch

I have the following code snippet

    import java.io.UnsupportedEncodingException;
    import java.nio.charset.Charset;

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(Charset.defaultCharset().toString());
        String accentedE = "é";
        String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
        System.out.println(utf8);
        utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
        System.out.println(utf8);
        utf8 = new String(accentedE.getBytes("utf-8"));
        System.out.println(utf8);
        utf8 = new String(accentedE.getBytes());
        System.out.println(utf8);
    }

The output is as follows:

    windows-1252
    é
    ?
    Ã©
    é

Can someone help me figure out what is going on here? Why does it produce this output?

3 answers

If you already have a String, there is no need to encode it and decode it straight back; the string is already the result of someone having decoded raw bytes.

In the case of a string literal, that someone is the compiler, which reads your source as raw bytes and decodes it with the encoding you specify. If you physically saved the source file encoded as Windows-1252 and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding when compiling the source (for example with javac -encoding).
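Incidentally, one way to sidestep the source-file encoding question for a single literal is a Unicode escape, which the compiler reads the same way no matter how the file was saved (a minimal illustration):

    // \u00E9 is the code point of é; Unicode escapes are plain ASCII,
    // so this literal survives any source-file encoding.
    String accentedE = "\u00e9";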

The line

 String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8")); 

does absolutely nothing (encode as UTF-8, decode as UTF-8 == no-op).
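A quick way to convince yourself (a minimal sketch; StandardCharsets is available since Java 7):

    import java.nio.charset.StandardCharsets;

    String accentedE = "é";
    // Encode to UTF-8 and decode straight back: an equal String.
    String roundTripped = new String(
            accentedE.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    System.out.println(accentedE.equals(roundTripped));  // true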

The line

 utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8")); 

encodes the string as Windows-1252 and then decodes it as UTF-8. The result must be decoded as Windows-1252 (because it was encoded as Windows-1252, duh), otherwise you will get strange results.
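Here is a sketch of that failure at the byte level (0xE9 is é in windows-1252):

    import java.nio.charset.StandardCharsets;

    byte[] windows1252Bytes = { (byte) 0xE9 };  // "é" encoded as windows-1252
    // 0xE9 on its own is not a valid UTF-8 sequence, so the decoder
    // substitutes U+FFFD, which a windows-1252 console renders as '?'.
    String decoded = new String(windows1252Bytes, StandardCharsets.UTF_8);
    System.out.println(decoded);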

The line

 utf8 = new String(accentedE.getBytes("utf-8")); 

encodes the string as UTF-8 and then decodes it as Windows-1252. The same principle applies as in the previous case.
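The byte-level view of this case, with the default charset written out explicitly so the result is reproducible on any platform (C3 A9 is é in UTF-8):

    import java.nio.charset.Charset;

    byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };  // "é" encoded as UTF-8
    // Each byte is decoded as a separate windows-1252 character.
    String decoded = new String(utf8Bytes, Charset.forName("windows-1252"));
    System.out.println(decoded);  // prints Ã© (mojibake)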

The line

 utf8 = new String(accentedE.getBytes()); 

does absolutely nothing (encode as Windows-1252, decode as Windows-1252 == no-op).

An analogy with integers that might be easier to understand:

    int a = 555;
    // The case of encoding as X and decoding right back as X
    a = Integer.parseInt(String.valueOf(a), 10);  // a is still 555

    int b = 555;
    // The case of encoding as X and decoding back as Y
    b = Integer.parseInt(String.valueOf(b), 15);  // b is now 1205, i.e. a strange result

Both conversions are useless, because we already had what we needed before doing any of that: the integer 555.

You need to encode your string into raw bytes when it leaves your system, and you need to decode raw bytes into a string when they enter your system. There is no need to encode and decode back and forth within the system.
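In practice that means being explicit at the boundaries, for example when writing to and reading from a file (a minimal sketch to run inside a main that declares throws IOException; the file name is made up):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    Path file = Paths.get("accents.txt");  // hypothetical file name

    // Leaving the system: encode the String into raw bytes, explicitly.
    Files.write(file, "é".getBytes(StandardCharsets.UTF_8));

    // Entering the system: decode the raw bytes back into a String, explicitly.
    String read = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    System.out.println(read);  // é, on any platform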


Output line 1 - the default character set on your system is windows-1252.

Output line 2 - you created a String by encoding the string literal into UTF-8 bytes and then decoding them with the UTF-8 scheme. The result is a correctly formed String, which can be displayed correctly using the windows-1252 encoding.

Output line 3 - you created a String by encoding the string literal as windows-1252 and then decoding it with UTF-8. The UTF-8 decoder detected a sequence that cannot be valid UTF-8 and replaced the offending character with a question mark "?". (In UTF-8, any byte with the high bit set to 1 must be part of a multi-byte sequence, but the windows-1252 encoding of é is a single such byte on its own ... ergo, this is bad UTF-8.)

Output line 4 - you created a String encoded in UTF-8 and then decoded it as windows-1252. In this case the decoding did not "fail", but it produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2-byte sequence.

Output line 5 - you created a String encoded as windows-1252 and decoded it as windows-1252. This produces the correct output.


The general lesson is that if you encode characters to bytes with one character encoding and then decode them with a different character encoding, you are likely to get some form of distortion.
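Note that the distortion is not always reversible: once a decoder has substituted the replacement character, the original byte is gone for good (a sketch, assuming a windows-1252 platform as in the question):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    Charset windows1252 = Charset.forName("windows-1252");

    // Encode as windows-1252, mistakenly decode as UTF-8 ...
    String damaged = new String("é".getBytes(windows1252), StandardCharsets.UTF_8);

    // ... and the é is now U+FFFD; no re-encoding can recover it.
    System.out.println(damaged.codePointAt(0) == 0xFFFD);  // true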


When you call the String getBytes() method, its documentation says:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

Therefore, whenever you do:

 accentedE.getBytes() 

it encodes the contents of the accentedE String into bytes using the default OS code page, in your case cp-1252.
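In other words, the no-argument call is equivalent to passing the default charset explicitly (a quick check):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    String accentedE = "é";
    boolean same = Arrays.equals(
            accentedE.getBytes(),
            accentedE.getBytes(Charset.defaultCharset()));
    System.out.println(same);  // true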

This line:

 new String(accentedE.getBytes(), Charset.forName("UTF-8")) 

takes the accentedE bytes (encoded in cp-1252) and tries to decode them as UTF-8, hence the error. The reverse situation occurs with:

 new String(accentedE.getBytes("utf-8")) 

Here getBytes encodes accentedE into UTF-8 bytes, but then the String constructor decodes those bytes using the default OS code page, cp-1252. The constructor's documentation says:

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset and therefore may not be equal to the length of the byte array.
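The last sentence of that quote is easy to demonstrate: the two UTF-8 bytes C3 A9 decode to a single character:

    import java.nio.charset.StandardCharsets;

    byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };  // UTF-8 for "é"
    String s = new String(utf8Bytes, StandardCharsets.UTF_8);
    System.out.println(utf8Bytes.length);  // 2
    System.out.println(s.length());        // 1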

I highly recommend reading this wonderful article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

UPDATE:

In short, each character is stored as a number. To know which character a given number represents, the OS uses code pages. Consider the following snippet:

 String accentedE = "é"; System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0])); System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1])); System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0])); 

which outputs:

    C3
    A9
    E9

This is because the small accented e is stored in UTF-8 as the two bytes C3 A9, and in cp-1252 as the single byte E9. See the linked article for a detailed explanation.
