UTF-8 and UTF-16 in Java

I expected the byte data below to be displayed differently, but in fact the two arrays come out the same. According to the Wikipedia examples at http://en.wikipedia.org/wiki/UTF-8#Examples , the byte encodings should differ. Why does Java print them as one and the same?

    String a = "€";
    byte[] utf16 = a.getBytes(); // Java default UTF-16
    byte[] utf8 = null;
    try {
        utf8 = a.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
    }
    for (int i = 0; i < utf16.length; i++) {
        System.out.println("utf16 = " + utf16[i]);
    }
    for (int i = 0; i < utf8.length; i++) {
        System.out.println("utf8 = " + utf8[i]);
    }
+6

4 answers

Although Java holds characters internally as UTF-16, when you convert to bytes using String.getBytes(), each character is converted using the platform's default encoding, which will likely be something like windows-1252. The results I get are:

    utf16 = -30
    utf16 = -126
    utf16 = -84
    utf8 = -30
    utf8 = -126
    utf8 = -84

This means that my system's default encoding is UTF-8.

Also note that the documentation for String.getBytes() carries this warning: "The behavior of this method when this string cannot be encoded in the default charset is unspecified."

In general, you will avoid confusion if you always specify the encoding explicitly, as you do with a.getBytes("UTF-8").
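For example, here is a minimal sketch of the explicit approach (the class name Euro is mine, and StandardCharsets requires Java 7+); unlike the string-named overload, the Charset overload throws no checked exception:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Euro {
        public static void main(String[] args) {
            String a = "€";
            // € is U+20AC: 3 bytes in UTF-8, 2 bytes in UTF-16BE.
            byte[] utf8 = a.getBytes(StandardCharsets.UTF_8);     // E2 82 AC
            byte[] utf16 = a.getBytes(StandardCharsets.UTF_16BE); // 20 AC
            System.out.println("default = " + Charset.defaultCharset());
            System.out.println("utf8 length = " + utf8.length + ", utf16 length = " + utf16.length);
        }
    }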

In addition, another thing that can cause confusion is including Unicode characters directly in the source file: String a = "€";. That euro symbol must be encoded as one or more bytes when stored in the file. When Java compiles your program, it sees those bytes and decodes them back into the euro symbol. You have to trust that the software that saved the euro symbol to the file (Notepad, Eclipse, etc.) encoded it the same way Java expects to read it back. UTF-8 is becoming increasingly popular, but it is not universal, and many editors will not write files in UTF-8.
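If you want to sidestep the source-file encoding issue entirely, one option (a standard Java feature, shown as a sketch) is to write the euro sign as a Unicode escape; the escape is pure ASCII, so it survives any common source encoding:

    String a = "\u20AC"; // U+20AC EURO SIGN, same string as "€" but encoding-proof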

+7

Just out of curiosity, I wonder how the JVM knows the default source encoding...

The mechanism the JVM uses to determine the default encoding is platform dependent. On UNIX and UNIX-like systems, it is determined by the LANG and LC_* environment variables; see man locale.
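As a rough illustration (the class name is mine; note also that behavior varies by JDK version, since JDK 18 defaults the charset to UTF-8 regardless of locale):

    import java.nio.charset.Charset;

    public class ShowDefault {
        public static void main(String[] args) {
            // Both reflect the default encoding the JVM picked up at startup.
            System.out.println(Charset.defaultCharset());
            System.out.println(System.getProperty("file.encoding"));
        }
    }

On a UNIX-like system with an older JDK, running LANG=C java ShowDefault versus LANG=en_US.UTF-8 java ShowDefault will typically print different charsets.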


Ermmm. Is that the command used to check what the default encoding is on a particular OS?

That's right. But I mentioned it because the man page describes how the default encoding is determined from the environment variables.

In retrospect, this may not be what you meant by your original comment, but it is how the platform's default encoding is specified. (And the concept of a "default character set" for an individual file does not make sense; see below.)

What if I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16, and after compilation I move the .class files to another OS platform? How does the JVM know their default encoding? Is the default encoding information included in the Java class file?

This is a rather confusing set of questions:

  • A text file does not have a default character set. It has a character set / encoding.

  • A non-text file has no character encoding at all. The concept is meaningless.

  • There is no 100% reliable way to determine what the character encoding of a text file is.

  • If you do not tell the Java compiler what the file encoding is, it assumes the platform default encoding. The compiler does not try to guess. If you get the encoding wrong, the compiler may or may not even notice your mistake. (See the compiler-invocation example after this list.)

  • Bytecode files (.class) are binary files (see 2).

  • When character and string literals are compiled into a ".class" file, they are represented in a way that is not affected by the platform's default encoding or anything else you can influence.

  • If the source file encoding was wrong at compile time, you cannot fix it at the .class file level. The only option is to go back and recompile the classes, telling the Java compiler the correct source file encoding.

  • "What if I say that I have 10 Java source files, half of them are saved as UTF-8, and the rest are saved as UTF-16."
    Just don't do it!

    • Do not save source files in a mixture of encodings. You are just asking for trouble.
    • I find it hard to believe that anyone stores source files in UTF-16 at all ...
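For instance, with a hypothetical source file Main.java saved as UTF-8, you can tell the compiler the encoding explicitly (the -encoding flag is a standard javac option):

    javac -encoding UTF-8 Main.java

After that, the string literals in Main.class are correct no matter what the platform default encoding happens to be.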

So I am confused: when people say "platform dependent", does that relate to the source file?

Platform dependent means that it potentially depends on the operating system, the JVM vendor and version, the hardware, and so on.

This is not necessarily related to the source file. (The encoding of any given source file may differ from the default encoding.)

If not, how do you explain the behavior above? In any case, that confusion extends my question to: "So what happens after I compile the source file into a class file? The class file may not contain any encoding information, so does the result now really depend on the 'platform' rather than on the source file?"

A platform-specific mechanism (such as environment variables) determines what the Java compiler sees as the default character set. Unless you override it (for example, by passing options to the Java compiler on the command line), that is what the compiler will use as the source file's character set. However, it may not be the correct encoding for your source files; for example, if you created them on another machine with a different default character set. And if the Java compiler uses the wrong character set to decode your source files, it can put the wrong character codes into the ".class" files.

The ".class" files are platform independent. But if they were created incorrectly because you did not tell the Java compiler the correct encoding for the source files, the .class files will contain the wrong characters.


What do you mean by "the concept of a 'default character set' for a single file is meaningless"?

I say this because it is true!

The default character set means the character set that is used when you do not specify it.

But we can control how the text file gets saved, right? Even in Notepad, you can choose between encodings.

That's right. And that is you TELLING Notepad which character set to use for the file. If you do not say, Notepad will use the default character set to write the file.

Notepad uses a bit of black magic to guess what the character encoding is when it reads a text file. Basically, it looks at the first few bytes of the file to see whether it starts with a byte order mark (BOM). If it sees one, it can heuristically distinguish between UTF-16, UTF-8 (as generated by Microsoft) and "other". But it cannot distinguish between the different "other" character encodings, and it does not recognize a UTF-8 file that does not start with a BOM. (A BOM in a UTF-8 file is standard practice for Microsoft ... and causes problems if a Java application reads the file and does not know to skip the BOM character.)
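To illustrate that last point, here is a minimal sketch (the file and class names are mine) of how a Java program could skip a leading UTF-8 BOM, since the standard library will not strip it for you:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class SkipBom {
        public static void main(String[] args) throws IOException {
            // "input.txt" is a hypothetical file that may or may not begin with a BOM.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
                in.mark(1);
                if (in.read() != 0xFEFF) { // first char is not a BOM: rewind
                    in.reset();
                }
                System.out.println(in.readLine());
            }
        }
    }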

In any case, the problems do not arise when the source file is written. They arise when the Java compiler reads the source file using the wrong character encoding.

+4

You are working from a bad hypothesis. The getBytes() method does not use the UTF-16 encoding; it uses the platform's default encoding.

You can query it with the java.nio.charset.Charset.defaultCharset() method. In my case it is UTF-8, and it is most likely the same for you too.

+3

By default, either UTF-8 or ISO-8859-1 is used if no specific platform encoding is found, but never UTF-16. So in the end you did the byte conversion in UTF-8 both times; that is why your byte[] arrays match. You can find the default encoding using:

  System.out.println(Charset.defaultCharset().name()); 
+1

Source: https://habr.com/ru/post/927986/

