Why can't some ASCII characters be expressed in the form "\uXXXX" in Java source code?

Question

I came across this (again) today:

class Test {
    char ok = '\n';
    char okAsWell = '\u000B';
    char error = '\u000A';
}

It does not compile:

Invalid character constant on line 4.

The compiler seems to insist that I write '\n' instead. I see no reason for this, and it is very annoying.

Is there a logical explanation why characters that have a dedicated escape sequence (e.g. \t, \n, \r) must be written in that form in Java source?

+57

java

Durandal Mar 07 '13 at 16:05

5 answers

Unicode escape sequences, such as \u000a, are replaced with the actual characters they represent before the Java compiler does anything else with the source code. So your program ends up as

 char ch = '
 ';

Thus, \u000a in the source code is replaced internally with a newline character. Note that this happens before the compiler actually reads and interprets your source code.

Quoting the Java Language Specification:

It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.

And, as everyone knows by heart, \n is a line terminator. Quoting:

  LineTerminator:
      the ASCII LF character, also known as "newline"
      the ASCII CR character, also known as "return"
      the ASCII CR character followed by the ASCII LF character

Other characters can cause similar problems when written as Unicode escapes, such as \ , ' and " .
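
As a small sketch of the alternatives (the class and constant names are made up for illustration), each of these characters can be written with its escape sequence or given by numeric value. Both forms survive the early Unicode translation because they are only interpreted later, by the lexer:

```java
// Hypothetical demo class; the names are illustrative only.
class SafeCharLiterals {
    // Escape sequences are interpreted by the lexer, after Unicode escapes
    // have already been translated, so they are safe inside literals.
    static final char LINE_FEED = '\n';
    static final char CARRIAGE_RETURN = '\r';
    static final char BACKSLASH = '\\';
    static final char SINGLE_QUOTE = '\'';
    static final char DOUBLE_QUOTE = '"';

    // A numeric value avoids the character literal altogether.
    static final char LINE_FEED_BY_VALUE = (char) 0x0A;
    // An octal escape is another literal form for the same character.
    static final char LINE_FEED_OCTAL = '\12';

    public static void main(String[] args) {
        System.out.println(LINE_FEED == LINE_FEED_BY_VALUE); // prints true
        System.out.println(LINE_FEED == LINE_FEED_OCTAL);    // prints true
        System.out.println((int) BACKSLASH);                 // prints 92
    }
}
```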

+23

poitroae Mar 07 '13 at 16:13

I think the reason is that \uXXXX escapes are expanded while the code is being lexed; see JLS §3.2, Lexical Translations.

+4

NPE Mar 07 '13 at 16:14

This is described in §3.3, Unicode Escapes: http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html . Javac first finds \uXXXX sequences in the .java file and replaces them with the actual characters, and only then compiles. In

 char error = '\u000A';

\u000A is replaced with the newline character (code 10), so the actual text becomes

 char error = '
 ';
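
The replacement step can be modelled roughly as follows. This is a simplified sketch, not javac's actual code: the class and method names are hypothetical, and a malformed escape is passed through unchanged here, whereas the real compiler reports an error.

```java
// A simplified, illustrative model of the Unicode-escape pre-pass.
// Class and method names are hypothetical; this is not javac's code.
class UnicodeEscapeTranslator {

    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes immediately before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\') {
                // A backslash can start an escape only after an even number
                // of contiguous backslashes.
                if (backslashRun % 2 == 0) {
                    int j = i + 1;
                    while (j < source.length() && source.charAt(j) == 'u') {
                        j++; // the marker letter may be repeated
                    }
                    if (j > i + 1 && hasFourHexDigits(source, j)) {
                        out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                        i = j + 4;
                        backslashRun = 0;
                        continue;
                    }
                }
                backslashRun++;
            } else {
                backslashRun = 0;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean hasFourHexDigits(String s, int start) {
        if (start + 4 > s.length()) {
            return false;
        }
        for (int k = start; k < start + 4; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'f')
                    || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as plain text in this string.
        String line = "char error = '\\u000A';";
        String translated = translate(line);
        System.out.println(translated.indexOf('\n') >= 0); // prints true
    }
}
```

After this pass the character literal contains a real line feed, which is why the lexer then rejects it.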

+4

Evgeniy Dorofeev Mar 07 '13 at 16:23

Because the compiler treats them the same way as unescaped text.

This is valid code:

  class \u00C9 {}
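
To make the point a little more concrete, here is a minimal sketch (all names are invented for the example) in which escapes appear in a class name and in a field name. It compiles because the escapes are translated before identifiers are recognised:

```java
// Hypothetical demo; all names are made up for illustration.
class EscapedIdentifierDemo {
    // The escape below becomes the letter E with an acute accent
    // before the lexer sees the identifier.
    static class \u00C9 {}

    // The escape for the letter 'a' followed by "bc" spells the identifier abc.
    static int \u0061bc = 42;

    static int readField() {
        return abc; // the same field as declared on the line above
    }

    static boolean classNameMatches() {
        // The class name and the string literal contain the same character.
        return \u00C9.class.getSimpleName().equals("\u00C9");
    }

    public static void main(String[] args) {
        System.out.println(readField());        // prints 42
        System.out.println(classNameMatches()); // prints true
    }
}
```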

+2

McDowell Mar 07 '13 at 16:13

assylias · Accepted Answer · 2013-03-07 16:12

Unicode escapes are replaced by the characters they represent, so the compiler turns your line into:

 char error = '
 ';

which is not a valid Java statement.

This is dictated by the Language Specification:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
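
The remark about supplementary characters can be illustrated with a short sketch. The class name and the choice of U+1F600 are just for the example; the point is that one code point outside the Basic Multilingual Plane takes two escapes, one per UTF-16 code unit:

```java
// Hypothetical demo class; U+1F600 is an arbitrary supplementary character.
class SupplementaryEscapeDemo {
    // Written as a surrogate pair: two consecutive Unicode escapes.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(GRINNING_FACE.length());             // prints 2
        System.out.println(GRINNING_FACE.codePointCount(0, 2)); // prints 1
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0))); // prints 1f600
    }
}
```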

This can lead to unexpected results. For example, this is a valid Java program (it contains hidden Unicode characters), courtesy of Peter Lawrey:

 public static void main(String[] args) {
     for (char c⁯‮h = 0; c⁯‮h < Character.MAX_VALUE; c⁯‮h++) {
         if (Character.isJavaIdentifierPart(c⁯‮h) && !Character.isJavaIdentifierStart(c⁯‮h)) {
             System.out.printf("%04x <%s>%n", (int) c⁯‮h, "" + c⁯‮h);
         }
     }
 }
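
Why invisible characters are accepted inside an identifier can be checked with the same Character methods the program uses. The sketch below assumes the hidden characters are format characters such as U+202E (right-to-left override); the class and method names are hypothetical:

```java
// Illustrative check; treating U+202E as one of the hidden characters
// is an assumption made for this example.
class HiddenIdentifierChars {
    // Given by numeric value so that no invisible character appears in this source.
    static final char RIGHT_TO_LEFT_OVERRIDE = (char) 0x202E;

    // True for characters that may appear in an identifier but not start one.
    static boolean allowedOnlyAfterStart(char c) {
        return Character.isJavaIdentifierPart(c) && !Character.isJavaIdentifierStart(c);
    }

    public static void main(String[] args) {
        System.out.println(allowedOnlyAfterStart(RIGHT_TO_LEFT_OVERRIDE));            // prints true
        System.out.println(Character.isIdentifierIgnorable(RIGHT_TO_LEFT_OVERRIDE)); // prints true
        System.out.println(allowedOnlyAfterStart('a'));                               // prints false
    }
}
```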
