Some compilers fail on non-ASCII characters in JavaDoc and source code comments.
This is probably because the compiler assumes the input is UTF-8 and the source file contains invalid UTF-8 sequences. The fact that the offending characters appear in comments in your source editor does not matter, because the lexer (the phase that distinguishes comments from other tokens) never gets to run: the failure occurs when the tool tries to convert bytes to characters, before lexing begins.
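To see why the failure happens at the byte-to-character step, here is a minimal sketch (the class name DecodeDemo is made up) that strictly decodes a Latin-1 byte as UTF-8, the way a compiler assuming UTF-8 input would:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class DecodeDemo {
        public static void main(String[] args) {
            // 0xE9 is 'é' in ISO-8859-1, but it is not a valid UTF-8 sequence on its own.
            byte[] latin1Bytes = { '/', '/', ' ', (byte) 0xE9 };
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(latin1Bytes));
            } catch (CharacterCodingException e) {
                // Decoding fails here, before a lexer could ever see the "comment".
                System.out.println("Decoding failed: " + e);
            }
        }
    }

The decoder throws on the stray byte before anything resembling a comment could be recognized, which is the same kind of error the compiler surfaces.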
The man pages for javac and javadoc say
-encoding name Specifies the source file encoding name, such as EUCJIS/SJIS. If this option is not specified, the platform default converter is used.
so run javadoc with the -encoding flag:
javadoc -encoding <encoding-name> ...
Replacing <encoding-name> with the encoding you used for your source files forces javadoc to use the correct encoding.
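For example, if your sources are saved as Latin-1 (Main.java and the docs output directory are placeholders for illustration):

javadoc -encoding ISO-8859-1 -d docs Main.java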
If you have several different encodings used across a group of source files that you need to build together, you need to fix that first and settle on a single, uniform encoding for all source files. You should really either use UTF-8 or stick to ASCII.
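If some files are in a legacy encoding, a one-off conversion with iconv (assuming it is available; the file names are placeholders) might look like:

iconv -f ISO-8859-1 -t UTF-8 OldEncoding.java > Utf8Version.java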
What is the current (Java 7) and future (Java 8 and later) practice regarding Unicode in Java source files?
The algorithm for processing a Java source file is:
- Get the bytes.
- Convert the bytes to characters (UTF-16 code units) using some encoding.
- Replace every sequence of '\\' 'u' followed by four hexadecimal digits with the UTF-16 code unit those digits denote; it is an error if a "\u" is not followed by four hexadecimal digits (a sketch of this step follows the list).
- Lex the characters into tokens.
- Parse the tokens into classes.
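As a rough illustration of the escape-replacement step, here is a simplified sketch (the helper name replaceUnicodeEscapes is made up; it also ignores the JLS §3.3 rule that only a backslash preceded by an even number of backslashes starts an escape):

    // Simplified sketch of step 3: replace \u (with one or more u's, as the
    // JLS allows) followed by four hex digits with that UTF-16 code unit.
    static String replaceUnicodeEscapes(String chars) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c == '\\' && i + 1 < chars.length() && chars.charAt(i + 1) == 'u') {
                int j = i + 2;
                while (j < chars.length() && chars.charAt(j) == 'u') j++; // \uuu0061 is also legal
                if (j + 4 > chars.length()) {
                    throw new IllegalArgumentException("\\u not followed by four hex digits");
                }
                // parseInt throws NumberFormatException if the digits are not hex.
                out.append((char) Integer.parseInt(chars.substring(j, j + 4), 16));
                i = j + 3; // skip past the escape; the loop's i++ moves to the next character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }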
Current and past practice is that step 2, converting bytes to UTF-16 code units, depends on the tool that loads the compilation unit (source file), but the de facto standard for command line tools is the -encoding flag.
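For example, to compile a source file saved as UTF-8 (the file name Main.java is a placeholder):

javac -encoding UTF-8 Main.java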
After this conversion, the language mandates that escape sequences of the form \uABCD are converted to UTF-16 code units (step 3) before lexing and parsing.
For example:
int a; \u0061 = 42;
is a valid pair of Java statements. Any Java source code tool must, after converting bytes to characters but before parsing, look for \uABCD sequences and convert them, so that this code becomes
int a; a = 42;
before parsing. This happens regardless of where the \uABCD sequence occurs.
This process looks something like this:
- Get the bytes:
  [105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49, 32, 61, 32, 52, 50, 59]
- Convert the bytes to characters:
  ['i', 'n', 't', ' ', 'a', ';', '\n', '\\', 'u', '0', '0', '6', '1', ' ', '=', ' ', '4', '2', ';']
- Replace the Unicode escapes:
  ['i', 'n', 't', ' ', 'a', ';', '\n', 'a', ' ', '=', ' ', '4', '2', ';']
- Lex:
  ["int", "a", ";", "a", "=", "42", ";"]
- Parse:
  (Block (Variable (Type int) (Identifier "a")) (Assign (Reference "a") (Int 42)))
Should all non-ASCII characters be escaped in JavaDoc using HTML &...; entities?
There is no need to, except for HTML-special characters such as '<' that you want to appear literally in the documentation. You can use \uABCD escapes in javadoc comments; Java processes \u.... escapes before parsing the source file, so they can appear inside string literals, comments, anywhere really. That is why
System.out.println("Hello, world!\u0022);
is a valid Java statement: \u0022 is the double quote character, so it closes the string literal before the closing parenthesis.
/** @return \u03b8 in radians */
is equivalent to
/** @return θ in radians */
as far as javadoc is concerned.
But what would be the equivalent of a Java // comment?
You can use // comments in Java, but javadoc only looks at /**...*/ comments for documentation; // comments carry no documentation metadata.
One consequence of Java's handling of \uABCD sequences is that although
// Comment text.\u000A System.out.println("Not really comment text");
looks like a single-line comment, and many IDEs will highlight it as such, it is not.
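A small test class (the name EscapeDemo is made up) shows this: the \u000A is translated to a line feed before lexing, so the // comment ends at the escape and the call that follows is compiled as live code.

    public class EscapeDemo {
        public static void main(String[] args) {
            // Comment text.\u000A System.out.println("Not really comment text");
        }
    }

Compiling and running this prints "Not really comment text".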