Some compilers fail on non-ASCII characters in JavaDoc and source code comments.
This is probably because the compiler assumes the input is UTF-8 and the source file contains invalid UTF-8 sequences. The fact that the offending characters appear in comments in your source editor does not matter, because the lexer (the phase that distinguishes comments from other tokens) never gets to run: the failure occurs when the tool tries to convert bytes to characters, before lexing begins.
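To see why the failure happens at the byte-to-character step, here is a minimal sketch (the class name DecodeDemo is made up) that strictly decodes a Latin-1 byte as UTF-8, the way a compiler assuming UTF-8 input would:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class DecodeDemo {
        public static void main(String[] args) {
            // 0xE9 is 'é' in ISO-8859-1, but it is not a valid UTF-8 sequence on its own.
            byte[] latin1Bytes = { '/', '/', ' ', (byte) 0xE9 };
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(latin1Bytes));
            } catch (CharacterCodingException e) {
                // Decoding fails here, before a lexer could ever see the "comment".
                System.out.println("Decoding failed: " + e);
            }
        }
    }

The decoder throws on the stray byte before anything resembling a comment could be recognized, which is the same kind of error the compiler surfaces.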
The man pages for javac and javadoc say
-encoding name Specifies the source file encoding name, such as EUCJIS/SJIS. If this option is not specified, the platform default converter is used.
so run javadoc with the -encoding flag:
javadoc -encoding <encoding-name> ...
Replacing <encoding-name> with the encoding you used for your source files forces javadoc to use the correct encoding.
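For example, if your sources are saved as Latin-1 (Main.java and the docs output directory are placeholders for illustration):

javadoc -encoding ISO-8859-1 -d docs Main.java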
If you have several different encodings used across a group of source files that you need to build together, you need to fix that first and settle on a single, uniform encoding for all source files. You should really either use UTF-8 or stick to ASCII.
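If some files are in a legacy encoding, a one-off conversion with iconv (assuming it is available; the file names are placeholders) might look like:

iconv -f ISO-8859-1 -t UTF-8 OldEncoding.java > Utf8Version.java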
What is the current (Java 7) and future (Java 8 and later) practice regarding Unicode in Java source files?
The algorithm for processing a Java source file is:
- Get the bytes.
- Convert the bytes to characters (UTF-16 code units) using some encoding.
- Replace every sequence of '\\' 'u' followed by four hexadecimal digits with the UTF-16 code unit those digits denote; it is an error if a "\u" is not followed by four hexadecimal digits (a sketch of this step follows the list).
- Lex the characters into tokens.
- Parse the tokens into classes.
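As a rough illustration of the escape-replacement step, here is a simplified sketch (the helper name replaceUnicodeEscapes is made up; it also ignores the JLS §3.3 rule that only a backslash preceded by an even number of backslashes starts an escape):

    // Simplified sketch of step 3: replace \u (with one or more u's, as the
    // JLS allows) followed by four hex digits with that UTF-16 code unit.
    static String replaceUnicodeEscapes(String chars) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c == '\\' && i + 1 < chars.length() && chars.charAt(i + 1) == 'u') {
                int j = i + 2;
                while (j < chars.length() && chars.charAt(j) == 'u') j++; // \uuu0061 is also legal
                if (j + 4 > chars.length()) {
                    throw new IllegalArgumentException("\\u not followed by four hex digits");
                }
                // parseInt throws NumberFormatException if the digits are not hex.
                out.append((char) Integer.parseInt(chars.substring(j, j + 4), 16));
                i = j + 3; // skip past the escape; the loop's i++ moves to the next character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }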
Current and past practice is that step 2, converting bytes to UTF-16 code units, depends on the tool that loads the compilation unit (source file), but the de facto standard for command line tools is the -encoding flag.
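For example, to compile a source file saved as UTF-8 (the file name Main.java is a placeholder):

javac -encoding UTF-8 Main.java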
After this conversion, the language mandates that escape sequences of the form \uABCD are converted to UTF-16 code units (step 3) before lexing and parsing.
For example:
int a; \u0061 = 42;
is a valid pair of Java statements. Any Java source code tool must, after converting bytes to characters but before parsing, look for \uABCD sequences and convert them, so that this code becomes
int a; a = 42;
before parsing. This happens regardless of where the \uABCD sequence occurs.
This process looks something like this:
- Get the bytes:
  [105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49, 32, 61, 32, 52, 50, 59]
- Convert the bytes to characters:
  ['i', 'n', 't', ' ', 'a', ';', '\n', '\\', 'u', '0', '0', '6', '1', ' ', '=', ' ', '4', '2', ';']
- Replace the Unicode escapes:
  ['i', 'n', 't', ' ', 'a', ';', '\n', 'a', ' ', '=', ' ', '4', '2', ';']
- Lex:
  ["int", "a", ";", "a", "=", "42", ";"]
- Parse:
  (Block (Variable (Type int) (Identifier "a")) (Assign (Reference "a") (Int 42)))
Should all non-ASCII characters be escaped in JavaDoc using HTML &...; entities?
There is no need to, except for HTML-special characters such as '<' that you want to appear literally in the documentation. You can use \uABCD escapes in javadoc comments; Java processes \u.... escapes before parsing the source file, so they can appear inside string literals, comments, anywhere really. That is why
System.out.println("Hello, world!\u0022);
is a valid Java statement: \u0022 is the double quote character, so it closes the string literal before the closing parenthesis.
/** @return \u03b8 in radians */
is equivalent to
/** @return θ in radians */
as far as javadoc is concerned.
But what would be the equivalent of a Java // comment?
You can use // comments in Java, but javadoc only looks at /**...*/ comments for documentation; // comments carry no documentation metadata.
One consequence of Java's handling of \uABCD sequences is that although
// Comment text.\u000A System.out.println("Not really comment text");
looks like a single-line comment, and many IDEs will highlight it as such, it is not.
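A small test class (the name EscapeDemo is made up) shows this: the \u000A is translated to a line feed before lexing, so the // comment ends at the escape and the call that follows is compiled as live code.

    public class EscapeDemo {
        public static void main(String[] args) {
            // Comment text.\u000A System.out.println("Not really comment text");
        }
    }

Compiling and running this prints "Not really comment text".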