Why do Java octal escapes only go up to 255?

The Java Language Specification states that the escape sequences in strings are the "usual" C ones, such as \n and \t, but it also specifies octal escape sequences from \0 to \377. In particular, the JLS states:

 OctalEscape:
     \ OctalDigit
     \ OctalDigit OctalDigit
     \ ZeroToThree OctalDigit OctalDigit

 OctalDigit: one of
     0 1 2 3 4 5 6 7

 ZeroToThree: one of
     0 1 2 3

This means that something like \4715 is not a legal octal escape, even though it is within the range of a Java char (since Java chars are not bytes).

Why does Java have this arbitrary restriction? How do you specify octal codes for characters greater than 255?

+6
java octal escaping
Mar 03 '12 at 3:00
5 answers

Java probably supports octal escape sequences for purely historical reasons. These escape sequences originated in C (or possibly in C's predecessors, B and BCPL), back in the days when computers like the PDP-7 ruled the Earth, much programming was done in assembly or directly in machine code, octal was the preferred number base for writing instruction codes, and there was no Unicode, just ASCII, so three octal digits were enough to represent the entire character set.

By the time Unicode and Java came along, octal had pretty much given way to hexadecimal as the preferred number base whenever decimal just would not do. So Java has the \u escape sequence, which takes hexadecimal digits. The octal escape sequence was probably supported only to make C programmers more comfortable and to make it easier to copy string constants from C programs into Java programs.

Check out these links to historical trivia:

http://en.wikipedia.org/wiki/Octal#In_computers
http://en.wikipedia.org/wiki/PDP-11_architecture#Memory_management

+9
Mar 03 '12 at 4:59

If I understand the rules correctly (please correct me if I am wrong):

 \ OctalDigit
     Examples: \0, \1, \2, \3, \4, \5, \6, \7
 \ OctalDigit OctalDigit
     Examples: \00, \07, \17, \27, \37, \47, \57, \67, \77
 \ ZeroToThree OctalDigit OctalDigit
     Examples: \000, \177, \277, \367, \377

\t, \n, and \\ do not fall under the OctalEscape rule; they are covered by separate escape sequence rules.

Decimal 255 is equal to octal 377 (use the Windows calculator in scientific mode to confirm).

Therefore, a three-digit octal escape ranges from \000 (0) to \377 (255).
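The same arithmetic can be checked from Java itself rather than a calculator. This is only a minimal sketch; the class name OctalRange and the helper parseOctal are made up for illustration.

```java
class OctalRange {
    // Hypothetical helper: parse a string of octal digits into its numeric value.
    static int parseOctal(String digits) {
        return Integer.parseInt(digits, 8);
    }

    public static void main(String[] args) {
        // The largest three-digit octal escape corresponds to 255.
        System.out.println(parseOctal("377"));          // 255
        // The char written with that escape has the same numeric value.
        System.out.println((int) '\377');               // 255
        // And 255 converts back to octal 377.
        System.out.println(Integer.toOctalString(255)); // 377
    }
}
```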

Therefore, \4715 is not a valid octal escape, since it has more than three octal digits. If you want the character whose code point is decimal 4715, use the Unicode escape \u126B (hexadecimal 126B is 4715 in decimal), since each Java char is a UTF-16 code unit.

From http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html :
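As a rough sketch of how a char above 255 can still be written, the snippet below uses a Unicode escape and, for anyone who really wants octal digits, an octal int literal cast to char. Note that the question's 4715 read as octal is 2509 (U+09CD), not decimal 4715. The class and field names are invented for the example.

```java
class BeyondOctalEscapes {
    // Decimal 4715 is hexadecimal 126B, so a Unicode escape reaches it.
    static final char DECIMAL_4715 = '\u126B';

    // Octal 4715 written as an int literal (leading 0) and narrowed to char.
    // Its value is 2509, which is hexadecimal 09CD.
    static final char FROM_OCTAL_INT = (char) 04715;
    static final char SAME_AS_ESCAPE = '\u09CD';

    // Inside a string literal the grammar matches the two-digit escape
    // (octal 47, an apostrophe) and leaves "15" as ordinary characters.
    static final String SPLIT = "\4715";

    public static void main(String[] args) {
        System.out.println((int) DECIMAL_4715);               // 4715
        System.out.println(FROM_OCTAL_INT == SAME_AS_ESCAPE); // true
        System.out.println(SPLIT.length());                   // 3
        System.out.println(SPLIT);                            // '15
    }
}
```

So a char literal '\4715' is rejected, while the string "\4715" compiles but does not mean what it appears to mean.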

The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
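To make the surrogate-pair description concrete, here is a small sketch using the standard Character.toChars method. The class name SurrogateDemo and the helper toUtf16 are made up, and U+1D11E is just an arbitrary supplementary code point picked for illustration.

```java
class SurrogateDemo {
    // Hypothetical helper: encode one code point as UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] bmp = toUtf16(0x126B);            // inside the BMP
        char[] supplementary = toUtf16(0x1D11E); // above U+FFFF

        System.out.println(bmp.length);           // 1
        System.out.println(supplementary.length); // 2

        // The pair is one high surrogate followed by one low surrogate.
        System.out.println(Character.isHighSurrogate(supplementary[0])); // true
        System.out.println(Character.isLowSurrogate(supplementary[1]));  // true

        // A String counts char values, not code points.
        String s = new String(supplementary);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```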

Edit:

Whether octal escapes beyond the 8-bit range (more than one byte) are allowed depends on the language. Some programming languages extend them to cover Unicode; some do not (they stop at one byte). Java definitely does not allow it, even though it supports Unicode.

Some programming languages (vendor-specific) that restrict octal escapes to a single byte:

  • Java (all vendors): octal integer constants start with 0; octal escapes in string literals run from \0 to \7, \00 to \77, and \000 to \377
  • C / C++ (Microsoft): octal integer constants start with 0; string escape format \nnn (up to \377)
  • Ruby: octal integer constants start with 0; string escape format \nnn

Some programming languages (vendor-specific) that support octal literals larger than one byte:

Some programming languages do not support octal literals:

  • C#: use Convert.ToInt32(integer, 8) for base 8 (see "How can we convert a binary number to its octal number using C#?")
+1
Mar 03 '12 at 3:30

A real answer to the "why" would require us to ask the designers of the Java language. We are not in a position to do that, and I doubt they would even be able to answer. (Can you remember the detailed technical discussions you had ~20 years ago?)

However, a plausible explanation for this “limitation” is that:

  • octal escapes were borrowed from C / C++, in which they are also limited to 8 bits,
  • octal is old-fashioned, and IT people generally prefer and are more comfortable with hex, and
  • Java supports ways to express Unicode, either by embedding it directly in the source code or by using \u Unicode escapes ... which are not limited to string and character literals.

And to be honest, I have never heard anyone (except you) argue that octal escapes should be longer than 8 bits in Java.




Incidentally, when I started in computing, character sets tended to be hardware specific and were often less than 8 bits. In my undergraduate coursework and my first job after graduating, I used CDC 6000 series machines, which had 60-bit words and a 6-bit character set ("Display Code", I think we called it). Octal works very nicely in that context. But as the industry moved toward (almost) universal adoption of 8/16/32/64-bit architectures, people increasingly used hexadecimal rather than octal.

+1
Mar 03 '12 at 5:20

The octal escape sequences \0-\377 are also inherited from C, and the restriction makes reasonable sense in a language such as C, where characters == bytes (at least in the halcyon days before wchar_t).

0
Mar 03 '12 at 5:00

I do not know why octal escapes are limited to Unicode code points 0 to 255. It may be for historical reasons. The question will largely remain unanswered, since there was no technical reason not to extend the range of octal escapes when Java was designed.

It should be noted, however, that there is a less obvious difference between Unicode escapes and octal escapes. Octal escapes are only processed inside string and character literals, while Unicode escapes can occur anywhere in the file, for example as part of a class name. Also note that the following example does not even compile:

 String a = "\u000A"; 

The reason is that \u000A is expanded to a newline at a very early stage (essentially when the file is read). The following code does not cause an error:

 String a = "\012"; 

\012 is expanded after the compiler has parsed the code. The same applies to the other escapes, such as \n, \r, \t, etc.

So, in conclusion: Unicode escape sequences are NOT a replacement for octal escapes. They are a completely different concept. In particular, to avoid problems (as in the case of \u000A above), use octal escapes for code points from 0 to 255 and Unicode escapes for code points above 255.
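The difference in timing can be illustrated with a short sketch. It assumes nothing beyond standard Java; the class and field names are invented for the example.

```java
class EscapeTiming {
    // The Unicode escape for the letter a is translated before the source
    // is tokenized, so this line declares a field whose name is simply "a".
    static int \u0061 = 42;

    // Octal and named escapes are interpreted only inside literals,
    // after tokenizing, and both of these produce a single line feed.
    static final String OCTAL_NEWLINE = "\012";
    static final String NAMED_NEWLINE = "\n";

    public static void main(String[] args) {
        System.out.println(a);                                   // 42
        System.out.println(OCTAL_NEWLINE.equals(NAMED_NEWLINE)); // true
        System.out.println(OCTAL_NEWLINE.length());              // 1
    }
}
```

Because Unicode escapes are also translated inside comments, a stray line-feed escape in a comment can break compilation in the same way as in the string example above.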

0
Sep 09


