
Why is "U+" used to denote a Unicode code point?

Why do Unicode code points look like U+<codepoint>?

For example, U+2202 represents the symbol ∂.

Why not U- (dash or hyphen) or anything else?

+53
unicode
Aug 13 '09 at 18:16
4 answers

The "U+" characters are an ASCIIfied version of U+228E MULTISET UNION "⊎" (a U-shaped union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler's explanation on the Unicode mailing list.

+107
Jan 17 '12

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: an archived PDF on the Unicode Consortium website).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as Unicode code points, rather than octets, unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".

My personal recollection from early-1990s Unicode software discussions is that the convention of "U+" followed by four hexadecimal digits was common in the Unicode 1.0 and Unicode 2.0 era, when Unicode was still regarded as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or perhaps it went the other way, from "U-" to "U+".) In my experience, the "U+" convention is now far more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I could not find documentation of the transition from "U+" to "U-". Archived mailing-list messages from the 1990s ought to contain the evidence, but I cannot point to any with confidence. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits" (p. 2-3). It stated its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four-digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (pp. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).

The U+ notation is not the only convention for representing Unicode code points or code units. For example, Python defines the following string literals (a short sketch follows the list):

  • u'xyz' to indicate a Unicode string, a sequence of Unicode characters
  • '\uxxxx' to indicate a string containing a Unicode character specified by four hexadecimal digits
  • '\Uxxxxxxxx' to indicate a string containing a Unicode character specified by eight hexadecimal digits
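
As a minimal sketch (Python 3; the specific values are my own illustration, not part of the original answer), those escape forms behave like this:

    s1 = u'xyz'            # the u prefix marks a Unicode string (redundant in Python 3)
    s2 = '\u2202'          # four hex digits: U+2202 PARTIAL DIFFERENTIAL, "∂"
    s3 = '\U0001D11E'      # eight hex digits: U+1D11E MUSICAL SYMBOL G CLEF
    print(s2, hex(ord(s2)))   # prints: ∂ 0x2202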
+11
Jan 17 '12 at 8:01

It depends on which version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" only with exactly four digits to indicate a code unit, not a code point.

+7
Aug 13 '09 at 18:19

It is just a convention to show that the value is a Unicode code point. A bit like "0x" or "h" for hexadecimal values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because of how the coin flipped :-)
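
To make the analogy concrete, here is a small hypothetical Python helper (the function name is my own, not from the answer) that strips the "U+" prefix and reads the remaining digits as ordinary hexadecimal, just as an 0x literal would be read:

    def char_from_u_plus(notation):
        # Turn a code-point label such as "U+2202" into the character it names.
        # Assumes the string starts with "U+" or "u+".
        assert notation[:2].upper() == "U+"
        return chr(int(notation[2:], 16))  # the digits after "U+" parse like any hex number

    print(char_from_u_plus("U+2202"))  # prints: ∂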

+3
May 28 '11 at 9:57 a.m.


