
Why is "U+" used to denote a Unicode code point?

Why do Unicode code points look like U+<codepoint>?

For example, U+2202 represents the symbol ∂.

Why not U- (dash or hyphen) or anything else?

+53
unicode
Aug 13 '09 at 18:16
4 answers

The "U+" characters are an ASCIIfied version of U+228E MULTISET UNION "⊎" (a U-shaped union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler's explanation on the Unicode mailing list.

+107
Jan 17 '12

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: an archived PDF on the Unicode Consortium website).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as Unicode code points, rather than octets, unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".

My personal recollection from early-1990s Unicode software discussions is that the convention of "U+" followed by four hexadecimal digits was common in the Unicode 1.0 and Unicode 2.0 era, when Unicode was still regarded as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or perhaps it went the other way, from "U-" to "U+".) In my experience, the "U+" convention is now far more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I could not find documentation of the transition from "U+" to "U-". Archived mailing-list messages from the 1990s ought to contain the evidence, but I cannot point to any with confidence. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits" (p. 2-3). It stated its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four-digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (pp. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).

The U+ notation is not the only convention for representing Unicode code points or code units. For example, Python defines the following string literals (a short sketch follows the list):

  • u'xyz' to indicate a Unicode string, a sequence of Unicode characters
  • '\uxxxx' to indicate a string containing a Unicode character specified by four hexadecimal digits
  • '\Uxxxxxxxx' to indicate a string containing a Unicode character specified by eight hexadecimal digits
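
As a minimal sketch (Python 3; the specific values are my own illustration, not part of the original answer), those escape forms behave like this:

    s1 = u'xyz'            # the u prefix marks a Unicode string (redundant in Python 3)
    s2 = '\u2202'          # four hex digits: U+2202 PARTIAL DIFFERENTIAL, "∂"
    s3 = '\U0001D11E'      # eight hex digits: U+1D11E MUSICAL SYMBOL G CLEF
    print(s2, hex(ord(s2)))   # prints: ∂ 0x2202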
+11
Jan 17 '12 at 8:01

It depends on which version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" only with exactly four digits to indicate a code unit, not a code point.

+7
Aug 13 '09 at 18:19

It is just a convention to show that the value is a Unicode code point. A bit like "0x" or "h" for hexadecimal values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because of how the coin flipped :-)
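
To make the analogy concrete, here is a small hypothetical Python helper (the function name is my own, not from the answer) that strips the "U+" prefix and reads the remaining digits as ordinary hexadecimal, just as an 0x literal would be read:

    def char_from_u_plus(notation):
        # Turn a code-point label such as "U+2202" into the character it names.
        # Assumes the string starts with "U+" or "u+".
        assert notation[:2].upper() == "U+"
        return chr(int(notation[2:], 16))  # the digits after "U+" parse like any hex number

    print(char_from_u_plus("U+2202"))  # prints: ∂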

+3
May 28 '11 at 9:57 a.m.


