Is 16-bit wchar_t formally valid for representing full Unicode?

In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that 16-bit wchar_t, encoded as UTF-16, where sometimes two such values (a “surrogate pair”) are required for a single Unicode code point, is not valid for representing Unicode.

Granted, it is inconvenient and conflicts with the assumption of the standard C and C++ libraries (for example, character classification) that each code point is represented by a single value. However, the Unicode Consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does so.

And, of course, the original intent does seem to have been one wchar_t value per code point, consistent with the assumptions of the standard C and C++ libraries. For example, on the ³unix.org web page about ISO C Amendment 1 (MSE), the amendment that brought wchar_t into standard C in 1995, the authors state that

The major advantage of the single-byte/single-character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of a wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.

But as it turns out, the C and C++ standards do not seem to talk about the largest supported character, only about the largest extended character set among the supported locales: wchar_t must be large enough to represent every code point in the largest such extended character set, but not Unicode, if there is no Unicode locale.

C99 §7.17/2 (from the N869 draft):

" [type wchar_t ] is an integer type whose range of values ​​can be different codes for all members of the largest extended character set specified among supported locales.

This is nearly identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be quite small, down to a single byte with UTF-8 encoding (a nightmarish possibility where, for example, the standard character classification functions would not work outside of ASCII A through Z, but hey). Possibly the following requirement is meant to be wider:

C99 §7.1.1/4:

" A wide character is the code value (binary integer) of an object of type wchar_t , which corresponds to a member of the extended character set.

& hellip; as it refers to an extended character set, but the term is apparently no longer defined.

And at least with Microsoft's C and C++ implementation there is no Unicode locale: with that implementation, setlocale is restricted to character encodings with at most 2 bytes per character:

MSDN's ⁴setlocale documentation:

" . The set of available language names, languages, country / region codes and code pages includes all those that are supported by the Windows NLS API, with the exception of code pages that require more than two bytes per character, such as UTF-7 and UTF- 8. If you specify a UTF-7 or UTF-8 codepage value, setlocale will fail with an NULL return.

So it seems that, contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And largely thanks to Microsoft's lack of support for UTF-8 locales, or for any locale with more than 2 bytes per character. But is it really so? Is 16-bit wchar_t OK?
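(As a concrete data point, here is a tiny probe of the situation I am asking about; it assumes a Windows toolchain, where wchar_t is 16 bits and WCHAR_MAX is 65535.)

    #include <cstdint>
    #include <cwchar>
    #include <iostream>

    // A minimal probe, assuming a Windows toolchain: wchar_t is 16 bits there,
    // so a single value cannot hold code points beyond U+FFFF.
    int main() {
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n'   // 2 on Windows
                  << "WCHAR_MAX       = " << WCHAR_MAX << '\n';        // 65535 on Windows
    }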


References:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

+7
c++ c encoding winapi unicode
3 answers

wchar_t is not and never was a Unicode character/code point. The C++ standard does not declare that wide-character literals will contain Unicode characters. Indeed, the standard says nothing at all about what wchar_t will contain.

wchar_t can be used with locale APIs, but those only apply to the implementation-defined encoding, not to any particular Unicode encoding. The standard library functions that take wchar_t use their knowledge of the implementation's encoding to do their jobs.

So, is a 16-bit wchar_t legal? Yes; the standard does not require wchar_t to be large enough to hold a Unicode code point.

Is a string of wchar_t permitted to hold UTF-16 values (or variable-width encodings in general)? Well, you are permitted to make wchar_t strings that hold whatever you want (so long as it fits). So, for the purposes of the standard, the question is whether the standard-provided means of generating wchar_t characters and strings are permitted to use UTF-16.

Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char-to-wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
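To make the "no 1:1 guarantee" point concrete, here is a small sketch. It assumes a platform where wide strings hold UTF-16 code units (as on Windows) and uses std::codecvt_utf8_utf16, the facet C++11 provides for UTF-8/UTF-16 conversion (deprecated since C++17 but still shipped by major toolchains): one UTF-8-encoded code point can come out as two wchar_t values.

    #include <codecvt>   // deprecated since C++17, but still available
    #include <iostream>
    #include <locale>
    #include <string>

    // Sketch: converting UTF-8 to wchar_t need not be a 1:1 code-point mapping.
    int main() {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        std::string utf8 = "\xF0\x9F\x80\x80";      // U+1F000 as four UTF-8 bytes
        std::wstring wide = conv.from_bytes(utf8);  // one code point in...
        std::wcout << wide.size() << L'\n';         // ...two wchar_t code units out
    }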

If the compiler wants to claim that its wide character set consists of the Basic Multilingual Plane, then a character literal like L'\U0001F000' will produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:

The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.

And, of course, C++ does not allow a surrogate code point as a c-char; \uD800 is a compile error.

Where things get murky in the standard is the handling of strings that contain characters outside of the character set. The text above suggests that implementations can do what they want. And yet, [lex.string]/16 says this:

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

I say this is murky because it says nothing about what the behavior should be when a c-char in the string literal is outside the range of the destination character set.

Windows compilers (both VS and GCC-on-Windows) do indeed make L"\U0001F000" have an array size of 3 (the two code units of a surrogate pair plus one NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside the range of the character set?
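A quick way to see that behavior (a sketch assuming a Windows toolchain, where wchar_t is 16 bits; on a 32-bit-wchar_t platform the array size is 2 and the assertion fails):

    // Checks the observed behavior: one non-BMP c-char yields two wchar_t code
    // units (a surrogate pair) plus the terminating L'\0'.
    int main() {
        constexpr wchar_t lit[] = L"\U0001F000";
        static_assert(sizeof(lit) / sizeof(lit[0]) == 3,
                      "expected surrogate pair + NUL on a 16-bit-wchar_t platform");
    }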

I would say that this is a hole in the standard rather than a defect in those compilers. The standard should make it clearer what the conversion behavior is supposed to be in this case.


In any case, wchar_t is not a suitable tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard does not require this, you cannot rely on it.

Now, obviously you can stuff anything that fits into a wchar_t. So even on platforms where wchar_t is 32 bits, you could shove UTF-16 data into them, with each 16-bit code unit occupying 32 bits. But you couldn't pass such text to any API function that expects the wide-character encoding unless you knew that this was the expected encoding for that platform.

Basically, never use wchar_t if you need to work with a specific Unicode encoding.

+2

Let's start from first principles:

(§3.7.3) <bit> wide character: a bit representation that is suitable for an object of type wchar_t, capable of representing any character in the current locale

(§3.7) character: member of a set of elements used for the organization, control, or representation of data

This immediately rules out full Unicode as a character set (a set of elements/characters) representable in a 16-bit wchar_t.

But wait, Nicol Bolas quoted the following:

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say:

(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.⁸⁾

and clarifies in that footnote that an implementation need not convert all non-corresponding source characters to the same execution character.

Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates to be proper characters in their own right, not surrogates for other characters. AFAICT, this means you are in the clear as far as Section 6 (Language) of ISO C99 is concerned.

Of course, don't expect Section 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates. And you would get some meaningless result, but that's fine, because you declared the surrogate code units to be proper members of your execution character set.
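A sketch of that limitation, assuming a platform where wchar_t holds UTF-16 code units (e.g. Windows); U+1D400 is an alphabetic character outside the BMP, so iswalpha only ever sees one surrogate code unit at a time:

    #include <cwctype>
    #include <iostream>

    // iswalpha() classifies a single wint_t value; with 16-bit wchar_t an astral
    // character such as U+1D400 (MATHEMATICAL BOLD CAPITAL A) can only be passed
    // one surrogate code unit at a time, so the result says nothing useful about
    // the actual character.
    int main() {
        const wchar_t s[] = L"\U0001D400";   // two code units on such a platform
        std::wcout << std::iswalpha(s[0]) << L' ' << std::iswalpha(s[1]) << L'\n';
    }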

0

Since the question has been clarified, I am editing my answer accordingly. Q: Is a 16-bit wchar_t on Windows conformant?

A: Well, let's see. We start with the definition of wchar_t from the C99 draft:

… the largest extended character set specified among the supported locales.

So we have to look at which locales are supported. There are three steps to this:

  • We check the setlocale documentation.
  • We follow it to the documentation for the locale string, where we see the string format:

     locale :: "locale_name" | "language[_country_region[.code_page]]" | ".code_page" | "C" | "" | NULL 
  • We look at the list of supported code pages; UTF-8, UTF-16 and UTF-32 are not among the code pages setlocale accepts (those needing more than two bytes per character are excluded), and so we are at a dead end (a quick check of this is sketched below).
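A quick check of that dead end, as a sketch for a Windows CRT: code page 65001 is UTF-8, and per the documentation quoted in the question, the CRTs of that era reject it (note that recent Universal CRT versions have since added UTF-8 locale support).

    #include <clocale>
    #include <cstdio>

    // Asking the CRT for a UTF-8 locale; per the MSDN text quoted in the
    // question, setlocale returns NULL here on the CRTs of that era.
    int main() {
        const char* result = std::setlocale(LC_ALL, ".65001");   // 65001 = UTF-8 code page
        std::printf("%s\n", result ? result : "setlocale failed (returned NULL)");
    }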

If we go back to the C99 definition of a wide character instead, it ends with

… corresponds to a member of the extended character set.

The term character set is used. If we say that the code units of UTF-16 are our character set, then everything is fine; otherwise it is not. It is a bit vague, and I don't much care: the standards were written many years ago, when Unicode was not yet popular.

At the end of the day, we have C++11 and C11, which address UTF-8, UTF-16, and UTF-32 explicitly with the additional types char16_t and char32_t.
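For illustration, a small sketch using those C++11 types, whose encodings are pinned down by the standard (u"…" literals use UTF-16 surrogate pairs, U"…" literals one element per code point), unlike wchar_t:

    // u"" literals are UTF-16 and U"" literals are UTF-32 by definition, so the
    // element counts below hold on any conforming C++11 implementation.
    int main() {
        const char16_t u16[] = u"\U0001F000";   // surrogate pair + NUL -> 3 units
        const char32_t u32[] = U"\U0001F000";   // one code point + NUL -> 2 units
        static_assert(sizeof(u16) / sizeof(u16[0]) == 3, "UTF-16 needs a surrogate pair here");
        static_assert(sizeof(u32) / sizeof(u32[0]) == 2, "UTF-32 holds it in one unit");
    }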


You need to read up on Unicode, and then you will be able to answer the question yourself.

Unicode is a character set of roughly 200,000 characters. Or, more precisely, it is a mapping, a correspondence between numbers and characters. Unicode by itself does not imply any particular bit width.

Then there are the encodings: UTF-7, UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format. Each format is defined in terms of code points and code units. A code point is an actual Unicode character and may be represented by one or more code units. Only UTF-32 has exactly one code unit per code point.

Each code unit, on the other hand, fits into a fixed-size integer. So UTF-7 code units are at most 7 bits wide, UTF-16 code units at most 16 bits, and so on.

Therefore, in a string of 16-bit wchar_t we can store Unicode text encoded as UTF-16. In UTF-16, each code point takes one or two code units.

So the final answer is: in a single wchar_t you cannot store every Unicode character, only individual code units, but in a string of wchar_t you can store any Unicode text.
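A sketch of what that means in practice, assuming wchar_t holds UTF-16 code units (as on Windows): the string stores the text fine, but recovering code points means combining surrogate pairs yourself.

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Walks a wide string and prints the code points it contains, combining a
    // high/low surrogate pair into one code point where one is found.
    int main() {
        std::wstring text = L"A\U0001F000";          // 'A' plus a non-BMP character
        for (std::size_t i = 0; i < text.size(); ++i) {
            std::uint32_t cp = text[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
                std::uint32_t low = text[i + 1];
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;                                  // consumed the low surrogate too
            }
            std::wcout << L"U+" << std::hex << std::uppercase << cp << L'\n';
        }
    }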

-1
