Are char16_t string literals required to use UTF-16 encoding?

I've been digging through the spec for a long time and can't find a definitive statement supporting either yes or no.

Does the following statement:

const char16_t *s = u"asdf";

imply or guarantee that the string literal "asdf" is encoded in UTF-16?

From everything I can deduce, yes.

However, the proposal n2018 says that char16_t literals are UTF-16 encoded only when __STDC_UTF_16__ is defined. That leaves the door open: when __STDC_UTF_16__ is undefined, char16_t literals could be encoded any way the compiler wants.
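For anyone who wants to see what their own compiler does with that macro, here is a minimal sketch. The helper name utf16_macro_defined is made up for this illustration, and whether the macro is predefined in C++ mode depends on the implementation.

```cpp
// Reports whether this implementation predefines __STDC_UTF_16__.
// utf16_macro_defined is a hypothetical helper name, not a standard facility.
constexpr bool utf16_macro_defined() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}
```

The result only tells you what this particular compiler predefines; it does not settle what the C++ standard requires.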

After all, the standard only guarantees the size, signedness, and underlying representation of char16_t; it says nothing about how the compiler should encode a char16_t string or character literal.

The specification states:

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. [Note: The size of a char16_t string literal is the number of code units, not the number of characters. -end note]

This implies that char16_t string literals are UTF-16 encoded, because "surrogate pair" is a UTF-16 concept.
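The size rule quoted above can be checked at compile time. This is only a sketch: literal_size is a hypothetical helper that reads the array bound of the literal, and the sample characters are my own choice.

```cpp
#include <cstddef>

// Returns the number of char16_t elements in a string literal,
// including the terminating u'\0'. Hypothetical helper for illustration.
template <std::size_t N>
constexpr std::size_t literal_size(const char16_t (&)[N]) {
    return N;
}

// Four BMP characters plus the terminator.
static_assert(literal_size(u"asdf") == 5, "4 code units + terminator");

// A universal-character-name inside the BMP counts as one code unit.
static_assert(literal_size(u"\u20AC") == 2, "1 code unit + terminator");

// U+1F600 is outside the BMP: one character, plus one more for the
// surrogate pair, plus the terminator.
static_assert(literal_size(u"\U0001F600") == 3, "2 code units + terminator");
```

Note that this only verifies the number of code units, which is what the quoted paragraph specifies; it says nothing by itself about the values stored in them.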

Let me know if there is anything vague in the question.

+6
2 answers

The __STDC_UTF_16__ bits did not make it into the standard text. They are in the proposal probably because it was adapted from a similar proposal for the C language. The C++ standard simply removed all that crap and made it UTF-16 or GTFO.

+6

The standard is technically agnostic about the underlying encoding and specifies only that the value of a single char16_t must correspond to a UCS code point in the range 0 to 0xFFFF:

§ 2.14.3

2 A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit.
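A small compile-time check of that paragraph, as a sketch. The helper code_unit_value is a hypothetical name that just widens the literal so it can be compared against a code point; the sample characters are my own.

```cpp
#include <type_traits>

// Widens a char16_t code unit to an integer for comparison.
// Hypothetical helper name, used only for this illustration.
constexpr unsigned long code_unit_value(char16_t c) {
    return static_cast<unsigned long>(c);
}

// The literal has type char16_t.
static_assert(std::is_same<decltype(u'y'), char16_t>::value,
              "u'y' is a char16_t");

// For code points that fit in one 16-bit code unit, the value of the
// literal equals the ISO 10646 code point.
static_assert(code_unit_value(u'y') == 0x0079, "U+0079 LATIN SMALL LETTER Y");
static_assert(code_unit_value(u'\u20AC') == 0x20AC, "U+20AC EURO SIGN");
```

A code point above 0xFFFF does not fit in a single char16_t character literal, which is where the string-literal rule about surrogate pairs comes in.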

In addition, strings may contain surrogate pairs:

§ 2.14.5

10 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type "array of n const char16_t", where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

Only UTF-16 satisfies both of these requirements, although the standard leaves the door open for a future compatible encoding, however unlikely that is.
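To make the two requirements concrete, here is a sketch that builds the UTF-16 encoding by hand and compares it with what the compiler put in a u"" literal. The names encode_utf16 and literal_matches_utf16 are hypothetical, and the encoder assumes it is given valid code points (at most 0x10FFFF and not themselves surrogates).

```cpp
#include <string>

// Encodes one ISO 10646 code point as UTF-16 code units.
// Assumes cp is a valid scalar value; no error handling in this sketch.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP code point: a single code unit equal to the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: split into a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Returns true if the compiler's literal for U+0061 U+20AC U+1F600
// matches the hand-built UTF-16 sequence.
bool literal_matches_utf16() {
    const char32_t cps[] = {0x61, 0x20AC, 0x1F600};
    std::u16string expected;
    for (char32_t cp : cps) {
        expected += encode_utf16(cp);
    }
    return expected == u"a\u20AC\U0001F600";
}
```

Calling literal_matches_utf16() from a test program should return true on any compiler that encodes char16_t literals as UTF-16, which is what the reasoning above says a conforming implementation ends up doing.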

+5
