wchar_t is not, and has never been, a Unicode character / code point. The C++ standard does not declare that wide-character literals will contain Unicode characters. Indeed, the standard says nothing at all about what wchar_t will contain.
wchar_t can be used with language and library APIs, but those apply only to the encoding defined by the implementation, not to any specific Unicode encoding. The standard library functions that take wchar_t use their knowledge of the implementation's encoding to do their jobs.
So, is a 16-bit wchar_t legal? Yes; the standard does not require wchar_t to be large enough to hold a Unicode code point.
Are wchar_t strings allowed to store UTF-16 values (or, in general, values of whatever width wchar_t happens to be)? Well, you are allowed to create wchar_t strings that store anything you want (as long as it fits). So, for the purposes of the standard, the question is whether the standard-provided means of generating wchar_t characters and strings are permitted to use UTF-16.
Well, the standard library can do whatever it wants; the standard does not guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char → wchar_t conversion through wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
If a compiler wishes to declare that its wide character set consists of the Basic Multilingual Plane of Unicode, then a declaration like L'\U0001F000' would produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:
The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.
And, of course, C++ does not allow a surrogate pair as a c-char; \uD800 is a compile error.
Where things become murky in the standard is the handling of strings that contain characters outside of the character set. The text above suggests that implementations can do what they want. And yet, [lex.string]/16 says the following:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
I say this is murky because it says nothing about what the behavior should be if a c-char in the string literal is outside the range of the destination character set.
Windows compilers (both VS and GCC-on-Windows) do cause L"\U0001F000" to have an array size of 3 (one surrogate pair, i.e. two code units, plus the NUL terminator). Is that legal C++ standard behavior? What does it mean to give a string literal a c-char that is outside the range of the character set?
I would say that this is a hole in the standard rather than a flaw in those compilers. The standard ought to make clear what the conversion behavior in this case should be.
In any case, wchar_t is not a suitable tool for processing Unicode-encoded text. It is not even guaranteed to be able to represent any Unicode form. Yes, many compilers implement wide string literals as a Unicode encoding. But since the standard does not require this, you cannot rely on it.
Now, obviously, you can stuff anything that fits into a wchar_t. So even on platforms where wchar_t is 32 bits, you can put UTF-16 data into them, with each 16-bit code unit occupying a 32-bit wchar_t. But you cannot pass such text to any API that expects the wide-character encoding unless you know that this is the expected encoding for that platform.
Basically: never use wchar_t if you want to work with a Unicode encoding.