Why doesn't a euro sign inside a UTF-8 string literal get converted to a UCN?

The standard says that in translation phase 1

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

And in phase 4 it says

Preprocessing directives are executed, macro invocations are expanded.

In phase 5, we have

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set

For the # operator we have

a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).

So I ran the following test:

 #define GET_UCN(X) #X
 GET_UCN("€")

With an input character set of UTF-8 (matching my file's encoding), I expected the preprocessing of the #X operation to produce "\"\\u20AC\"". GCC, Clang and boost.wave do not convert the € to a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you explain?
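
For reference, a self-contained version of that test which prints the stringized result at run time might look like the sketch below; it assumes the source file is saved as UTF-8.

 #include <cstdio>

 // Stringize the macro argument with the # operator.
 #define GET_UCN(X) #X

 int main() {
     // Per the phase 1 wording quoted above, € would first be replaced by \u20AC
     // and the # operator would then escape the backslash, so this would print
     // "\u20AC" (including the surrounding quotes). GCC and Clang instead print
     // "€", i.e. the raw UTF-8 bytes of the euro sign between the quotes.
     std::puts(GET_UCN("€"));
 }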

+7
4 answers

This is simply a bug. §2.1/1 says about phase 1:

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently.)

This is not a note or a footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand, if you have one.

This program clearly demonstrates the misbehavior:

 #include <iostream>

 #define GET_UCN(X) L ## #X

 int main() {
     std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
 }

http://ideone.com/lb9jc

Since both strings are wide, the first should be mangled into several characters if the compiler fails to interpret the multibyte input sequence. In this example, a complete lack of UTF-8 support could cause the compiler to echo the bytes of the sequence one by one.
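
(Not part of the original answer: a small variation of the same macro that compares the two spellings directly, which is exactly the "handled equivalently" requirement quoted above. It is only a sketch and assumes the source file is saved as UTF-8.)

 #include <cwchar>
 #include <iostream>

 #define GET_UCN(X) L ## #X

 int main() {
     const wchar_t* raw = GET_UCN("€");       // extended character written directly
     const wchar_t* ucn = GET_UCN("\u20AC");  // the same character written as a UCN
     // If phase 1 really replaced € with the spelling \u20AC, both arguments would
     // be spelled identically by the time # runs, so the two strings would compare
     // equal. With GCC and Clang they do not.
     std::wcout << (std::wcscmp(raw, ucn) == 0 ? L"equivalent\n" : L"different\n");
 }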

+1

"and the name of the generic character in a character literal or jagged string literal is converted to the corresponding member of the execution character set

earlier

"or the universal symbol-name in character literals and string literals is converted to a member of the execution character set

Maybe you need a future version of g ++.

-1

I'm not sure where you got that quote for translation phase 1 from; the C99 standard says this about translation phase 1 in §5.1.1.2/1:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

So in this case, the euro character € (represented by the multibyte sequence E2 82 AC in UTF-8) is mapped into the source character set, which here is also UTF-8, so its representation remains unchanged. It is not translated into a universal character name because, well, nothing there says it should be.
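
(To illustrate the point about the representation staying unchanged, here is a small byte dump; this is my own sketch, assuming a UTF-8 source file and a UTF-8 execution character set.)

 #include <cstdio>

 int main() {
     const char euro[] = "€";
     // With a UTF-8 source and execution character set this prints
     // "e2 82 ac 00" -- exactly the bytes that were in the source file.
     for (unsigned char c : euro)
         std::printf("%02x ", c);
     std::printf("\n");
 }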

-1

I suspect you will find that the euro sign does not satisfy the condition "Any source file character not in the basic source character set", so the rest of the text you quote does not apply.

Open the test file with your favorite binary editor and check what byte values are used to represent the euro sign in GET_UCN("€").
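
(If you prefer not to open a binary editor, a tiny hex dump program does the same job. This is only a sketch, and "test.cpp" is a placeholder for your actual file name.)

 #include <cstdio>
 #include <fstream>

 int main() {
     std::ifstream in("test.cpp", std::ios::binary);  // placeholder file name
     char c;
     while (in.get(c))
         std::printf("%02x ", static_cast<unsigned char>(c));
     std::printf("\n");
     // A euro sign stored as UTF-8 shows up as the three bytes e2 82 ac.
 }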

-2
