How does file encoding affect C++11 string literals?

You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8 / u / U respectively. How should the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand that the standard does not specify file encodings, and that fact alone would seem to make the interpretation of non-ASCII characters inside source code completely undefined, making the feature a little less useful.

I understand that you can still escape single Unicode characters with \uNNNN , but that is not very readable for, say, a complete Russian or French sentence, which usually contains more than one such character.

What I understand from various sources is that u should become equivalent to L on current Windows implementations and U on e.g. Linux implementations. With that in mind, I am also wondering what the required behavior is for the old string literal modifiers...
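As a quick way to see which equivalence applies on a given toolchain, here is a minimal sketch (my addition, assuming only a conforming C++11 compiler) that compares the character type widths:

 #include <iostream>

 int main() {
     // Typically 2 on Windows (same width as char16_t) and 4 on Linux/glibc
     // (same width as char32_t); the standard only requires wchar_t to be wide
     // enough for the largest extended character set among the supported locales.
     std::cout << "sizeof(wchar_t)  = " << sizeof(wchar_t)  << '\n'
               << "sizeof(char16_t) = " << sizeof(char16_t) << '\n'
               << "sizeof(char32_t) = " << sizeof(char32_t) << '\n';
 }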

For the code-sample monkeys:

 std::string    a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!"; // UTF-8
 std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";  // UTF-16
 std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";  // UTF-32

In an ideal world, all three of these lines produce the same content (i.e., the same characters after conversion), but my experience with C++ has taught me that this is most definitely implementation-specific, and probably only the first one will do what I want.
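To make "same content" concrete, here is a minimal sketch (my addition, assuming the compiler actually reads this file as UTF-8): the three literals should hold the same characters, just encoded with different code unit widths.

 #include <iostream>
 #include <string>

 int main() {
     std::string    a = u8"Ça";  // 'Ç' takes two UTF-8 code units  -> a.size() == 3
     std::u16string b = u"Ça";   // 'Ç' takes one UTF-16 code unit  -> b.size() == 2
     std::u32string c = U"Ça";   // 'Ç' takes one UTF-32 code unit  -> c.size() == 2
     std::cout << a.size() << ' ' << b.size() << ' ' << c.size() << '\n';
 }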

+10
c++ encoding c++11 string-literals
Jul 22 '11 at 18:40
3 answers

In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command-line option. Currently the command-line option takes precedence if there is a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out -fexec-charset and -fwide-exec-charset .
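For example, a minimal sketch of how the three flags fit together (the file name and the specific encoding values are my own illustration; any encoding name supported by iconv can be used):

 // Example invocation:
 //   g++ -std=c++11 -finput-charset=UTF-8 -fexec-charset=UTF-8 \
 //       -fwide-exec-charset=UTF-32LE main.cpp
 const char    narrow[] = "héllo";  // stored in the -fexec-charset encoding
 const wchar_t wide[]   = L"héllo"; // stored in the -fwide-exec-charset encoding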

Finally, about string literals:

 char a[] = "Hello";
 wchar_t b[] = L"Hello";
 char16_t c[] = u"Hello";
 char32_t d[] = U"Hello";

The size modifier of the string literal ( L , u , U ) simply determines the type of the literal.
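A minimal sketch of that point (my addition, not from the answer): the prefix selects the element type of the literal's array, which can be checked at compile time.

 #include <type_traits>

 static_assert(std::is_same<decltype("x"),   const char(&)[2]>::value,     "narrow literal");
 static_assert(std::is_same<decltype(u8"x"), const char(&)[2]>::value,     "u8 is also char in C++11");
 static_assert(std::is_same<decltype(L"x"),  const wchar_t(&)[2]>::value,  "wide literal");
 static_assert(std::is_same<decltype(u"x"),  const char16_t(&)[2]>::value, "UTF-16 literal");
 static_assert(std::is_same<decltype(U"x"),  const char32_t(&)[2]>::value, "UTF-32 literal");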

+7
Jul 22 '11 at 18:45

How should the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand that the standard does not specify file encodings, and that fact alone would seem to make the interpretation of non-ASCII characters inside source code completely undefined, making the feature a little less useful.

From n3290, 2.2 Phases of translation [lex.phases]:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [A bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here is my attempt at a somewhat simpler, step-by-step description of what happens:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

The basic source set is a simple list of allowed characters. It is not ASCII (see further). Anything not in this list is "transformed" (conceptually at least) into the \uXXXX form.

So, no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX . I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e., one not from the basic source set) must be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so reasoning in terms of ASCII is flawed from the get-go.
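A small way to convince yourself of that indistinguishability (a sketch of mine, assuming the source file really is saved as UTF-8 and the compiler reads it as such): an extended character and its \uXXXX spelling must produce identical arrays.

 #include <cassert>
 #include <string>

 int main() {
     assert(std::u16string(u"ô") == std::u16string(u"\u00F4"));
     assert(std::string(u8"Ça")  == std::string(u8"\u00C7a"));
 }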

Finally, the process described above happens to (non-raw) string literals too. That means your code is equivalent to having written:

 std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
 std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
 std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
+4
Jul 22 '11 at 20:24

Fundamentally, encoding issues only matter when you output your strings by making them visible to humans, which is not a question of how the programming language is defined, since its definition deals only with the encoding used for computation. So, when deciding whether what you see in your editor will be the same as what you see in the output (any kind of image, whether on screen or in a PDF), you should ask yourself which conventions your user-interaction library and your operating system assume. (Here, for example, is that kind of information for Qt5: with Qt5, what you see as the application's user and what you see as its programmer coincide if the contents of the old-fashioned string literals for your QStrings are encoded as UTF-8 in your source files, unless you activate another setting during the application's execution.)
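As an illustrative sketch of that Qt5 convention (the calls below are Qt's API; the particular strings are my own example and assume the source file is saved as UTF-8):

 #include <QString>

 // In Qt 5 a plain narrow literal handed to QString is interpreted as UTF-8,
 // so what you typed in the editor and what the user sees coincide as long as
 // the file really is UTF-8 encoded.
 QString implicit_utf8 = "L'hôtel de ville";
 QString explicit_utf8 = QString::fromUtf8("là-bas");  // spells the convention out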

As a conclusion, I believe Kerrek SB is right and Damon is wrong: indeed, the methods of specifying a literal in code should specify its type, not the encoding that is used in the source file to fill its contents, since the type of a literal is what concerns the computation done with it. Something like u"string" is just an array of "Unicode code units" (that is, values of type char16_t ), whatever the operating system or any other service software later does with them, and however their job ends up looking to you or to another user. You just get the additional problem of adding another convention for yourself, one that makes the correspondence between the "meaning" of the numbers under computation (namely, that they represent Unicode code points) and their representation on your screen as you work in your text editor. How and whether you as a programmer use that "meaning" is another question, and how you could enforce that other correspondence is naturally going to be implementation-defined, because it has nothing to do with the encoding used for computation, only with the comfort of using a tool.

0
Oct 10 '15 at 15:02


