Explanation needed for UTF-8 and C++

I have Microsoft Visual Studio 2010 on Windows 7 64-bit. (In the project properties, the "Character Set" option is set to "Not Set"; however, every setting leads to the same output.)

Source:

#include <cstdio>
#include <iostream>
using namespace std;

bool set_codepage(); // defined in a separate .cpp, see *1 below

char const charTest[] = "árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n";

int main()
{
    cout << charTest;
    printf(charTest);

    if (set_codepage()) // SetConsoleOutputCP(CP_UTF8); // *1
        cerr << "DEBUG: set_codepage(): OK" << endl;
    else
        cerr << "DEBUG: set_codepage(): FAIL" << endl;

    cout << charTest;
    printf(charTest);
}

*1: Including windows.h directly messed things up, so I call SetConsoleOutputCP from a separate .cpp file.

The compiled binary contains the string as the correct sequence of UTF-8 bytes. If I switch the console to UTF-8 using chcp 65001 and issue type main.cpp, the lines are displayed correctly.

Test (the console is configured to use the Lucida Console font):

 D:\dev\user\geometry\Debug>chcp
 Active code page: 852

 D:\dev\user\geometry\Debug>listProcessing.exe
 ├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
 ├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
 DEBUG: set_codepage(): OK
   rv  zt  r   t  k  rf  r  g  p   RV  ZT  R   T  K  RF  R  G  P
 árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

What is the explanation for this? Can I somehow get cout to work like printf?

ADDITION

Many say that the Windows console does not support UTF-8 characters at all. I am a Hungarian guy in Hungary, I have Windows installed in English (except for the date formats, which are set to Hungarian), and Cyrillic letters are still displayed correctly alongside the Hungarian letters:

(Screenshot: Hungarian and Cyrillic letters on the console at the same time)

(My default code page is CP852)

+8
c++ visual-studio utf-8
4 answers

The difference here is in how the C++ runtime and the C library handle the system locale.

To achieve the same result with std::cout you can try std::ios::imbue and std::locale.
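
A minimal sketch of that suggestion follows; the locale name ".65001" is an assumption, and whether the runtime accepts it (and whether imbue actually changes what reaches the console) depends on the specific compiler/CRT version:

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    try {
        // Hypothetical UTF-8 locale name; not every MSVC runtime accepts it.
        std::locale utf8_locale(".65001");
        std::cout.imbue(utf8_locale);
    } catch (const std::runtime_error&) {
        std::cerr << "UTF-8 locale not available on this runtime\n";
    }
    std::cout << "árvíztűrő tükörfúrógép\n";
    return 0;
}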

But the main problem with UTF-8 and C++ is described here:

C++03 offers two kinds of string literals. The first kind, contained in double quotes, produces a null-terminated array of type const char. The second kind, written as L"...", produces a null-terminated array of type const wchar_t, where wchar_t is a wide character. Neither literal type supports string literals encoded in UTF-8, UTF-16, or any other kind of Unicode encoding.
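
In code, the two C++03 literal kinds from the quote look like this (the u8 form shown for contrast is a later C++11 addition, so it is not available in VS2010):

const char    narrow[] = "árvíztűrő";   // bytes depend on the source/execution character set
const wchar_t wide[]   = L"árvíztűrő";  // wchar_t elements, UTF-16 on Windows
// C++11 later added an explicitly UTF-8 encoded literal:
// const char utf8[]   = u8"árvíztűrő"; // guaranteed UTF-8 byte sequence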

One way or another, this is all implementation-specific and therefore not portable: C++ output streams can only be made to understand UTF-8 through non-standard means.

+4

As far as I understand, the command line does work with UTF-8, provided that:

  • A font capable of displaying the Unicode characters is selected.
  • The correct code page is set on the command line (chcp 65001). I am not sure whether this code page supports all Unicode characters, but it seems the best one available (a programmatic sketch of the same setting follows right after this list).
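
If you prefer to set the code page from the program instead of typing chcp 65001, a minimal sketch looks like this (presumably this is what the question's set_codepage() helper does, but its contents are not shown, so that is an assumption; the snippet also assumes the source file is saved as UTF-8 so the literal holds UTF-8 bytes):

#include <windows.h>
#include <cstdio>

int main()
{
    // Equivalent of running "chcp 65001" before the program prints anything.
    SetConsoleOutputCP(CP_UTF8);
    std::printf("árvíztűrő tükörfúrógép\n");
    return 0;
}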

Check here and here.

[EDIT] 65001 actually is UTF-8; I verified it in PowerShell:

 PS C:\Users\forcewill> chcp 65001
 Active code page: 65001
 PS C:\Users\forcewill> [Console]::OutputEncoding

 BodyName          : utf-8
 EncodingName      : Unicode (UTF-8)
 HeaderName        : utf-8
 WebName           : utf-8
 WindowsCodePage   : 1200
 IsBrowserDisplay  : True
 IsBrowserSave     : True
 IsMailNewsDisplay : True
 IsMailNewsSave    : True
 IsSingleByte      : False
 EncoderFallback   : System.Text.EncoderReplacementFallback
 DecoderFallback   : System.Text.DecoderReplacementFallback
 IsReadOnly        : True
 CodePage          : 65001

You can use PowerShell, which is much more powerful than the old cmd.exe.

Edit: Regarding the use of cout in Visual Studio, the correct answer is here; you can find a more detailed explanation here about best practices in Visual Studio.

+2

On Windows, single-byte strings are usually interpreted as ASCII or some 256-character code page. This means that you will not get real Unicode support.

Short answer: use wide strings (for example, L"árvíztűr..." - note the L) and then write to wcout instead of cout. Windows usually interprets wide strings (2 bytes per character on Windows) as UTF-16 (or at least something very close to it), so it will work as intended. Internally, Windows always uses wide strings, which avoids the encoding problems.
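
A minimal sketch of this approach, assuming the compiler correctly decodes the literal's source encoding; the _setmode() call is an addition not mentioned above, but without it the MSVC CRT typically narrows wcout output through the current console code page:

#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main()
{
    // Tell the CRT to write UTF-16 directly to the console for stdout.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n";
    return 0;
}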

+1

First of all, the Windows console does not really support UTF-8 (code page 65001). To test this, open a UTF-8 encoded file that was saved with Notepad in the console and you will see garbled data. So to check your result, you should redirect it to a file or something like that and inspect the result there (myapp > test.txt).

Second, in C/C++ a char[] is a sequence of characters that can be interpreted in whatever way the programmer wants, but UTF-8 is a specific encoding of the Unicode character set, so there is no way (before C++11) to make sure that a sequence of characters you write is actually encoded as UTF-8. For example, I can write char p[3] = "اب", but if the compiler wants to encode this in UTF-8, it needs 5 bytes, not 3. Therefore you should use something that understands UTF-8.
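
A worked version of that byte count (this assumes a compiler with C++11 u8 literals, e.g. VS2015 or later, and a UTF-8 encoded source file; it only illustrates the arithmetic, VS2010 will not accept it):

#include <cstdio>

int main()
{
    // "اب" is 2 characters; each takes 2 bytes in UTF-8,
    // so the literal needs 2*2 + 1 (terminating '\0') = 5 bytes.
    std::printf("%zu\n", sizeof(u8"اب"));   // prints 5
    // char p[3] = u8"اب";                  // would not compile: needs 5 bytes
    return 0;
}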

I suggest using boost::locale::conv::utf_to_utf with wide string constants, e.g.:

 // requires #include <boost/locale.hpp> and linking Boost.Locale
 std::string sUTF8 =
     boost::locale::conv::utf_to_utf<char>(L"árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n");
 std::cout << sUTF8;
 // or: printf("%s", sUTF8.c_str());

This ensures that you have a UTF-8 string, but again, do not check it using the console, as it does not understand UTF-8 at all!

+1
