C++11 std::cout << "string literal in UTF-8" to the Windows cmd console? (Visual Studio 2015)

Summary: What should I do to correctly print a string literal to the cmd console through the std::cout stream when the source file is saved in UTF-8 (Windows code page 65001)?

Motivation: As an experiment, I would like to modify the excellent Catch unit-testing framework so that it displays my texts with accented characters. The modification should be simple and reliable, and it should also be useful for other languages and working environments so that the author can accept it as an enhancement. Or, if you know Catch and there is an alternative solution, could you post it?

Details: Let's start with the Czech counterpart of "The quick brown fox...":

    #include <iostream>
    #include "windows.h"

    using namespace std;

    int main()
    {
        cout << "\n-------------------------- default cmd encoding = 852 -------------------\n";
        cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

        cout << "\n-------- Windows Central European (1250) set for the cmd console --------\n";
        SetConsoleOutputCP(1250);
        std::cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << std::endl;

        cout << "\n------------- Windows UTF-8 (65001) set for the cmd console -------------\n";
        SetConsoleOutputCP(CP_UTF8);
        std::cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << std::endl;
    }

It prints the following (with the console font set to Lucida Console):

[screenshot of the console output]

The default cmd encoding is 852, the default Windows encoding is 1250, and the source code was saved using the 65001 encoding (UTF-8 with BOM). SetConsoleOutputCP(1250); changes the cmd encoding programmatically, in the same way as chcp 1250 does.

Observation: When the console encoding is set to 1250, the UTF-8 string literal is printed correctly. I believe this can be explained, but it is really strange. Is there any decent, humane, general way to solve the problem?

Update: The "narrow string literal" is stored using the Windows-1250 encoding in my case (the native Windows encoding for Central European countries). This seems to be independent of the encoding of the source code: the compiler stores the literal in the native Windows encoding. Because of that, switching cmd to that encoding gives the desired output. This is an ugly workaround, but how can I get the native Windows encoding programmatically (to pass it to SetConsoleOutputCP(cpX))? What I need is a constant that is valid for the machine where the compilation took place. It should not be the native encoding of the machine on which the executable runs.
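On the "programmatically" part of the update, here is a minimal sketch, assuming the Win32 call GetACP() is acceptable. The helper name ansi_code_page is hypothetical. Note the limitation: GetACP() reports the ANSI code page of the machine the program is running on, so it matches the build machine only when both are configured the same way. It does not provide the compile-time constant asked for above.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the ANSI code page of the running system
// (e.g. 1250 on a Central European Windows), or 0 on non-Windows builds.
unsigned ansi_code_page()
{
#ifdef _WIN32
    return GetACP();   // run-time value, NOT the code page used at compile time
#else
    return 0;          // assumption: no code-page concept outside Windows
#endif
}
```

On Windows, the result could be passed on as SetConsoleOutputCP(ansi_code_page());.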

C++11 also introduced u8"the UTF-8 string literal", but it does not seem to work together with SetConsoleOutputCP(CP_UTF8);

2 answers

This is a partial answer, found by following the link from luk32 and confirming Melebius' comments (see below the question). It is not a complete answer, and I will be glad to accept your follow-up comment.

I have just found the UTF-8 Everywhere Manifesto, which touches on the problem. Point 17, "Q: How do I write a UTF-8 string literal in my C++ code?", says (explicitly for the Microsoft C++ compiler as well):

However, the most straightforward way is to simply write the string as-is and save the source file encoded in UTF-8:

  "∃y ∀x ¬(x ≺ y)" 

Unfortunately, MSVC converts it to some ANSI code page, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct code page and will not touch your strings. However, this makes it impossible to use Unicode identifiers and wide string literals (which you will not be using anyway).

I really like the manifesto. To put it briefly, in blunt words and possibly oversimplifying, it says:

Ignore wstring, wchar_t, etc. Ignore code pages. Ignore string literal prefixes such as L, u, U, u8. Use UTF-8 everywhere. Write all literals "naturally". Make sure they are also stored that way in the compiled binary.

If the following code is saved as UTF-8 without BOM...

    #include <iomanip>
    #include <iostream>
    #include "windows.h"

    using namespace std;

    int main()
    {
        SetConsoleOutputCP(CP_UTF8);
        cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

        int cnt = 0;
        for (unsigned int c : "Příšerně žluťoučký kůň úpěl ďábelské ódy!")
        {
            cout << hex << setw(2) << setfill('0') << (c & 0xff);
            ++cnt;
            if (cnt % 16 == 0)     cout << endl;
            else if (cnt % 8 == 0) cout << " | ";
            else if (cnt % 4 == 0) cout << " ";
            else                   cout << ' ';
        }
        cout << endl;
    }

It prints the following (the bytes should be UTF-8 encoded)...

[screenshot of the console output]

When the source is saved as UTF-8 with BOM, it prints a different result...

[screenshot of the console output]
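The difference between the two dumps can also be detected from inside the program. The following is a minimal sketch and not part of the experiment above; the name narrow_literals_are_utf8 is hypothetical. It relies on the fact that ř (U+0159) is the two bytes C5 99 in UTF-8, whereas Windows-1250 stores it as a single byte. The file containing it is assumed to be saved as UTF-8.

```cpp
// Hypothetical compile-time probe: true when the compiler stored the narrow
// literal "ř" as its UTF-8 byte sequence C5 99 (plus the terminating NUL).
// If the compiler transcoded the literal to a single-byte code page,
// sizeof is 2 and the probe yields false.
constexpr bool narrow_literals_are_utf8()
{
    return sizeof("ř") == 3
        && static_cast<unsigned char>("ř"[0]) == 0xC5
        && static_cast<unsigned char>("ř"[1]) == 0x99;
}
```

A project that depends on UTF-8 literals could turn this into a build-time check with static_assert(narrow_literals_are_utf8(), "narrow literals are not UTF-8");.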

However, the problem remains: how to set the console encoding programmatically so that the UTF-8 string is printed correctly.
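One direction that sidesteps the console code page entirely is to convert the UTF-8 bytes to UTF-16 and hand them to the Win32 function WriteConsoleW. This is only a sketch of that idea, not something tested in this answer, and it does not go through std::cout, so it would not fit Catch without further changes. The names utf8_to_utf16 and write_utf8_to_console are hypothetical. The decoder is simplified: it replaces malformed input with U+FFFD but does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0xFFFD;          // replacement character by default
        std::size_t len = 1;
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }

        if (len > 1 && i + len > in.size()) {
            cp = 0xFFFD; len = 1;      // sequence truncated at end of input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {  // not a continuation byte
                cp = 0xFFFD; len = 1;
                break;
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                       // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Hypothetical helper: print UTF-8 text independently of the console code page.
void write_utf8_to_console(const std::string& utf8)
{
    std::cout.flush();                 // keep ordering with earlier cout output
#ifdef _WIN32
    const std::u16string wide = utf8_to_utf16(utf8);
    DWORD written = 0;
    // WriteConsoleW works only on a real console, not on redirected output.
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide.data(),
                  static_cast<DWORD>(wide.size()), &written, nullptr);
#else
    std::cout << utf8;                 // assumption: the terminal is UTF-8
#endif
}
```

Called as write_utf8_to_console("P\xC5\x99\xC3\xAD\xC5\xA1" "ern\xC4\x9B\n");, the bytes are spelled out explicitly, so the result does not depend on how the compiler treats the source encoding.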

I gave up. The cmd console is simply crippled, and it is not worth fixing from the outside. I am accepting my own answer only to close the question. If anyone finds a decent solution related to the Catch unit-testing framework (it may be completely different), I will be glad to accept his or her answer instead.


The MSVC compiler tries to encode the constant strings in your code using the local encoding. In your case, it uses code page 852. Thus, even though your cmd output tries to read and print the string using code page 1250, the string is actually stored using code page 852. This mismatch between storing and reading produces the incorrect output.
One way to solve the problem is to save the string in a file encoded with code page 1250. Visual Studio Code provides this functionality. You can then read the file as binary (i.e., byte by byte) into a char buffer and output the buffer.

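    #include <fstream>
    #include <iostream>
    #include <vector>

    int main()
    {
        // Open in binary mode, positioned at the end so that tellg() yields the file size.
        std::ifstream file("src.txt", std::ios::in | std::ios::binary | std::ios::ate);
        if (!file.is_open()) {
            std::cout << "File not opened." << std::endl;
            return 1;
        }

        const std::streamsize size = file.tellg();
        // One extra zero-initialized byte keeps the buffer null-terminated.
        std::vector<char> memblock(static_cast<std::size_t>(size) + 1, '\0');
        file.seekg(0, std::ios::beg);
        file.read(memblock.data(), size);
        file.close();

        // The bytes are written unchanged; the console decodes them with its current code page.
        std::cout << memblock.data() << std::endl;
    }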

[screenshot of the console output]
