Writing Unicode to a file in C++

Question


I have a problem writing Unicode to a file in C++. I want to write a few smileys, the ones you get by typing ALT + NUMPAD 2, to a file with my own extension. I can display one in CMD by making a char and assigning it the value "\2", and it shows a smiley, but it does not come out that way when written to a file.

Here is the code snippet for my program:

    ofstream myfile;
    myfile.open("C:\\Users\\My Username\\test.exampleCodeFile");
    myfile << "\2";
    myfile.close();

It will write to the file, but it does not display what I want. I would show you what it displays, but StackOverflow will not let me display the character. Thanks in advance.


c++ unicode ofstream writetofile

Garrett Ratliff Apr 9 '13 at 19:47


3 answers

Remy Lebeau · Answer 1 · 2013-04-09T22:55:17+0000

ALT + NUMPAD2 is not the same as ASCII character 2, which is what your code writes to the file. ALT codes are how DOS handles non-ASCII characters. The character that CMD displays for ALT + NUMPAD2 is actually the Unicode code point U+263B "BLACK SMILING FACE". Since it is a Unicode character, it is best to encode the file using UTF-8 or UTF-16, for example:

    ofstream myfile;
    myfile.open("C:\\Users\\My Username\\test.txt");
    myfile << "\xEF\xBB\xBF"; // UTF-8 BOM
    myfile << "\xE2\x98\xBB"; // U+263B encoded as UTF-8
    myfile.close();

Or:

    ofstream myfile;
    myfile.open("C:\\Users\\My Username\\test.txt");
    myfile << "\xFF\xFE"; // UTF-16 BOM (little-endian)
    myfile << "\x3B\x26"; // U+263B encoded as UTF-16LE
    myfile.close();

Both approaches show a smiley in Notepad (provided that you use a font that supports the smiley glyph), since Notepad first reads the BOM and then decodes the Unicode data that follows accordingly.
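To show where the byte sequences above come from, here is a minimal sketch that encodes a single code point as UTF-8 and as UTF-16LE. The function names `encode_utf8` and `encode_utf16le` are made up for this illustration, and the code assumes the input is a valid Unicode scalar value (no validation is done).

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumption: cp is a valid scalar value (not a surrogate, <= 0x10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 bytes in little-endian order.
// Same assumption as above: cp is a valid scalar value.
std::string encode_utf16le(char32_t cp) {
    std::string out;
    // Append one 16-bit code unit, low byte first.
    auto put_unit = [&out](unsigned unit) {
        out += static_cast<char>(unit & 0xFF);
        out += static_cast<char>((unit >> 8) & 0xFF);
    };
    if (cp < 0x10000) {
        put_unit(static_cast<unsigned>(cp));
    } else {
        // Code points above the BMP become a surrogate pair.
        cp -= 0x10000;
        put_unit(static_cast<unsigned>(0xD800 | (cp >> 10)));
        put_unit(static_cast<unsigned>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

For U+263B, `encode_utf8(0x263B)` gives the bytes E2 98 BB and `encode_utf16le(0x263B)` gives 3B 26, matching the literals written to the file in the two snippets.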

Mark Tolonen · Answer 2 · 2013-04-09T23:51:21+0000

You must use Unicode to specify the characters you want to display. The character represented by byte 02h in the console is mapped by code page 437 (cp437) to the Unicode character U+263B. Saving the source file as UTF-8 with a BOM makes Unicode easier to use, since you can paste or type the desired characters directly instead of resorting to Unicode escape codes.

For a file stream, the stream must be configured for UTF-8. There are various ways to do this, and it depends on the compiler, but using Visual Studio 2012, a source file saved as UTF-8 with BOM, and a bit of Googling:

    #include <locale>
    #include <codecvt>
    #include <fstream>
    #include <iostream>
    #include <io.h>
    #include <fcntl.h>
    using namespace std;

    int main()
    {
        // Locale whose codecvt facet converts wchar_t to UTF-8 on output.
        const std::locale utf8_locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
        wofstream f(L"sample.txt");
        f.imbue(utf8_locale);
        f << L"\u263b我是美国人。我叫马克。" << endl;
        // Put stdout into UTF-16 mode so wcout can print the same text (MSVC-specific).
        _setmode(_fileno(stdout), _O_U16TEXT);
        wcout << L"\u263b我是美国人。我叫马克。" << endl;
    }

The contents of sample.txt, as shown in Notepad:

 ☻我是美国人。我叫马克。

Hex dump (correct UTF-8):

 E298BBE68891E698AFE7BE8EE59BBDE4BABAE38082E68891E58FABE9A9ACE5858BE380820D0A
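A hex dump like this can be produced with a few lines of C++. This is only a sketch: `read_bytes` and `to_hex` are hypothetical helper names, not part of the answer's program, and the file is opened in binary mode so that no newline translation hides the real bytes.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes. Binary mode avoids CR/LF translation,
// so the result is exactly what is stored on disk.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Render a byte string as uppercase hexadecimal, two digits per byte.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];     // high nibble
        out += digits[b & 0x0F];   // low nibble
    }
    return out;
}
```

Calling `to_hex(read_bytes("sample.txt"))` would give the dump of the file written above; for the smiley alone, `to_hex("\xE2\x98\xBB")` yields `E298BB`.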

Output to the console, cut and pasted here. Without the correct font, the console showed a placeholder glyph for every Chinese character, but the characters display correctly when pasted into SO or Notepad.

 ☻我是美国人。我叫马克。

Hans Passant · Answer 3 · 2013-04-09T23:48:29+0000

You are using the exact opposite of Unicode. The console works with an 8-bit code page, which by default on Western machines is code page 437. It corresponds to the character set of the old IBM PC character ROM and is the code page that most legacy DOS programs expect. The first set of character codes, 0 through 8, looks like this:

[image: code page 437 glyphs for codes 0 through 8]

Note the smiley for code 0x02, the one that you saw on the console. You can see the rest of the glyphs in the Wikipedia article on code page 437. The nasty problem with 8-bit character encodings is that there are so many of them. Notepad reads your file with a different code page, by default Windows-1252 on machines in Western Europe and the Americas. That code page has no glyphs for control codes, so you do not see the smiley in Notepad.
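As an illustration of the mapping described here, the following sketch lists the Unicode code points commonly used for the glyphs that code page 437 shows for codes 0x01 through 0x08. The function name `cp437_low_glyph` is hypothetical, and the table deliberately covers only this small range plus printable ASCII; anything else returns U+FFFD as a "not covered by this sketch" marker.

```cpp
// Map a byte to the Unicode code point of the glyph that code page 437
// displays for it. Only codes 0x00..0x08 and printable ASCII are handled;
// every other byte returns U+FFFD (this sketch's "not covered" marker).
char32_t cp437_low_glyph(unsigned char code) {
    static const char32_t table[9] = {
        0x0000, // 0x00: no visible glyph
        0x263A, // 0x01: WHITE SMILING FACE
        0x263B, // 0x02: BLACK SMILING FACE
        0x2665, // 0x03: BLACK HEART SUIT
        0x2666, // 0x04: BLACK DIAMOND SUIT
        0x2663, // 0x05: BLACK CLUB SUIT
        0x2660, // 0x06: BLACK SPADE SUIT
        0x2022, // 0x07: BULLET
        0x25D8, // 0x08: INVERSE BULLET
    };
    if (code < 9) {
        return table[code];
    }
    if (code >= 0x20 && code <= 0x7E) {
        return code;  // printable ASCII maps to itself
    }
    return 0xFFFD;    // outside the range this sketch covers
}
```

With this table, byte 0x02 maps to U+263B, which is why the console draws a smiley for a byte that Windows-1252 treats as a control code.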

Working with code pages is a major headache. This is why Unicode was invented.

It is possible to switch the console to a Unicode code page. It has to be an 8-bit encoding, though, another legacy of console programs that support output redirection. That makes UTF-8 the right choice. You can switch from the console itself by typing chcp 65001 before starting your program, or you can do it in your code by calling SetConsoleOutputCP(CP_UTF8);.

Another nasty detail you need to take care of: you also need to change the font used for the console. The default font is Terminal, an obsolete font that was designed to display the IBM PC glyphs but does not know beans about Unicode. Use the system menu to switch (press Alt + Space, Properties). There is not much to choose from, but Consolas or Lucida Console are suitable.

Now you can display Unicode. Writing it to a file is another story, which Remy covered.
