Printing Unicode characters in C++

I need to print some Unicode characters on a Linux terminal using iostream. Strange things happen. When I write:

 cout << "\u2780"; 

I get: ➀, which is what I want. However, if I write:

 cout << '\u2780'; 

I get: 14851712.

The problem is that I don’t know which character to print at compile time. So I would like to do something like:

 int x;
 // some calculations...
 cout << (char)('\u2780' + x);

Which prints: . Using wcout or wchar_t instead does not work either. How to get the right print?

From what I found on the Internet, it seems relevant to mention that I use the g++ 4.7.2 compiler straight from the Debian Wheezy repository.

+7
4 answers

The Unicode character \u2780 is out of range for the char data type. You should have received this compiler warning telling you so (at least my g++ 4.7.3 gives it):

 test.cpp:6:13: warning: multi-character character constant [-Wmultichar] 

If you want to work with characters such as U+2780 as single units, you will have to use the wide character type wchar_t or, if you are lucky enough to be able to use C++11, char32_t or char16_t. Note that a single 16-bit unit is not enough to represent the entire range of Unicode characters.

If that does not work for you, it is probably because the default "C" locale does not support non-ASCII output. To fix this, call setlocale at the start of the program; that way you can output the full range of characters supported by the user's locale (which may or may not include all the characters you use):

 #include <clocale>
 #include <iostream>
 using namespace std;
 int main() {
     setlocale(LC_ALL, "");
     wcout << L'\u2780';
     return 0;
 }
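For the case in the question, where the character is only known at run time, the same approach can be sketched with an offset added to the wide character. The helper names `circled_digit` and `print_circled_digit` are made up for illustration. The sketch assumes that wchar_t stores Unicode code points (true with glibc on Linux, not guaranteed by the standard) and that the user's locale can encode the character.

```cpp
#include <cassert>
#include <clocale>
#include <iostream>

// Hypothetical helper: returns U+2780 + x, one of the ten circled
// sans-serif numbers U+2780..U+2789, for x in 0..9.
// Assumption: wchar_t holds Unicode code points (as with glibc on Linux).
wchar_t circled_digit(int x) {
    assert(x >= 0 && x < 10);
    return static_cast<wchar_t>(L'\u2780' + x);
}

// Hypothetical helper: adopts the user's locale, then prints the
// chosen character through the wide stream.
void print_circled_digit(int x) {
    std::setlocale(LC_ALL, "");  // leave the default "C" locale
    std::wcout << circled_digit(x) << L'\n';
}
```

With glibc, a program that writes to wcout should not also write to cout, because the orientation of stdout is fixed by the first output operation.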
+6

When you write

 cout << "\u2780"; 

The compiler converts \u2780 to the encoding of that character in the execution character set. That is probably UTF-8, so the string ends up with four bytes (three for the character, one for the null terminator).

If you want to generate a character at run time, you need some way to perform at run time the same conversion to UTF-8 that the compiler performs at compile time.


C++11 provides the convenient wstring_convert template and codecvt facets that can do this; however, libstdc++, the standard library implementation that ships with gcc, has not yet implemented them (as of gcc 4.8). The following shows how to use these facilities, but you will need to either use a different standard library implementation or wait until libstdc++ implements them.

 #include <codecvt>
 #include <iostream>
 #include <locale>
 int main() {
     char32_t base = U'\u2780';
     // Converts UTF-32 code points to a UTF-8 byte string.
     std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
     std::cout << convert.to_bytes(base + 5) << '\n';
 }

You can also use any other method of producing UTF-8 that you have available. For example, iconv, ICU, and manual use of the pre-C++11 codecvt_byname facets would all work. (I do not show examples of these because that code would be more involved than the simple code permitted by wstring_convert.)
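For a single code point, the conversion can also be written by hand without any library support. The following is a minimal sketch of that idea, not something taken from the answer above; the function name `encode_utf8` is hypothetical, and the result only displays correctly if the terminal expects UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Precondition: cp is a Unicode scalar value
// (at most U+10FFFF and not a surrogate U+D800..U+DFFF).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {            // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, the question's run-time case becomes `std::cout << encode_utf8(U'\u2780' + x);`, which works with the narrow stream and needs no locale setup.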


An alternative that would work for a small number of characters would be to create an array of strings using literals.

 char const *special_character[] = {
     "\u2780", "\u2781", "\u2782", "\u2783", "\u2784",
     "\u2785", "\u2786", "\u2787", "\u2788", "\u2789"
 };
 std::cout << special_character[i] << '\n';
+4

The program prints an integer because of C++11 §2.14.3/1:

A multi-character literal or ordinary character literal containing a single c-char that cannot be represented in the execution character set is conditionally supported, is of type int, and has a value defined by the implementation.

The execution character set is what char can represent, i.e. ASCII.

You got 14851712, or 0xE29E80 in hexadecimal, which is the UTF-8 representation of U+2780. Putting UTF-8, a multibyte encoding, into an int is crazy and stupid, but that is what you get from a "conditionally supported, implementation-defined" feature.
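The arithmetic can be checked with a small sketch. The helper names `pack_bytes` and `low_byte` are hypothetical; `pack_bytes` mirrors the value observed with GCC in the question, while the standard leaves that value implementation-defined.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: packs up to sizeof(int) bytes into an int,
// first byte most significant. This reproduces the value the question
// observed for '\u2780' with GCC; other compilers may differ.
int pack_bytes(const unsigned char* bytes, std::size_t count) {
    assert(count <= sizeof(int));
    unsigned int value = 0;  // unsigned arithmetic avoids signed overflow
    for (std::size_t i = 0; i < count; ++i) {
        value = (value << 8) | bytes[i];
    }
    return static_cast<int>(value);
}

// Hypothetical helper: the single byte that survives a cast to char,
// as in (char)('\u2780' + x) from the question.
unsigned char low_byte(int value) {
    return static_cast<unsigned char>(value & 0xFF);
}
```

Packing the UTF-8 bytes E2 9E 80 gives 14851712. The cast to char in the question keeps only the last byte, 0x80 plus the offset, which is a lone UTF-8 continuation byte and therefore not a valid character on its own.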

To get the UTF-32 value, use U'\u2780'. The leading U specifies the type char32_t and the encoding UTF-32 (i.e. up to 31 bits, and no surrogate pairs). The \u that follows introduces a universal-character-name containing the code point. To get a value supposedly compatible with wcout, use L'\u2780', but that does not necessarily use a Unicode run-time value, and it might not get more than two bytes of storage.

As for reliably manipulating and printing Unicode code points, as other answers have noted, the C++ standard is not quite there yet. Joni's answer is the best way, but it still assumes that the compiler and the user's environment use the same locale, which is often not the case.

You can also specify UTF-8 strings in the source with u8"\u2780" and force the run-time environment to UTF-8 with something like std::locale::global(std::locale("en_US.UTF-8"));. But that still has rough edges. Joni suggests using the C interface std::setlocale from <clocale> instead of the C++ interface std::locale::global from <locale>, which works around the C++ interface being broken in GCC on OS X and possibly on other platforms. These issues are fairly platform-sensitive, so your Linux distribution may well have added a patch to its own GCC package.

0

On Linux, I have successfully printed arbitrary Unicode directly, in the most naive way:

 std::cout << "ΐ , Α, Β, Γ, Δ, ,Θ , Λ, Ξ, ... ±, ... etc";
0