Unicode String Indexing in C++

I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is encoded in Unicode this gives the expected result. However, when I index a string in C++, it works as long as the characters are ASCII, but when I put a Unicode character in the string and index it, I get an octal representation like /201 in the output. For instance:

string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";    
cout << ramp[5] << "\n";

Output:

ÐðŁłŠšÝýÞþŽž
/201

Why is this happening, and how can I access this character as a readable glyph, or how can I convert the octal representation to the actual character?

+3
5 answers

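Standard C++ is not equipped for proper handling of Unicode, which gives you problems like the one you observed.

The problem is that C++ predates Unicode by a comfortable margin. This means that even your string literal will be interpreted in an implementation-defined manner, because those characters are not part of the basic source character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).

C++98 does not mention Unicode at all. It mentions wchar_t, and wstring based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more harm than good...

Microsoft defined wchar_t as 16 bits, which was enough for the Unicode code points at that time. However, Unicode has since been extended beyond the 16-bit range... and the 16-bit Windows wchar_t is no longer "wide", because you need two of them to represent characters beyond the BMP. The Microsoft documentation is also notoriously ambiguous about where wchar_t means UTF-16 (a multi-unit encoding with surrogate pairs) or UCS-2 (a wide encoding with no support for characters beyond the BMP).

All the while, a Linux wchar_t is 32 bits, which is wide enough for UTF-32...

C++11 made significant improvements, adding char16_t and char32_t along with their associated string variants to remove the ambiguity, but it is still not fully equipped for Unicode operations.

As just one example, try converting the German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions, handling one character in and one character out at a time, cannot do.)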
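The one-in, one-out limitation can be seen in a minimal sketch. The helper name naive_upper is made up for illustration, and the sketch assumes the default "C" locale and a UTF-8 encoded input, where 'ß' is the two bytes C3 9F:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: uppercase a string the classic way, one char at a time.
std::string naive_upper(std::string s)
{
    const std::size_t before = s.size();
    for (char& c : s) {
        // std::toupper maps exactly one value to one value. The cast avoids
        // undefined behaviour for bytes >= 0x80 on platforms where char is signed.
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    // The length can never change, so expanding 'ß' to "SS" is impossible here.
    assert(s.size() == before);
    return s;
}
```

Under those assumptions, passing the UTF-8 bytes of "Fuß" uppercases 'F' and 'u' and leaves the two bytes of 'ß' untouched, so the result is still four bytes long rather than the expected "FUSS".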

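However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for special characters in source code, you will have to use u8"", u"" and U"" to enforce interpretation of the string literal as UTF-8, UTF-16 and UTF-32 respectively, using octal/hexadecimal escapes or relying on your compiler to handle non-ASCII-7 encodings appropriately.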
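To make the three literal prefixes concrete, here is a small sketch that counts how many code units each encoding needs for a character. The struct and function names are invented for this example, and the characters are written as universal character names (\u0161 for 'š', \U0001F600 for an emoji outside the BMP) so the result does not depend on the source file encoding:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record: code units needed per encoding for one character.
struct UnitCounts { std::size_t utf8, utf16, utf32; };

// sizeof on a literal includes the terminating zero, hence the "- 1".
UnitCounts units_for_s_caron()
{
    UnitCounts n = { sizeof(u8"\u0161") - 1,
                     sizeof(u"\u0161") / sizeof(char16_t) - 1,
                     sizeof(U"\u0161") / sizeof(char32_t) - 1 };
    assert(n.utf32 == 1); // UTF-32 always uses one unit per code point
    return n;
}

UnitCounts units_for_emoji()
{
    UnitCounts n = { sizeof(u8"\U0001F600") - 1,
                     sizeof(u"\U0001F600") / sizeof(char16_t) - 1,
                     sizeof(U"\U0001F600") / sizeof(char32_t) - 1 };
    assert(n.utf32 == 1);
    return n;
}
```

For 'š' this gives 2 UTF-8 bytes, 1 UTF-16 unit and 1 UTF-32 unit; for the emoji it gives 4, 2 and 1. Only the UTF-32 literal can be indexed one character at a time.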

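Even then you will get an integer value from std::cout << ramp[5], because to C++ a character is just an integer with semantic meaning. ICU's ustream.h defines operator<< for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short were suddenly interpreted as a character. You need the C-API functions u_fputs()/u_printf()/u_fprintf() for that.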

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>

#include <iostream>

int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc, this prints:

ÐðŁłŠšÝýÞþŽž
353
š

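(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the other ways to index a Unicode string.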
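The code unit versus code point distinction can be sketched without ICU, using a plain std::u16string. The helper name code_point_at is hypothetical. ICU itself offers code-point-aware accessors such as UnicodeString::char32At() and countChar32(), which are what you would normally reach for instead:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: read the code point that starts at code unit index i
// of a UTF-16 string, combining a surrogate pair if there is one.
char32_t code_point_at(const std::u16string& s, std::size_t i)
{
    assert(i < s.size());
    const char16_t unit = s[i];
    const bool is_lead = unit >= 0xD800 && unit <= 0xDBFF;
    if (is_lead && i + 1 < s.size()) {
        const char16_t trail = s[i + 1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
            return 0x10000
                 + ((static_cast<char32_t>(unit) - 0xD800) << 10)
                 + (static_cast<char32_t>(trail) - 0xDC00);
        }
    }
    return unit; // BMP character, or an unpaired surrogate returned as-is
}
```

For the string u"\u0161\U0001F600", plain indexing with s[1] yields 0xD83D, which is only the first half of the emoji's surrogate pair, while code_point_at(s, 1) yields the full code point 0x1F600.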

+10

C++ has no useful native Unicode support. You will almost certainly need an external library such as ICU.

+5

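To access code points individually, use u32string, which represents the string as a sequence of UTF-32 code units of type char32_t.

u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
// std::cout has no operator<< for u32string, so print the numeric code point instead.
cout << static_cast<unsigned long>(ramp[5]) << "\n";   // prints 353, i.e. U+0161 (š)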
+2

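As for why this happens, cplusplus.com has this to say about std::string:

Note that this class handles bytes independently of the encoding used: if used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).

So, in short: use ICU if you are not using C++11; use u32string if you are.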

0

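In my opinion, the best solution is to do any work on strings using iterators. I can't imagine a scenario where you really have to index a string: if you need indexing like ramp[5] in your example, then the 5 is usually computed in another part of the code, and you usually scan all the preceding characters anyway. That is why the standard library uses iterators in its API.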
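Reaching the n-th character of a UTF-8 string means walking over everything before it, which a minimal sketch makes visible. The helper name nth_code_point is hypothetical, and the code assumes the input is valid UTF-8 (it performs no validation):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode the n-th code point (0-based) of a UTF-8 string.
// Assumes s is valid UTF-8 and holds more than n characters.
char32_t nth_code_point(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    // Skip n characters: step past each lead byte and its continuation bytes.
    for (; n > 0 && i < s.size(); --n) {
        ++i;
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) ++i;
    }
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    int extra = 0;      // number of continuation bytes that follow
    char32_t cp = lead; // payload bits collected so far
    if ((lead & 0xE0) == 0xC0)      { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    for (int k = 1; k <= extra && i + k < s.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    return cp;
}
```

With the question's string encoded as UTF-8, ramp[5] is the single byte 0x81 (octal 201, the second byte of 'Ł'), while nth_code_point(ramp, 5) is 0x161, the code point of 'š'.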

A similar problem occurs if you want to get the size of the string. Should it be the number of characters (or code points), or just the number of bytes? Usually you need the size to allocate a buffer, so the number of bytes is more useful. You only very rarely need the number of Unicode characters.

If you want to handle UTF-8 encoded strings using iterators, I would definitely recommend UTF8-CPP.
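The two notions of size can be compared with a short sketch. The function name count_code_points is made up for this example, and it assumes valid UTF-8 input; it counts code points, which is not always the same as the number of characters a user sees:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Assumes valid UTF-8; every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    assert(n <= s.size()); // there can never be more code points than bytes
    return n;
}
```

For the first six characters of the question's string, size() reports 12 bytes, which is what a buffer allocation needs, while count_code_points reports 6.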

0