Polish characters in std::string

I have a problem. I am writing an application in Polish (with Polish characters, of course) for Linux, and when compiling I get 80 warnings. They are all either “warning: multi-character character constant” or “warning: case label value exceeds maximum value for type”. I am using std::string.

What should I use in place of the std::string class?

Please, help. Thank you in advance. Best wishes.

+4
3 answers

std::string does not define a specific encoding. Thus, you can store any sequence of bytes in it. There are subtleties that you need to know about:

  • .c_str() returns a null-terminated buffer. If your encoding allows null bytes inside the text, do not pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
  • A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in the history of computing. Note that a wchar_t does not necessarily hold a full character either, depending on UTF-16 normalization.
  • .size() and .length() will return the number of bytes, not the number of characters.
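The three points above can be seen with a few lines of standard C++. This is a minimal sketch that assumes the string holds valid UTF-8; the helper names `utf8_length` and `c_string_length` are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Counts code points in a string ASSUMED to hold valid UTF-8.
// Continuation bytes look like 10xxxxxx, so every other byte starts
// a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Length as seen by a function that only receives a const char*
// (hypothetical stand-in for such an API): it stops at the first null byte.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For the word "żółw" written as the UTF-8 bytes `"\xC5\xBC\xC3\xB3\xC5\x82w"`, `.size()` is 7 while `utf8_length` gives 4. For `std::string("ab\0cd", 5)`, `.size()` is 5 but `c_string_length` gives 2, which is the truncation described in the first point.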

[edit] The case label warnings relate to point (2). You are using a switch with multibyte characters on a char type, which cannot hold more than one byte. [/edit]
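One way around the case label problem, without leaving the standard library, is to decode the bytes into code points first and switch on those. The following is only a sketch: `decode_utf8` and `count_polish_diacritic_vowels` are hypothetical helpers, and the decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Decodes a string ASSUMED to hold well-formed UTF-8 into code points.
// No validation is done; malformed input gives meaningless values.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Each case label is a whole code point, so it fits in char32_t
// and the compiler has nothing to warn about.
int count_polish_diacritic_vowels(const std::string& utf8) {
    int count = 0;
    for (char32_t cp : decode_utf8(utf8)) {
        switch (cp) {
            case 0x0105: // ą
            case 0x0119: // ę
            case 0x00F3: // ó
                ++count;
                break;
            default:
                break;
        }
    }
    return count;
}
```

Called with the UTF-8 bytes for "ąęóz" (`"\xC4\x85\xC4\x99\xC3\xB3z"`), `count_polish_diacritic_vowels` returns 3.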

Therefore, you can use std::string in your application, provided that you respect these three rules. As a consequence, there are subtleties with the STL, including std::find(), which works on bytes rather than characters. For proper Unicode support you need smarter string matching algorithms, because normalization forms let the same text be stored as different byte sequences.
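To make the normalization issue concrete, here is a small sketch using only `std::string::find`. The constants and the `find_bytes` wrapper are illustrative names, not library API. Both constants display as "ó", but one is the precomposed code point and the other is a plain "o" followed by a combining acute accent.

```cpp
#include <cstddef>
#include <string>

// Two UTF-8 spellings of the same visible letter "ó":
const std::string kPrecomposedO = "\xC3\xB3";   // U+00F3
const std::string kDecomposedO  = "o\xCC\x81";  // U+006F followed by U+0301

// Plain byte-wise search. Returns a BYTE offset (not a character index),
// or std::string::npos when the byte sequence is absent.
std::size_t find_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

Searching "góra" stored as `"g\xC3\xB3ra"` for `kPrecomposedO` succeeds at byte offset 1, but the same search in the decomposed spelling `"go\xCC\x81ra"` returns `npos` even though both strings look identical on screen. Similarly, looking for "w" in the UTF-8 bytes of "żółw" returns 6, not 3. A Unicode-aware library would normalize both sides before comparing.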

However, when writing applications in any language that uses non-ASCII characters (if you are paranoid, treat anything outside [0, 128) as such), you need to be aware of the encodings used by your different sources of text data.

  • The source-file encoding might not be specified, and it might be changed by compiler options. Every string literal is subject to this rule. I think that is why you get the warnings.
  • You will receive a variety of character encodings from external sources (files, user input, etc.). When the source states its encoding, or you can obtain it some other way (e.g., by asking the user who imports the data), things are easier. Many (newer) Internet protocols impose ASCII or UTF-8 unless otherwise specified.

These two issues are not addressed by any particular string class. You just need to convert all external sources to your internal encoding. I suggest UTF-8 all the time, but especially on Linux because of its native support. I strongly recommend putting your string literals in a message file so that you can forget about issue (1) and deal only with issue (2).
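As an illustration of converting at the boundary, the sketch below turns ISO-8859-2 (Latin-2) input, a legacy encoding often used for Polish text, into UTF-8. It is deliberately incomplete: it handles only ASCII and the 18 Polish letters and replaces every other byte with '?'. The function names are hypothetical; real code would normally hand this job to a conversion library such as iconv or ICU.

```cpp
#include <string>

// Appends code point cp as UTF-8. ASSUMES cp <= 0x7FF, which covers
// every value produced by the table below.
void append_utf8(std::string& out, unsigned cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Maps one ISO-8859-2 byte to a Unicode code point.
// Only ASCII and the Polish letters are covered in this sketch.
unsigned latin2_to_code_point(unsigned char byte) {
    if (byte < 0x80) return byte;
    switch (byte) {
        case 0xA1: return 0x0104; // Ą
        case 0xB1: return 0x0105; // ą
        case 0xC6: return 0x0106; // Ć
        case 0xE6: return 0x0107; // ć
        case 0xCA: return 0x0118; // Ę
        case 0xEA: return 0x0119; // ę
        case 0xA3: return 0x0141; // Ł
        case 0xB3: return 0x0142; // ł
        case 0xD1: return 0x0143; // Ń
        case 0xF1: return 0x0144; // ń
        case 0xD3: return 0x00D3; // Ó
        case 0xF3: return 0x00F3; // ó
        case 0xA6: return 0x015A; // Ś
        case 0xB6: return 0x015B; // ś
        case 0xAC: return 0x0179; // Ź
        case 0xBC: return 0x017A; // ź
        case 0xAF: return 0x017B; // Ż
        case 0xBF: return 0x017C; // ż
        default:   return '?';    // not handled by this sketch
    }
}

// Converts external Latin-2 data to the internal UTF-8 representation.
std::string latin2_to_utf8(const std::string& input) {
    std::string out;
    for (unsigned char byte : input) {
        append_utf8(out, latin2_to_code_point(byte));
    }
    return out;
}
```

With this in place, the Latin-2 bytes `"\xBF\xF3\xB3w"` ("żółw") become the UTF-8 bytes `"\xC5\xBC\xC3\xB3\xC5\x82w"`, and the rest of the program only ever sees UTF-8.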

I do not suggest using std::wstring on Linux, because 100% of the native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to and from std::wstring non-stop, and you will eventually get something wrong, on top of making everything slow(er).

If you are writing a Windows application, I would suggest the exact opposite, because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to and from const wchar_t*.

Some "portable" libraries and languages use different representations depending on the platform: UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I remember reading about this trick in Python's reference implementation, but the article was pretty old, so I am not sure whether it is still true.

+4

On Linux, you should use the multibyte string class provided by the framework you are using.

I would recommend Glib::ustring from the glibmm framework, which stores strings in UTF-8 encoding. If the source files are in UTF-8, then using a multibyte string literal in the code is as simple as:

 Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");

But you cannot write a switch / case statement over multibyte characters using char. I would recommend a series of ifs instead. You can also switch on glibmm's gunichar, but it is not very readable (you can get the correct Unicode values for the characters from the table in the Polish alphabet article on Wikipedia):

 #include <glibmm.h>
 #include <iostream>

 using namespace std;

 int main()
 {
     Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
     int small_polish_vowels_with_diacritics_count = 0;

     // Glib::ustring::size() counts characters, and operator[] yields a gunichar.
     for (Glib::ustring::size_type i = 0; i < alphabet.size(); i++) {
         switch (alphabet[i]) {
             case 0x0105: // ą
             case 0x0119: // ę
             case 0x00f3: // ó
                 small_polish_vowels_with_diacritics_count++;
                 break;
             default:
                 break;
         }
     }

     cout << "There are " << small_polish_vowels_with_diacritics_count
          << " small Polish vowels with diacritics in this string.\n";
     return 0;
 }

You can compile this using:

 g++ progname.cc -o progname `pkg-config --cflags --libs glibmm-2.4`
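The series of ifs suggested above can also be sketched with plain std::string, comparing whole UTF-8 byte sequences instead of single char values. This is an assumption-laden illustration rather than part of glibmm: the helper names are invented, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if the bytes of `letter` occur in `text` starting at byte offset `pos`.
// ASSUMES pos < text.size().
bool has_letter_at(const std::string& text, std::size_t pos,
                   const std::string& letter) {
    return text.compare(pos, letter.size(), letter) == 0;
}

// Counts ą, ę and ó in a string ASSUMED to hold valid UTF-8.
// A UTF-8 lead byte never appears as a continuation byte, so a match
// can only start on a character boundary.
int count_diacritic_vowels_with_ifs(const std::string& text) {
    const std::string a_ogonek = "\xC4\x85"; // ą
    const std::string e_ogonek = "\xC4\x99"; // ę
    const std::string o_acute  = "\xC3\xB3"; // ó
    int count = 0;
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        if (has_letter_at(text, pos, a_ogonek)) {
            ++count;
        } else if (has_letter_at(text, pos, e_ogonek)) {
            ++count;
        } else if (has_letter_at(text, pos, o_acute)) {
            ++count;
        }
    }
    return count;
}
```

This version needs no extra library, at the cost of spelling each letter as its byte sequence.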
+1

std::string is for ASCII strings. Since your Polish strings do not fit in ASCII, you should use std::wstring.

-1
