std::string does not define a specific encoding. Thus, you can store any sequence of bytes in it. There are subtleties that you need to know about:
1. .c_str() returns a null-terminated buffer. If your encoding allows null bytes, do not pass the string to functions that take a const char* parameter without a length, or your data will be truncated.
2. A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in the history of computing. Note that wchar_t does not necessarily hold a full character either: with UTF-16, a code point outside the Basic Multilingual Plane needs a surrogate pair, and an unnormalized character may be spread over several code points.
3. .size() and .length() return the number of bytes, not the number of characters.
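As a minimal sketch of rules (1) and (3), assuming the string holds well-formed UTF-8: the helper names count_code_points and c_api_length are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes `s` holds well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // lead byte or ASCII byte
    }
    return n;
}

// Length as seen by a C API that receives only a const char*:
// everything after the first null byte is invisible to it.
std::size_t c_api_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For std::string("ab\0cd", 5), .size() is 5 but c_api_length sees only 2 bytes. For "caf\xC3\xA9" (the UTF-8 bytes of "café"), .size() is 5 while count_code_points gives 4. Note that code points are still not user-perceived characters once combining marks are involved.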
[edit] The warnings about case labels are related to issue (2). You are using a switch statement with multibyte characters on a value of type char, which cannot hold more than one byte. [/edit]
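One possible way around the case-label warnings, sketched here under the assumption that the data is UTF-8, is to decode a full code point first and switch on a char32_t. Both decode_at and classify are hypothetical helpers written for this example, and the decoder is deliberately simplified.

```cpp
#include <cstddef>
#include <string>

// Decodes the UTF-8 sequence starting at s[pos] into one code point.
// Simplified: returns U+FFFD on truncated or malformed input and does
// not reject overlong forms or surrogates.
char32_t decode_at(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return 0xFFFD;
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t extra = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    if (pos + extra >= s.size()) return 0xFFFD;  // sequence is cut off
    for (std::size_t i = 1; i <= extra; ++i) {
        unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}

// The case labels are char32_t literals, so each label is a single value
// regardless of how many bytes its UTF-8 form takes.
const char* classify(char32_t cp) {
    switch (cp) {
        case U'a':      return "ascii a";
        case U'\u00E9': return "e with acute";
        case U'\u20AC': return "euro sign";
        default:        return "other";
    }
}
```

Writing the labels as \u escapes also keeps them independent of the source-file encoding.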
Therefore, you can use std::string in your application, provided that you respect these three rules. Some subtleties involving the STL, including std::find(), follow from them. To support Unicode properly you need smarter string matching algorithms, because of normalization forms.
However, when writing applications in any language that uses non-ASCII characters (if you are paranoid, treat that as anything outside [0, 128)), you need to be aware of the encodings used by the different sources of text data.
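To illustrate why byte-wise search is not enough, here is a small sketch using the two common spellings of "é". The names kComposed, kDecomposed and contains_bytes are invented for this example; the byte sequences are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```cpp
#include <string>

// "é" as one precomposed code point (NFC): U+00E9 -> C3 A9.
const std::string kComposed = "\xC3\xA9";
// "é" as 'e' plus a combining acute accent (NFD): U+0065 U+0301 -> 65 CC 81.
const std::string kDecomposed = "e\xCC\x81";

// Byte-wise substring search: true only if the exact byte sequence occurs.
bool contains_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Searching the decomposed "café" for kComposed finds nothing even though both render identically, while searching it for a plain "e" does match. A reasonable approach is to normalize both sides to the same form before comparing, typically with a Unicode library such as ICU rather than hand-written code.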
1. The source-file encoding might not be specified, and it can be changed with compiler options. Any string literal is subject to this rule. I guess that is why you are getting warnings.
2. You will receive a variety of character encodings from external sources (files, user input, etc.). When the source specifies its encoding, or you can get it from somewhere else (e.g., by asking the user who imports the data), this is easier. Many (newer) Internet protocols impose ASCII or UTF-8 unless otherwise specified.
These two issues are not addressed by any particular string class. You just need to convert every external source to your internal encoding. I suggest UTF-8 all the time, but especially on Linux because of its native support. I strongly recommend putting your string literals in a message file, so that you can forget about issue (1) and deal only with issue (2).
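As a rough sketch of "convert at the boundary", assuming the external source is known to be ISO-8859-1 (Latin-1) and the internal encoding is UTF-8: latin1_to_utf8 is a hypothetical helper, and other source encodings would need a real conversion library.

```cpp
#include <string>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8. Each Latin-1 byte maps
// directly to the code point of the same value (U+0000..U+00FF), so bytes
// below 0x80 are copied and the rest become two-byte sequences.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte doubles
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the Latin-1 bytes "caf\xE9" become the UTF-8 bytes "caf\xC3\xA9". After this step the rest of the program only ever sees one encoding.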
I do not suggest using std::wstring on Linux, because 100% of the native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to and from std::wstring constantly, and you will eventually get something wrong, on top of making everything slow(er).
If you were writing a Windows application, I would suggest the exact opposite, because all native APIs use const wchar_t* signatures. The ANSI versions of those functions perform an internal conversion to and from const wchar_t*.
Some "portable" libraries and languages use a different representation depending on the platform: UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about this trick in the Python reference implementation, but the article was quite old, so I am not sure whether it is still true.