What you need to know:
The C++ standard defines something called the execution character set, a term for the encoding that string and character literals will have in the binary produced by the compiler. You can read about it in section 1.1 Character sets of chapter 1 Overview in The C Preprocessor manual at http://gcc.gnu.org .
Question:
What will be created as a result of the "\u00fc" string literal?
Answer:
It depends on what the execution character set is. In the case of gcc (which is what you are using) it is UTF-8 by default, unless you specify something else with the -fexec-charset option. You can read about this and other options that control the preprocessing phase in section 3.11 Options Controlling the Preprocessor of chapter 3 GCC Command Options in the GCC manual at http://gcc.gnu.org . Now that we know the execution character set is UTF-8, we know that "\u00fc" will be translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc .
The QString constructor that takes char * calls QString QString::fromAscii ( const char * str, int size = -1 ) , which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) (if any codec was set) or does the same as QString QString::fromLatin1 ( const char * str, int size = -1 ) (if no codec was set).
Question:
What codec will the QString constructor use to decode the two-byte sequence ( 0xc3 0xbc ) it receives?
Answer:
By default, no codec is set with QTextCodec::setCodecForCStrings() , so Latin1 will be used to decode the byte sequence. Since 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should already be familiar to you, since it was taken directly from the answer to your previous question), we get a QString with these two characters.
You should not use the QDebug class to output anything outside of ASCII. You have no guarantee of what you will get.
Test program:

    #include <QtCore>

    // Print the label of the raw input and the code points of the decoded QString in hex.
    void dbg(char const * rawInput, QString s) {
        QString codepoints;
        foreach(QChar chr, s) {
            codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
        }
        qDebug() << "Input: " << rawInput
                 << ", "
                 << "Unicode codepoints: " << codepoints;
    }

    int main(int argc, char *argv[]) {
        QCoreApplication app(argc, argv);

        qDebug() << "system name:" << QLocale::system().name();

        // i == 5 matches no case, so the last pass repeats the settings of case 4.
        for (int i = 1; i <= 5; ++i) {
            switch(i) {
            case 1:
                qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
                break;
            case 2:
                qDebug() << "\nWith codecForCStrings set to UTF-8\n";
                QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
                break;
            case 3:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
                break;
            case 4:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
                break;
            }

            qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                               ? QTextCodec::codecForCStrings()->name()
                                               : "NOT SET");
            qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
                                             ? QTextCodec::codecForLocale()->name()
                                             : "NOT SET");

            qDebug() << "\n1. Using QString::QString(char const *)";
            dbg("\\u00fc", QString("\u00fc"));
            dbg("\\xc3\\xbc", QString("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

            qDebug() << "\n2. Using QString::fromUtf8(char const *)";
            dbg("\\u00fc", QString::fromUtf8("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

            qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
            dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
        }

        return app.exec();
    }
Output with mingw 4.4.0 on Windows XP:

    system name: "pl_PL"

    Without codecForCStrings (default is Latin1)

    codecForCStrings: "NOT SET"
    codecForLocale: "System"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "102 13d "
    Input: \xc3\xbc , Unicode codepoints: "102 13d "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    With codecForCStrings set to UTF-8

    codecForCStrings: "UTF-8"
    codecForLocale: "System"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "102 13d "
    Input: \xc3\xbc , Unicode codepoints: "102 13d "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

    codecForCStrings: "NOT SET"
    codecForLocale: "UTF-8"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

I want to thank thiago, cbreak, peppe and heinz from the #qt IRC channel on freenode.org for pointing out the issues involved here and helping me understand them.