Convert QString to QByteArray encoded in UTF-8 or Latin1

I would like to convert a QString to either a UTF-8 or a Latin1 QByteArray, but today I get everything as UTF-8.

I am testing this with a character in the upper Latin1 range, above 0x7f; the German ü is a good example.

If I do this:

    QString name("\u00fc"); // U+00FC = ü
    QByteArray utf8;
    utf8.append(name);
    qDebug() << "utf8" << name << utf8.toHex();

    QByteArray latin1;
    latin1.append(name.toLatin1());
    qDebug() << "Latin1" << name << latin1.toHex();

    QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
    QByteArray encodedString = codec->fromUnicode(name);
    qDebug() << "ISO 8859-1" << name << encodedString.toHex();

I get the following output.

    utf8 "ü" "c3bc"
    Latin1 "ü" "c3bc"
    ISO 8859-1 "ü" "c3bc"

As you can see, I get the UTF-8 bytes 0xc3 0xbc everywhere, where I would expect the Latin1 byte 0xfc for cases 2 and 3.

I expected to get something like this:

    utf8 "ü" "c3bc"
    Latin1 "ü" "fc"
    ISO 8859-1 "ü" "fc"

What's going on here?

/Thanks




This code was built and run on an Ubuntu 10.04 based system.

    $> uname -a
    Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
    $> env | grep LANG
    LANG=en_US.utf8

And if I try to use

 utf8.append(name.toUtf8()); 

I get this output:

    utf8 "ü" "c383c2bc"
    Latin1 "ü" "c3bc"
    ISO 8859-1 "ü" "c3bc"

So the Latin1 result is actually UTF-8 encoded, and the UTF-8 result is double encoded...

Does it depend on some system settings?


If I run this (I could not get .name() to build):

    qDebug() << "system name:" << QLocale::system().name();
    qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
    qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();

Then I get the following:

    system name: "en_US"
    codecForCStrings: 0x0
    codecForLocale: "System"

Solution

If I specify that it is UTF-8 I am using, so that the different classes know about it, then it works.

    QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

    qDebug() << "system name:" << QLocale::system().name();
    qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
    qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();

    QString name("\u00fc");
    QByteArray utf8;
    utf8.append(name);
    qDebug() << "utf8" << name << utf8.toHex();

    QByteArray latin1;
    latin1.append(name.toLatin1());
    qDebug() << "Latin1" << name << latin1.toHex();

    QTextCodec *codec = QTextCodec::codecForName("latin1");
    QByteArray encodedString = codec->fromUnicode(name);
    qDebug() << "ISO 8859-1" << name << encodedString.toHex();

Then I get this output:

    system name: "en_US"
    codecForCStrings: "UTF-8"
    codecForLocale: "UTF-8"
    utf8 "ü" "c3bc"
    Latin1 "ü" "fc"
    ISO 8859-1 "ü" "fc"

And this looks like what I expected.

c++ qt utf-8 latin1 qbytearray
1 answer

What you need to know:

  • execution character set

There is something called the execution character set in the C++ standard, a term that describes what the output of string and character literals will be in the binary produced by the compiler. You can read about it in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor manual at http://gcc.gnu.org .

Question:
What will be produced as the output of the "\u00fc" string literal?

Answer:
It depends on what the execution character set is. In the case of gcc (which is what you are using) it is UTF-8 by default, unless you specify something different with the -fexec-charset option. You can read about this and other options that control the preprocessing phase in section 3.11 Options Controlling the Preprocessor of chapter 3 GCC Command Options in the GCC manual at http://gcc.gnu.org . Now that we know the execution character set is UTF-8, we know that "\u00fc" will be translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc .

The QString constructor that takes a char * calls QString QString::fromAscii ( const char * str, int size = -1 ) , which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) (if any codec has been set) or does the same thing as QString QString::fromLatin1 ( const char * str, int size = -1 ) (if no codec has been set).

Question:
What codec will the QString constructor use to decode the two-byte sequence ( 0xc3 0xbc ) it receives?

Answer:
By default, no codec is set with QTextCodec::setCodecForCStrings() , so Latin1 will be used to decode the byte sequence. Since 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should already be familiar to you, since it was taken directly from this answer to your previous question), we get a QString containing these two characters.

You should not use the QDebug class to output anything outside of ASCII. You have no guarantee of what you will get.

Test program:

    #include <QtCore>

    void dbg(char const * rawInput, QString s) {
        QString codepoints;
        foreach(QChar chr, s) {
            codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
        }
        qDebug() << "Input: " << rawInput << ", " << "Unicode codepoints: " << codepoints;
    }

    int main(int argc, char *argv[]) {
        QCoreApplication app(argc, argv);

        qDebug() << "system name:" << QLocale::system().name();

        for (int i = 1; i <= 5; ++i) {
            switch(i) {
            case 1:
                qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
                break;
            case 2:
                qDebug() << "\nWith codecForCStrings set to UTF-8\n";
                QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
                break;
            case 3:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
                break;
            case 4:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
                break;
            }

            qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                ? QTextCodec::codecForCStrings()->name()
                : "NOT SET");
            qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
                ? QTextCodec::codecForLocale()->name()
                : "NOT SET");

            qDebug() << "\n1. Using QString::QString(char const *)";
            dbg("\\u00fc", QString("\u00fc"));
            dbg("\\xc3\\xbc", QString("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

            qDebug() << "\n2. Using QString::fromUtf8(char const *)";
            dbg("\\u00fc", QString::fromUtf8("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

            qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
            dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
        }

        return app.exec();
    }

Output with mingw 4.4.0 on Windows XP:

    system name: "pl_PL"

    Without codecForCStrings (default is Latin1)

    codecForCStrings: "NOT SET"
    codecForLocale: "System"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "102 13d "
    Input: \xc3\xbc , Unicode codepoints: "102 13d "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    With codecForCStrings set to UTF-8

    codecForCStrings: "UTF-8"
    codecForLocale: "System"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "102 13d "
    Input: \xc3\xbc , Unicode codepoints: "102 13d "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

    codecForCStrings: "NOT SET"
    codecForLocale: "UTF-8"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"

    1. Using QString::QString(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

    2. Using QString::fromUtf8(char const *)
    Input: \u00fc , Unicode codepoints: "fc "
    Input: \xc3\xbc , Unicode codepoints: "fc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

    3. Using QString::fromLocal8Bit(char const *)
    Input: \u00fc , Unicode codepoints: "c3 bc "
    Input: \xc3\xbc , Unicode codepoints: "c3 bc "
    Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

I want to thank thiago, cbreak, peppe and heinz from the #qt freenode.org IRC channel for pointing out the issues involved here and helping me understand them.
