QString - Is UTF-8 embedded in 16 bits?

I'm very surprised. I started digging into QString::data(), trying to help another user here with a QString and ASCII problem.

I made the following code snippet that looks at each 16-bit unit of the QString data and found that letters like "À" and "ß" appear to be stored in their UTF-8 encoding, with 16 bits used for each 8-bit byte. Of course they can do what they like, but the docs say that QString is UTF-16. This looks different.

Correction: the QString docs for Qt 4.8 do not actually mention UTF-16. But they also do not claim that UTF-8 is stored in 16-bit units.

Can someone please enlighten me!?

My code is:

 QString h("AßB"); char * pt = (char*)h.data(); for(int i = 0; ;i+=2) { // get 16bit value u_int16_t s = *(u_int16_t*)(pt + i); // break condition if(s == 0) break; qDebug() << i << s << QChar(s) << h.size(); } 

And here is what qDebug() tells me:

 0 65 'A' 4
 2 195 'Ã' 4
 4 159 '' 4
 6 66 'B' 4

Note that 'ß' apparently appears in its UTF-8 encoding, but with 16 bits used for each of the two bytes of that encoding.

195 159 (0xC3 0x9F) is the UTF-8 encoding of 'ß'.

My character map says the UTF-16 representation of 'ß' should be 0x00DF, and that is what I was hoping to get.

Also note that QString::size() reports a questionable size of 4 instead of 3.
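
Just to show what I was hoping for, here is a minimal sketch, assuming a UTF-8 source file and a Qt 4 build where the plain QString(const char *) constructor widens each byte instead of decoding UTF-8; the byte escapes spell out "AßB" explicitly so the file encoding does not matter:

 #include <QString>
 #include <QDebug>

 int main()
 {
     // The UTF-8 bytes for "AßB": 0x41 0xC3 0x9F 0x42.
     const char *bytes = "A\xC3\x9F" "B";

     QString implicitCtor(bytes);                      // Qt 4 default: each byte becomes one 16-bit QChar
     QString explicitUtf8 = QString::fromUtf8(bytes);  // bytes decoded as UTF-8

     qDebug() << implicitCtor.size();   // 4 - one QChar per raw byte
     qDebug() << explicitUtf8.size();   // 3 - 'ß' collapses to the single code unit 0x00DF

     for (int i = 0; i < explicitUtf8.size(); ++i)
         qDebug() << i << explicitUtf8.at(i).unicode(); // 65, 223 (0x00DF), 66
     return 0;
 }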

1 answer

QString data is stored internally as Unicode. From the Qt docs:

 QString str = "Hello"; 

"QString converts const char * data to Unicode using the fromUtf8 () function.

Here's the link: QString Class

Strangely, I do not see any toUtf16() method, although it does have toUtf8().
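
For what it's worth, a small sketch: even without a method literally called toUtf16(), the raw 16-bit code units can be read through QString::utf16(), which returns a '\0'-terminated array of unsigned shorts, so none of the char * casting from the question is needed:

 #include <QString>
 #include <QDebug>

 int main()
 {
     const QString s = QString::fromUtf8("A\xC3\x9F" "B"); // "AßB"
     const ushort *units = s.utf16();    // '\0'-terminated array of UTF-16 code units
     for (int i = 0; units[i] != 0; ++i)
         qDebug() << i << units[i];      // 65, 223 (0x00DF), 66
     return 0;
 }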

In addition, UTF-16 is not Unicode:

"The Unicode standard encodes characters in the range U + 0000..U + 10FFFF, which corresponds to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16 or UTF-32), each character will be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code blocks, or one 32-bit code.

From: the Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM
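
A rough sketch of what that means for a single code point such as U+00DF ('ß'), using Qt's conversion helpers (the byte escape spells out the UTF-8 form explicitly, so the source file encoding does not matter):

 #include <QString>
 #include <QDebug>

 int main()
 {
     // "\xC3\x9F" is the UTF-8 byte sequence for the single code point U+00DF ('ß').
     QString s = QString::fromUtf8("\xC3\x9F");
     qDebug() << s.toUtf8().size();  // 2 - UTF-8 uses one to four 8-bit bytes per character
     qDebug() << s.size();           // 1 - UTF-16 uses one or two 16-bit code units per character
     qDebug() << s.toUcs4().size();  // 1 - UTF-32 always uses one 32-bit code unit per character
     return 0;
 }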

Edit:

I know that MSVC used to compile both Unicode and non-Unicode builds. From M$:

"Unicode UTF-16 encoding

Represents Unicode characters as sequences of 16-bit integers. Your application can use the UnicodeEncoding class to convert characters to and from UTF-16 encoding.

UTF-16 is used widely as a native format, as in the Microsoft .NET char type, the Windows WCHAR type and other common types. The most frequently used Unicode code points take only one UTF-16 code unit (2 bytes). Supplementary Unicode characters, U+10000 and above, require two UTF-16 surrogate code units."

Found on .NET Framework 3.5 - Using Unicode Encoding.
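
A quick sketch of that last point, using QString::fromUcs4() and an arbitrary supplementary code point (U+10348) as the example:

 #include <QString>
 #include <QDebug>

 int main()
 {
     // U+10348 lies above U+FFFF, so UTF-16 has to spend a surrogate pair on it.
     const uint codepoint[] = { 0x10348 };
     QString s = QString::fromUcs4(codepoint, 1);
     qDebug() << s.size();          // 2 - two 16-bit code units
     qDebug() << s.toUcs4().size(); // 1 - but still a single Unicode code point
     qDebug() << s.at(0).unicode() << s.at(1).unicode(); // 55296 57160 = 0xD800 0xDF48
     return 0;
 }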

So M$ often does use UTF-16 internally. Unicode itself is the 21-bit list of characters, and the various UTF formats are ways of encoding them.

How does this relate to Ubuntu? M$ encodes things internally as UTF-16 and calls it Unicode.

Frank Osterfeld apparently found the problem in your code: the compiler used the source file's encoding when it created the string literal. Bizarrely, the 8-bit UTF-8 values were then fitted into 16-bit code units, producing the wrong sequence of characters! I wonder whether you would see an 'A' with a tilde above it (code unit 195 is U+00C3, 'Ã') if you printed the QString, or whether it would be converted back to the same UTF-8 bytes before you see it, even though the compiler obviously did not understand them. In any case, Frank and you were able to prove that on Ubuntu Qt uses UTF-16. It also seems that data is almost never kept as raw, unencoded Unicode characters (i.e. the 21-bit values); the "Unicode" builds are apparently UTF-16 builds.
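
If that is what happened, here is a rough sketch of the suspected mechanism; it assumes a UTF-8 source file and a Qt 4 style build where the plain QString(const char *) constructor widens each byte the same way fromLatin1() does:

 #include <QString>
 #include <QByteArray>
 #include <QDebug>

 int main()
 {
     // What the compiler stores for the literal "AßB" when the source file is UTF-8:
     const char *literal = "A\xC3\x9F" "B";

     // The suspected Qt 4 behaviour: each byte widened to one 16-bit QChar,
     // which is exactly what fromLatin1() does.
     QString widened = QString::fromLatin1(literal); // U+0041 U+00C3 U+009F U+0042
     QByteArray back = widened.toLatin1();           // narrow each QChar back to one byte

     qDebug() << (back == QByteArray(literal));      // true - the UTF-8 bytes survive the round trip,
                                                     // so a UTF-8 terminal would show "AßB" again
     return 0;
 }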