Difference between MBCS and UTF-8 on Windows

Question

I am reading about character sets and encodings on Windows. I noticed that the Visual Studio compiler has two flags (for C++) called MBCS and UNICODE. What is the difference between the two? I also don't understand how UTF-8 is conceptually different from an MBCS encoding. In addition, I found the following quote on MSDN:

Unicode is a 16-bit character encoding

This contradicts everything I have read about Unicode. I thought that Unicode could be encoded with various encodings such as UTF-8 and UTF-16. Can someone shed some more light on this confusion?

+50

windows unicode character-encoding mbcs

Naveen Jul 21 '10 at 11:11

4 answers

_MBCS and _UNICODE are macros to determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor will map _tcsclen to a different version according to two macros: _MBCS and _UNICODE.

 _UNICODE and _MBCS not defined: strlen
 _MBCS defined: _mbslen
 _UNICODE defined: wcslen

To explain the difference between these string length functions, consider the following example.
Suppose you have a machine running a Simplified Chinese version of Windows that uses GBK (code page 936), and you compile a source file saved in GBK encoding and run it.

 printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
 printf("%d\n", strlen("I爱你M"));
 printf("%d\n", wcslen((const wchar_t*)"I爱你M"));

The result will be 4 6 3.

Here is the hexadecimal representation of I爱你M in GBK.

 GBK: 49 B0 AE C4 E3 4D 00

_mbslen knows that this string is encoded in GBK, so it interprets the bytes correctly and returns the correct result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.

strlen only looks for the 0x00 terminator and counts every byte before it, so it returns 6.

wcslen assumes that this byte array is encoded in UTF-16LE and reads two bytes per character, so it sees 3 characters: 49 B0, AE C4, E3 4D.

As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. The string literal supplies only a single 00 byte, so the result is not guaranteed to be 3 unless the byte that follows it also happens to be 00.
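The three counts can be reproduced without Windows by walking the byte array directly. The following is only a rough sketch: the function names are made up for illustration, the GBK rule is simplified to "a lead byte in 0x81..0xFE starts a two-byte character", and the 16-bit count assumes the array length is known rather than scanning for a 00 00 terminator.

```cpp
#include <cstddef>

// "I爱你M" in GBK, as listed above (terminating 00 not included).
static const unsigned char kGbkBytes[] = {0x49, 0xB0, 0xAE, 0xC4, 0xE3, 0x4D};
static const std::size_t kGbkSize = sizeof(kGbkBytes);

// strlen-style view: every byte is one unit.
inline std::size_t count_bytes(const unsigned char* /*data*/, std::size_t size) {
    return size;
}

// _mbslen-style view (simplified GBK rule, an assumption of this sketch):
// a lead byte in 0x81..0xFE consumes the following byte as well.
inline std::size_t count_gbk_chars(const unsigned char* data, std::size_t size) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < size; ++count) {
        const bool is_lead = data[i] >= 0x81 && data[i] <= 0xFE;
        i += (is_lead && i + 1 < size) ? 2 : 1;
    }
    return count;
}

// wcslen-style view with a 16-bit wchar_t: every two bytes form one unit.
inline std::size_t count_16bit_units(const unsigned char* /*data*/, std::size_t size) {
    return size / 2;
}
```

Called on kGbkBytes, these return 6, 4 and 3 respectively, matching strlen, _mbslen and wcslen in the example.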

+13

Jichao Oct 22 '12 at 12:14

MBCS stands for Multibyte Character Set and describes any character set in which a character may be encoded in more than 1 byte.

ASCII and the single-byte ANSI code pages (such as Windows-1252) are not multibyte.

UTF-8, however, is a multibyte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).

However, UTF-8 is just one of several possible encodings of the Unicode character set. Notably, UTF-16 is another, and it happens to be the encoding used by Windows / .NET (IIRC). Here is the difference between UTF-8 and UTF-16:

UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.

Therefore, it is not true that Unicode is a 16-bit character encoding. It is closer to a 21-bit encoding, since its code points range from U+0000 to U+10FFFF.
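The size rules and the 21-bit figure can be checked with a few lines of code. This is a small sketch with made-up helper names; it assumes the input is a valid Unicode code point and does not reject surrogates or values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs for one code point.
inline std::size_t utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of bytes UTF-16 needs: one 16-bit unit inside the BMP,
// a surrogate pair (two units) above it.
inline std::size_t utf16_length(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Number of bits required to store a value.
inline unsigned bits_needed(std::uint32_t value) {
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}
```

For example, utf8_length(0x7231) is 3 and utf16_length(0x7231) is 2 for 爱 (U+7231), while a code point outside the BMP such as U+1F600 takes 4 bytes in both. bits_needed(0x10FFFF) is 21, which is where the "21-bit" figure comes from.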

+11

stakx Jul 21 '10 at 11:17

As a footnote to the other answers, MSDN has a document called Generic-Text Mappings in TCHAR.H with handy tables that show how the _UNICODE and _MBCS preprocessor directives change the definitions of various C / C++ types.

As for the terms “Unicode” and “Multi-Byte Character Set”, others have already described their effects. I just want to emphasize that both are Microsoft-speak for some very specific things. (That is, they mean something less general and more Windows-specific than one would expect from a non-Microsoft understanding of text internationalization.) These exact phrases recur and tend to get their own sections or subsections in Microsoft technical documentation, such as Text and Strings in Visual C++.

+4

Chris May 12 '13 at 1:21

dan04 · Accepted Answer · 2010-07-21 13:42

I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between the two?

Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page), and one that takes wchar_t parameters (in UTF-16).

 int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
 int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

Each of these pairs of functions also has a macro without a suffix, which depends on whether the UNICODE macro is defined.

 #ifdef UNICODE
 #define MessageBox MessageBoxW
 #else
 #define MessageBox MessageBoxA
 #endif

To go with this, the TCHAR type is defined to abstract away the character type used by the API functions.

 #ifdef UNICODE
 typedef wchar_t TCHAR;
 #else
 typedef char TCHAR;
 #endif

This, however, was a bad idea. You should always explicitly specify the type of character.
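A portable sketch of the same selection pattern makes the trade-off easier to see. Everything prefixed Demo or DEMO_ below is hypothetical and stands in for a real "A"/"W" pair and the UNICODE macro; none of it is part of the Windows headers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an "A"/"W" function pair.
inline std::size_t DemoLengthA(const char* s)    { return std::strlen(s); }
inline std::size_t DemoLengthW(const wchar_t* s) { return std::wcslen(s); }

// Same selection pattern as above, keyed on a hypothetical DEMO_UNICODE macro.
#ifdef DEMO_UNICODE
typedef wchar_t DemoTChar;
#define DemoLength DemoLengthW
#define DEMO_TEXT(s) L##s
#else
typedef char DemoTChar;
#define DemoLength DemoLengthA
#define DEMO_TEXT(s) s
#endif

// Generic-text style: which function runs depends on a project-wide macro.
inline std::size_t generic_length() {
    const DemoTChar* text = DEMO_TEXT("abc");
    return DemoLength(text);
}

// Explicit style: the character type is visible at the call site.
inline std::size_t explicit_length() {
    return DemoLengthW(L"abc");
}
```

Both functions return 3 here, but only explicit_length tells the reader which character type is involved; generic_length silently switches types when the macro is defined.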

I don’t understand how UTF-8 is conceptually different from MBCS encoding?

MBCS stands for Multibyte Character Set. To the literal-minded, it would seem that UTF-8 qualifies.

But on Windows, "MBCS" refers only to the character encoding that can be used with the "A" version of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987) and 950 (Big5), but NOT UTF-8.

To use UTF-8, you need to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the “A” functions actually do, which makes me wonder why Windows doesn't just support UTF-8.

This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
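On Windows, MultiByteToWideChar with the CP_UTF8 code page performs the UTF-8 to UTF-16 step. As a rough illustration of what that conversion involves, here is a portable hand-written decoder. It is a sketch only: utf8_to_utf16 is a made-up name, the code assumes mostly well-formed input, it does not validate continuation bytes or reject overlong forms, and it emits U+FFFD for lead bytes it cannot interpret.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units (illustrative, not hardened).
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;
            continue;
        }
        if (i + extra >= in.size()) {
            // Sequence is cut off at the end of the input.
            out.push_back(static_cast<char16_t>(0xFFFD));
            break;
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For instance, the UTF-8 bytes of "I爱你M" (49 E7 88 B1 E4 BD A0 4D) decode to the four UTF-16 units 0049 7231 4F60 004D, which is the form a "W" function expects.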

Unicode is a 16-bit character encoding
This contradicts everything I have read about Unicode.

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common of which are UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings such as GB18030, UTF-7, and UTF-EBCDIC.)

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone and UTF-8 was used only on Plan 9. So at the time, UCS-2 was Unicode.
