UTF-8 and Unicode, what's with 0xC0 and 0x80?

Question

I read about Unicode and UTF-8 in the last couple of days, and I often come across a bitwise comparison like this:

    int strlen_utf8(char *s) {
        int i = 0, j = 0;
        while (s[i]) {
            if ((s[i] & 0xc0) != 0x80)
                j++;
            i++;
        }
        return j;
    }

Can someone explain the comparison with 0xc0 and 0x80, and whether it is checking the most significant bits?

Thanks!

EDIT: I meant ANDed, not compared; I used the wrong word ;)

unicode utf-8

vdsf Oct 12 2018-10-12T00:

1 answer

paxdiablo · Accepted Answer · 2010-10-12 03:51

It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.

The bit mask 0xc0 is 11 00 00 00 in binary, so the AND extracts only the top two bits:

        ab cd ef gh
    AND 11 00 00 00
        -- -- -- --
      = ab 00 00 00

The result is then compared with 0x80 (binary 10 00 00 00). In other words, the if statement checks whether the top two bits of the byte are not equal to 10.
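To make the masking concrete, here is a minimal sketch of the AND step applied to a single byte. The helper name top_two_bits is hypothetical and not part of the question's code.

```c
/* Hypothetical helper: keep only the top two bits of a byte.
   0xC0 is 1100 0000 in binary, so the low six bits are cleared. */
static unsigned char top_two_bits(unsigned char b)
{
    return (unsigned char)(b & 0xC0);
}
```

For example, 'A' (0x41, binary 01 00 00 01) masks to 0x40, the lead byte 0xC3 masks to 0xC0, and the continuation byte 0xA9 masks to 0x80, which is the only one of the three that the if statement skips.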

“Why?”, I hear you ask. Well, that’s a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are continuation bytes of a multibyte sequence:

                        UTF-8
    Range               Encoding  Binary value
    -----------------   --------  --------------------------
    U+000000-U+00007f   0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                        10xxxxxx

    U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                        10yyyyxx
                        10xxxxxx

    U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                        10zzyyyy
                        10yyyyxx
                        10xxxxxx
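The table can be read as an encoding recipe. The following is a rough sketch of that recipe in C, not a production encoder: the name utf8_encode is hypothetical, and as a simplifying assumption it does not reject surrogate code points (U+D800 to U+DFFF), which strict UTF-8 disallows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of cp into out (room for
   4 bytes) and returns the number of bytes written, or 0 if cp is above
   U+10FFFF. Assumption: surrogates are not checked in this sketch. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

Every byte after the first is built as 0x80 | (six payload bits), which is why all continuation bytes carry the 10 prefix that the 0xc0/0x80 test looks for.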

So what this little snippet does is go through every byte of your UTF-8 string and count all the bytes that are not continuation bytes (i.e. it gets the length of the string in code points rather than bytes, as advertised). See Wikipedia's UTF-8 article for more detail and Joel Spolsky's excellent article for a primer.
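As a variation on the snippet from the question, the same counting loop can be written with const, size_t and an explicit unsigned char cast. This is only a sketch under the assumption that the input is valid, NUL-terminated UTF-8; the name utf8_count_code_points is hypothetical.

```c
#include <stddef.h>

/* Hypothetical variant of strlen_utf8: counts every byte that is not a
   continuation byte. Assumes s is valid, NUL-terminated UTF-8; malformed
   input is not detected. */
static size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Cast first so the mask is applied to a value in 0..255. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "h\xc3\xa9llo" (the word "héllo" encoded as UTF-8), strlen reports 6 bytes while this function reports 5, because the byte 0xA9 is a continuation byte and is skipped.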

An interesting aside, by the way: you can classify bytes in a UTF-8 stream as follows:

With the high bit set to 0, it's a single-byte value.
With the two high bits set to 10, it's a continuation byte.
Otherwise, it's the first byte of a multibyte sequence, and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc.).
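The three cases above could be sketched as a single function that inspects one byte. The name utf8_sequence_length and the convention of returning 0 for bytes that cannot start a sequence are assumptions made for this illustration.

```c
/* Hypothetical classifier: returns the total length in bytes of the
   sequence that starts with b, or 0 if b cannot start a sequence
   (a continuation byte, or a 11111xxx byte that UTF-8 never uses). */
static int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: single-byte value */
    if ((b & 0xC0) == 0x80) return 0;   /* 10xxxxxx: continuation byte */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: two-byte sequence */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: three-byte sequence */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx: four-byte sequence */
    return 0;                           /* 11111xxx: not valid in UTF-8 */
}
```

Each test uses a mask one bit wider than the prefix it looks for, so the 0 bit that terminates the run of leading 1 bits is checked as well.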
