This is not a comparison with 0xc0, it is a bitwise AND with 0xc0.
The mask 0xc0 is 11000000 in binary, so what the AND does is extract only the top two bits:
        abcdefgh
    AND 11000000
        --------
      = ab000000
The result is then compared with 0x80 (binary 10000000). In other words, the if checks whether the top two bits of the byte are 10.
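For instance (a minimal sketch of my own, not from the original question), you can watch the check distinguish a continuation byte from a lead byte:

```c
#include <stdio.h>

int main(void)
{
    unsigned char b = 0xbf;  /* binary 10111111: a continuation byte  */
    unsigned char c = 0xc3;  /* binary 11000011: a two-byte lead byte */

    /* Masking with 0xc0 keeps only the top two bits of each byte. */
    printf("%d\n", (b & 0xc0) == 0x80);  /* prints 1: top bits are 10 */
    printf("%d\n", (c & 0xc0) == 0x80);  /* prints 0: top bits are 11 */
    return 0;
}
```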
"Why?" I hear you ask. Well, that's a good question. The answer is that in UTF-8, all bytes starting with the bit pattern 10 are subsequent bytes of a multibyte sequence:
    UTF-8 Range        Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx
    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx
    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx
    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx
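As a concrete example (mine, not from the original answer): U+00E9 (é) is 11101001 in binary, which falls in the two-byte row. Filling in the pattern 110yyyxx 10xxxxxx with yyy = 000 and xxxxxxxx = 11101001 gives 11000011 10101001, i.e. the bytes 0xc3 0xa9. Note that only the second byte starts with 10.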
So what this little fragment does is go through each byte of your UTF-8 string and count all the bytes that are not continuation bytes (i.e., it gets the length of the string in characters rather than in bytes). See the Wikipedia article on UTF-8 for more details; Joel Spolsky's article is a great primer.
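A minimal sketch of that kind of loop (the function name utf8_strlen is mine, not from the original question):

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by skipping
   every byte whose top two bits are 10 (a continuation byte). */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80)
            count++;  /* not a continuation byte: a new character starts here */
    }
    return count;
}
```

For example, utf8_strlen("héllo") returns 5 even though the string occupies 6 bytes, because the é contributes one lead byte and one continuation byte.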
Interestingly, by the way, you can classify bytes in a UTF-8 stream as follows:
- With the high bit set to 0, this is a single-byte value.
- With the two most significant bits set to 10, this is a continuation byte.
- Otherwise, this is the first byte of a multibyte sequence, and the number of leading 1 bits indicates how many bytes the sequence contains in total (110... means two bytes, 1110... means three bytes, and so on); see the sketch after this list.
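Here is a sketch of that classification in C (utf8_sequence_length is my own name for it, not from the answer):

```c
/* Returns 0 for a continuation byte, otherwise the total number of
   bytes in the sequence this byte starts; -1 for a byte that cannot
   appear as a lead byte in well-formed UTF-8. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: single-byte value */
    if ((b & 0xc0) == 0x80) return 0;   /* 10xxxxxx: continuation byte */
    if ((b & 0xe0) == 0xc0) return 2;   /* 110xxxxx: two-byte lead     */
    if ((b & 0xf0) == 0xe0) return 3;   /* 1110xxxx: three-byte lead   */
    if ((b & 0xf8) == 0xf0) return 4;   /* 11110xxx: four-byte lead    */
    return -1;                          /* e.g. 0xff, never valid      */
}
```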
paxdiablo, Oct 12 2018