Are there 6 octet UTF-8 sequences?

Can UTF-8 encode 5- or 6-byte sequences, allowing all Unicode characters to be encoded? The standards I have found seem to conflict. I need to support every Unicode character, not just those in the range U+0000..U+10FFFF.

(All quotes are from RFC 3629.)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
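The bit layout described in that passage can be sketched in a few lines of Python. This is only an illustration of the 1-to-4-octet scheme, not a replacement for a real codec; the function name `utf8_encode` is made up for this example, and the surrogate check comes from RFC 3629 rather than from the quoted paragraph.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the RFC 3629 bit layout (1 to 4 octets)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp < 0x80:
        return bytes([cp])          # single octet: 0xxxxxxx
    if cp < 0x800:
        n, lead = 2, 0xC0           # 110xxxxx
    elif cp < 0x10000:
        n, lead = 3, 0xE0           # 1110xxxx
    else:
        n, lead = 4, 0xF0           # 11110xxx
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation octet: 10xxxxxx
        cp >>= 6
    out.append(lead | cp)               # leftover high bits go in the lead octet
    return bytes(reversed(out))


# Compare against Python's built-in codec for a character from each length class.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Note that the last test character, U+1F600, lies outside the BMP and still fits in 4 octets.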

So not all possible characters can be encoded using UTF-8? Does this mean that I cannot encode characters from planes other than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.

Does this mean that we cannot encode UTF-8 values with 5 or 6 octets (or even 4-octet values outside the above range)?
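One way to check the "never appear" claim is by brute force: encode every valid scalar value with Python's built-in codec and see which octet values are never produced. This is a sketch that assumes Python 3's strict `utf-8` codec follows RFC 3629. In the older 6-octet layout, F8 to FB were the lead octets of 5-octet sequences and FC to FD of 6-octet sequences, which is why excluding F5 to FF rules those forms out.

```python
# Collect every octet value that occurs in the UTF-8 encoding of any
# Unicode scalar value (all code points except the surrogates).
seen = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable
    seen.update(chr(cp).encode("utf-8"))

never = sorted(set(range(256)) - seen)
print([f"{b:02X}" for b in never])
```

The loop visits about 1.1 million code points, so it may take a second or so to run.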

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

A look at the previous RFC confirms this ... they have reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.

So, are these sequences allowed by the ISO/IEC 10646 definition, but not by the RFC 3629 definition? Which should I follow?
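To make the buffer-sizing concern concrete, here is a rough sketch of how many octets a character number would need under the older layout that went up to U+7FFFFFFF (the ranges are those of RFC 2279, the predecessor of RFC 3629). The function name `utf8_len_legacy` is hypothetical.

```python
def utf8_len_legacy(cp: int) -> int:
    """Octets needed for cp under the older 1-to-6-octet UTF-8 layout."""
    # Upper bound of each length class: 7, 11, 16, 21, 26 and 31 payload bits.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if 0 <= cp <= limit:
            return n
    raise ValueError("character number outside 0..0x7FFFFFFF")


# Worst case per character: 4 octets if input is capped at U+10FFFF,
# 6 octets if it is not.
print(utf8_len_legacy(0x10FFFF), utf8_len_legacy(0x7FFFFFFF))
```

A buffer sized at 4 octets per character is therefore only safe when the code rejects anything above U+10FFFF before encoding.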

Thanks in advance.

+6
unicode utf-8
3 answers

There are no Unicode characters beyond U+10FFFF; the BMP covers U+0000 through U+FFFF.

UTF-8 is well defined for the whole range U+0000 through U+10FFFF.

+7

Both UTF-8 and UTF-16 can encode all Unicode characters. What UTF-8 is not allowed to encode are the high and low surrogate halves (which are used by UTF-16) and values above U+10FFFF, which are not legal Unicode.

Note that the BMP ends at U+FFFF.
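These restrictions are easy to observe with a strict decoder. The sketch below leans on Python 3's built-in `utf-8` codec, which rejects surrogates, values above U+10FFFF, and 5-octet forms; the helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


max_ok = is_valid_utf8(b"\xf4\x8f\xbf\xbf")          # U+10FFFF, the last valid code point
too_big = is_valid_utf8(b"\xf4\x90\x80\x80")         # would decode to U+110000
surrogate = is_valid_utf8(b"\xed\xa0\x80")           # would decode to U+D800
five_octets = is_valid_utf8(b"\xf8\x88\x80\x80\x80") # old-style 5-octet form

print(max_ok, too_big, surrogate, five_octets)  # True False False False
```

Other languages and libraries may be more lenient by default, so it is worth checking how a given decoder treats these inputs.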

+1

I would say no: Unicode code points are valid in the range [0, 0x10FFFF], and they map to 1-4 octets. So if you come across a 5- or 6-octet UTF-8 sequence, it does not encode a valid code point, and certainly nothing is assigned there. I am a little puzzled as to why such sequences are in the ISO standard; I could not find an explanation.

It is interesting to consider, though, that the range might someday be extended past U+10FFFF. 0x10FFFF allows for just over a million code points, but there are a lot of characters out there, and it will depend on how many eventually get encoded. (For sanity's sake, let's hope a million is plenty!) UTF-32 could handle more code points and, as you discovered, so could UTF-8. It would actually be UTF-16 that is out of luck: more surrogate pairs would have to be carved out somewhere in the code point space.
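The UTF-16 ceiling mentioned above follows from simple arithmetic on the surrogate ranges, sketched here in Python. The helper `to_surrogates` is a hypothetical name used only for illustration.

```python
HIGH = range(0xD800, 0xDC00)  # 1024 high (lead) surrogates
LOW = range(0xDC00, 0xE000)   # 1024 low (trail) surrogates

# Every (high, low) pair addresses one code point above the BMP.
pairs = len(HIGH) * len(LOW)
max_code_point = 0x10000 + pairs - 1
assert max_code_point == 0x10FFFF


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000  # 20-bit value
    return HIGH.start + (offset >> 10), LOW.start + (offset & 0x3FF)


print([hex(u) for u in to_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

Since the last pair already maps to U+10FFFF, going higher would need new surrogate ranges, which is the limitation the answer points at.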

0