Why does UTF-8 use more than one byte to represent some characters?

I recently read an article on character encoding, and I have a question about a specific point made there.

In the first figure, the author shows several characters, their code points in different character sets, and how they are encoded in different encoding formats. For example, the character é has code point E9. In ISO-8859-1 it is encoded as the single byte E9. In UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using two bytes, C3 A9.

My question is: why is this necessary? The character fits in one byte, so why does UTF-8 use two?

3 answers

UTF-8 uses the two high bits of each byte (bits 6 and 7) to signal whether more bytes follow; in a continuation byte, only the low 6 bits carry actual character data. This means that any character above 7F requires (at least) 2 bytes.
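To make that bit layout concrete, here is a minimal sketch in plain Python (standard library only) that pulls apart the two UTF-8 bytes of é; the variable names are mine, not from the article:

```python
# Decompose the UTF-8 encoding of é (U+00E9) into marker and data bits.
lead, cont = "é".encode("utf-8")         # b'\xc3\xa9' -> 0xC3, 0xA9

print(f"lead byte:         {lead:08b}")  # 11000011: '110' prefix = start of a 2-byte sequence
print(f"continuation byte: {cont:08b}")  # 10101001: '10' prefix = continuation byte

# Drop the prefixes and concatenate the remaining data bits:
code_point = ((lead & 0b0001_1111) << 6) | (cont & 0b0011_1111)
print(hex(code_point))                   # 0xe9, i.e. U+00E9
```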


One byte can contain only 256 different values.

This means that an encoding representing each character as a single byte, such as ISO-8859-1, cannot encode more than 256 different characters. That is why you cannot use ISO-8859-1 to correctly write Arabic, Japanese, or many other languages: there are only 256 slots, and they are already taken by other characters.
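You can see that limit directly in Python, for example; é has a slot in ISO-8859-1, but a Japanese character does not:

```python
# ISO-8859-1 maps each character to exactly one byte, so only 256
# characters fit. Anything outside that set cannot be encoded at all.
print("é".encode("iso-8859-1"))   # b'\xe9' -- é has a slot, one byte

try:
    "日".encode("iso-8859-1")     # Japanese character: no slot available
except UnicodeEncodeError as e:
    print(e)                      # 'latin-1' codec can't encode character ...
```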

UTF-8, on the other hand, must be able to represent all of Unicode's more than a million possible code points. That makes it impossible to squeeze every character into a single byte.

The designers of UTF-8 decided to make all ASCII characters (U+0000 to U+007F) representable with a single byte, and required all other characters to be stored as two or more bytes. Had they reserved single-byte representations for more characters, the encodings of everything else would have been longer and more complex.
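As a quick illustration of the resulting variable-length scheme (plain Python; bytes.hex with a separator needs Python 3.8+), compare the encoded lengths of a few characters:

```python
# ASCII stays at one byte; everything else grows to 2, 3, or 4 bytes.
for ch in ["A", "é", "अ", "日", "😀"]:
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {data.hex(' ')}")
```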

If you want a visual explanation of why bytes above 7F do not stand for the corresponding 8859-1 characters, look at the UTF-8 encoding table on Wikipedia. You will see that every byte value outside the ASCII range either already has a meaning in the encoding or is deliberately left invalid. There is simply no room in the table for those bytes to also represent their 8859-1 equivalents, and assigning them extra meanings would break several important properties of UTF-8.
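A quick way to convince yourself, again in plain Python: a lone E9 byte is valid ISO-8859-1 but not valid UTF-8, and reading UTF-8 bytes as ISO-8859-1 produces mojibake:

```python
# A lone byte E9 is ISO-8859-1 for é, but it is not valid UTF-8:
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)                 # 'utf-8' codec can't decode byte 0xe9 ...

# Conversely, reading the UTF-8 bytes for é as ISO-8859-1 yields mojibake:
print(b"\xc3\xa9".decode("iso-8859-1"))   # Ã© -- the classic two-character garble
```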


Because a single byte is simply not enough to encode all the letters of all alphabets. One byte spans 00..FF, which is 16² = 256 possible values; two bytes span 0000..FFFF, which is 16⁴ = 65536 values.
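As a sanity check of that arithmetic in Python:

```python
# Number of distinct values representable in one and two bytes:
print(16 ** 2)   # 256   (one byte, hex 00..FF)
print(16 ** 4)   # 65536 (two bytes, hex 0000..FFFF)
```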

