If UTF-8 is an 8-bit encoding, why does it need 4 bytes?

On the Unicode site it is written that UTF-8 can be represented by 1-4 bytes. As I understand from this question (https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings), UTF-8 is an 8-bit encoding. So which is true? If it is an 8-bit encoding, then what is the difference between ASCII and UTF-8? And if it is not, then why is it called UTF-8, and why do we need UTF-16 and the others if they occupy the same memory?

+7
3 answers

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky, Wednesday, October 8, 2003

An excerpt from the article:

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings.)
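
To see the point about the bytes in practice, here is a quick Python sketch (not part of the quoted article; bytes.hex(" ") needs Python 3.8+ and just pretty-prints the bytes):

    # UTF-8 encoding of plain ASCII text is byte-for-byte identical to ASCII.
    print("Hello".encode("utf-8").hex(" "))   # 48 65 6c 6c 6f, same as ASCII

    # A code point above 127 needs more than one byte in UTF-8.
    print("é".encode("utf-8").hex(" "))       # c3 a9, U+00E9 takes two bytes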

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard, which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
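
The byte-order problem mentioned above is easy to see in code; a Python sketch, using the standard utf-16-be / utf-16-le codec names (not part of the quoted article):

    # The same code point stored big-endian vs. little-endian in UTF-16.
    print("H".encode("utf-16-be").hex(" "))   # 00 48
    print("H".encode("utf-16-le").hex(" "))   # 48 00

    # Plain "utf-16" prepends a byte order mark so readers can tell the two apart.
    print("H".encode("utf-16").hex(" "))      # ff fe 48 00 on a little-endian machine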

Actually, there are a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, and which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
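
Python happens to ship codecs for both of these, so the trade-offs are easy to verify (a sketch, not part of the quoted article; the exact UTF-7 byte string shown in the comment is indicative):

    # UTF-7 keeps the high bit clear: every output byte is 7-bit ASCII.
    data = "héllo".encode("utf-7")
    print(data)                          # something like b'h+AOk-llo'
    print(all(b < 128 for b in data))    # True

    # UTF-32 (UCS-4) spends 4 bytes on every code point, even ASCII ones.
    print(len("Hello".encode("utf-32-be")))   # 20 bytes for 5 characters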

And in fact, now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those Unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek encoding, or the Hebrew ANSI encoding, or any of the several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? →
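
That little question mark is exactly what you get from an encoder with a "replace" error policy; a Python sketch (not part of the quoted article):

    # "Hello" fits fine in ASCII: every code point has an equivalent.
    print("Hello".encode("ascii"))                       # b'Hello'

    # A Greek letter has no slot in ASCII, so it comes out as a question mark.
    print("Helloγ".encode("ascii", errors="replace"))    # b'Hello?'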

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF-7, 8, 16 and 32 all have the nice property of being able to store any code point correctly.
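
That last property is easy to check: the UTF encodings round-trip any text, while a legacy single-byte encoding quietly turns unsupported letters into question marks. A Python sketch (not part of the quoted article):

    text = "привет"   # Russian for "hello"

    # Windows-1252 has no Cyrillic letters, so every one of them becomes '?'.
    print(text.encode("cp1252", errors="replace"))   # b'??????'

    # Any of the UTF encodings stores and recovers the text losslessly.
    for codec in ("utf-7", "utf-8", "utf-16", "utf-32"):
        assert text.encode(codec).decode(codec) == text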

+15

An 8-bit encoding means that the individual units of the encoding are 8-bit bytes. In contrast, pure ASCII is a 7-bit encoding, since it only has codes 0-127. Software used to have problems with 8-bit encodings; one of the reasons for Base64 and uuencode was to get binary data through email systems that did not handle 8-bit data. However, it has been a decade or more since that stopped being a real problem: software now has to be 8-bit clean, or at least capable of handling 8-bit encodings.
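
Base64 exists precisely to push arbitrary 8-bit data through channels that only pass 7-bit text. A small Python sketch of that mapping:

    import base64

    # Raw bytes with the high bit set would not survive a 7-bit channel.
    raw = bytes([0x00, 0x7F, 0x80, 0xFF])

    # Base64 re-expresses them using only 7-bit ASCII characters.
    encoded = base64.b64encode(raw)
    print(encoded)                           # b'AH+A/w=='
    print(all(b < 128 for b in encoded))     # True
    print(base64.b64decode(encoded) == raw)  # True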

Unicode itself is a 21-bit character set. There are a number of encodings for it:

  • UTF-32, where each Unicode code point is stored in a 32-bit integer
  • UTF-16, where most Unicode code points are stored in a single 16-bit integer, but some require two 16-bit integers (so a code point requires 2 or 4 bytes).
  • UTF-8, where a Unicode code point may require 1, 2, 3, or 4 bytes.

So, "UTF-8 can be represented by 1-4 bytes" is probably not the most suitable way to formulate it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.

+12

UTF-8 is an 8-bit, variable-width encoding. The first 128 characters in Unicode, when encoded in UTF-8, are represented with exactly the same bytes as in ASCII.

To understand this further: Unicode treats characters as code points, which are plain numbers that can be represented in multiple ways (encodings). UTF-8 is one such encoding. It is the most commonly used because it gives the best space-consumption characteristics among all the encodings. If you store characters from the ASCII character set in UTF-8, the encoded data takes up the same amount of space. This allowed applications that previously used ASCII to move seamlessly (well, not entirely, but it didn't lead to something like Y2K) to Unicode, because the character representations are the same.
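
To see the "code point is just a number" view, and the ASCII compatibility it buys, a small Python sketch:

    # The code point (a plain number) is the same no matter which encoding you pick;
    # only the byte representation changes.
    print(ord("é"))                  # 233, i.e. U+00E9
    print("é".encode("utf-8"))       # b'\xc3\xa9'  (2 bytes)
    print("é".encode("utf-16-be"))   # b'\x00\xe9'  (2 bytes)

    # ASCII text costs exactly the same space in UTF-8 as in ASCII itself.
    print(len("Hello".encode("ascii")), len("Hello".encode("utf-8")))   # 5 5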

I will leave this excerpt from RFC 3629 showing how the UTF-8 encoding works:

    Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

You will notice that the encoding makes characters occupy from 1 to 4 bytes (right column), depending on which range of Unicode the character falls into (left column).
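
To make the table concrete, here is a hand-rolled encoder for a single code point that follows those bit patterns (a Python sketch for illustration only, ignoring the surrogate range; real code should just call str.encode("utf-8")):

    def utf8_encode_codepoint(cp: int) -> bytes:
        """Encode one Unicode code point using the RFC 3629 table above."""
        if cp <= 0x7F:        # 0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("not a Unicode code point")

    # Matches the built-in encoder, e.g. for U+20AC (the euro sign).
    assert utf8_encode_codepoint(0x20AC) == "€".encode("utf-8")   # b'\xe2\x82\xac'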

UTF-16, UTF-32, UCS-2, etc. use different coding schemes, where code points are represented as 16-bit or 32-bit code units instead of the 8-bit code units that UTF-8 uses.
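
For comparison, a Python sketch showing those code units at work (a code point above U+FFFF needs a surrogate pair, i.e. two 16-bit units, in UTF-16):

    # U+1F600 (a smiley) is above U+FFFF, so UTF-16 needs two 16-bit code units.
    s = "\U0001F600"
    print(s.encode("utf-16-be").hex(" "))   # d8 3d de 00  (a surrogate pair)
    print(s.encode("utf-32-be").hex(" "))   # 00 01 f6 00  (one 32-bit code unit)
    print(s.encode("utf-8").hex(" "))       # f0 9f 98 80  (four 8-bit code units)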

+10
