Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null terminator will not truncate strings, because a multi-byte UTF-8 sequence never contains a zero byte.)
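If you want to see this for yourself, here's a quick sketch using Python's built-in codecs (the hex-dump-with-spaces trick, bytes.hex(" "), assumes Python 3.8 or later):

    text = "Hello"
    print(text.encode("utf-8").hex(" "))   # 48 65 6c 6c 6f -- byte-for-byte identical to ASCII
    print(text.encode("ascii").hex(" "))   # 48 65 6c 6c 6f

    # Code points above 127 need more than one byte each:
    print("é".encode("utf-8").hex(" "))    # c3 a9     -- U+00E9 takes two bytes
    print("中".encode("utf-8").hex(" "))   # e4 b8 ad  -- U+4E2D takes three bytes

    # And none of the bytes in a multi-byte sequence is ever zero,
    # which is why null-terminator-minded old code survives:
    assert 0 not in "héllo".encode("utf-8")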
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out whether it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard, which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that anything other than ASCII exists.
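Here's a little sketch of those three side by side, again in Python, so the endianness business is visible (the plain "utf-16" codec prepends a byte order mark, FF FE or FE FF, to announce which flavor you got):

    text = "Hello"
    print(text.encode("utf-16-be").hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f  (high byte first)
    print(text.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00  (low byte first)
    print(text.encode("utf-16").hex(" "))     # BOM first, then your platform's
                                              # byte order (little-endian on most machines)
    print(text.encode("utf-8").hex(" "))      # 48 65 6c 6c 6f                 (plain ASCII bytes)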
In fact, there are lots of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, and which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
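Both of those claims are easy to check; a minimal sketch, assuming a Python build with the standard utf-7 and utf-32 codecs:

    # UTF-7 keeps every byte below 128, even for non-ASCII text:
    encoded = "héllo".encode("utf-7")
    print(encoded)                         # b'h+AOk-llo' -- the é smuggled through in base64
    assert all(b < 128 for b in encoded)   # the high bit really is always zero

    # UCS-4 (a.k.a. UTF-32) spends four bytes on every code point, no exceptions:
    print("Hello".encode("utf-32-be").hex(" "))
    # 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f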
And in fact, now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those Unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek encoding, or the ANSI Hebrew encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box.
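You can watch that catch happen; a short sketch (errors="replace" tells Python's encoder to substitute the question mark itself rather than raise an error):

    # "Hello" fits comfortably in plain old ASCII:
    print("Hello".encode("ascii"))                    # b'Hello'

    # But a code point with no equivalent in the target encoding is lost;
    # here U+03B4 (Greek small letter delta) becomes a question mark:
    print("Hellδ".encode("ascii", errors="replace"))  # b'Hell?'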
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF-7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
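For example (the Russian word below is just an illustration; any Cyrillic or Hebrew text would behave the same way):

    russian = "Привет"  # "Hello" in Russian

    # Latin-1 has no Cyrillic letters, so every character degrades to '?':
    print(russian.encode("latin-1", errors="replace"))  # b'??????'

    # Any of the UTF encodings round-trips it intact:
    for codec in ("utf-7", "utf-8", "utf-16", "utf-32"):
        assert russian.encode(codec).decode(codec) == russian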