ASCII
ASCII was more or less the first character encoding. In an era when a byte was very expensive and 1 MHz was extremely fast, it covered only the characters that appeared on old US typewriters, which fit in 7 bits (128 code points). Using the eighth bit gives Extended ASCII, which has room for 256 characters. Most of the remaining room was used for special characters, such as characters with diacritics and line-drawing characters. But since everyone used the remaining room in their own way (IBM, Commodore, universities, organizations, etc.), the results were not interchangeable. Characters that were originally encoded using encoding X show up as mojibake when they are decoded using a different encoding Y. Later, ISO came up with standard character encoding definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 family of standards built on top of ASCII, such as ISO 8859-1, which made text much more interchangeable.
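The mojibake effect described above can be reproduced in a few lines. This is a minimal sketch in Python (the text itself names no language, so that choice is an assumption), with ISO 8859-1 standing in for encoding X and IBM code page 437 standing in for encoding Y; both were picked purely for illustration.

```python
# Sketch: one byte, two meanings. ISO 8859-1 plays "encoding X" and
# IBM code page 437 plays "encoding Y" (both chosen for illustration).
text = "caf\u00e9"  # "café"

# Plain 7-bit ASCII has no code for "é", so encoding fails outright.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An 8-bit extension stores "é" in the upper half (values 128-255).
raw = text.encode("iso-8859-1")

# Decoding the same bytes with a different 8-bit encoding yields mojibake.
mojibake = raw.decode("cp437")

# Decoding with the encoding that was used for writing restores the text.
restored = raw.decode("iso-8859-1")

print(ascii_ok, list(raw), repr(mojibake), repr(restored))
```

The bytes never change between the two decode calls; only the table used to interpret them does, which is why the encoding always has to travel with the data.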
Unicode
8 bits may be enough for languages that use the Latin alphabet, but it is certainly not enough for the non-Latin languages and scripts of the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etc., let alone for including them all in only 8 bits. They developed their own non-ISO character encodings, which were once again not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etc. Finally, a new character encoding standard was established on top of ISO 8859-1 to cover all characters used in the world, so that text is interchangeable everywhere: Unicode. It provides room for more than a million characters, of which roughly 10% is currently filled. The UTF-8 character encoding is based on Unicode.
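To make the relationship between ISO 8859-1, Unicode, and UTF-8 concrete, here is a small sketch in Python (an assumed example language). It checks that the first 256 Unicode code points coincide with ISO 8859-1, computes the total room Unicode offers, and shows how many bytes UTF-8 needs for a few arbitrarily chosen sample characters.

```python
# Sketch: Unicode extends ISO 8859-1, and UTF-8 is a variable-length
# encoding of Unicode code points. Sample characters are arbitrary.

# Every ISO 8859-1 byte value decodes to the code point with the same number.
latin1_matches = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# Total room in Unicode: code points U+0000 through U+10FFFF.
total_code_points = 0x10FFFF + 1

# UTF-8 uses between 1 and 4 bytes per code point.
samples = [
    "\u0041",      # LATIN CAPITAL LETTER A
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # EURO SIGN
    "\U0001d11e",  # MUSICAL SYMBOL G CLEF
]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(latin1_matches, total_code_points, utf8_lengths)
```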
Unicode Planes
Unicode characters are categorized in seventeen planes, each of which provides room for 65,536 characters (16 bits).
- Plane 0: Basic Multilingual Plane (BMP), contains the characters of nearly all modern languages in the world.
- Plane 1: Supplementary Multilingual Plane (SMP), contains historic languages and scripts, as well as multilingual musical and mathematical symbols.
- Plane 2: Supplementary Ideographic Plane (SIP), contains the "special" CJK (Chinese / Japanese / Korean) characters, of which there are quite a few, but which are very rarely used in modern writing. The "normal" CJK characters are already present in the BMP.
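- Planes 3-13: largely unassigned; plane 3 has since been designated the Tertiary Ideographic Plane (TIP) for further rare CJK characters.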
- Plane 14: Supplementary Special-purpose Plane (SSP), so far contains only some tag characters and glyph variation selectors. The tag characters are currently deprecated and may be removed in the future. The glyph variation selectors are meant to be used as a kind of metadata that you add to existing characters, which in turn can instruct the reader to render the character with a slightly different glyph.
- Planes 15-16: Private Use Planes (PUP), these allow (large) organizations or user initiatives to define their own special characters or symbols so that they can be interchanged among the parties that agree on them. For example, emoji (Japanese-style emoticons) were exchanged through private-use code points before they were standardized.
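Because each plane spans 65,536 code points, the plane number of a character is simply its code point divided by 65,536 (equivalently, shifted right by 16 bits). A minimal sketch in Python, where the helper name `plane_of` and the sample characters are made up for illustration:

```python
# Sketch: derive the plane number from a code point.
# `plane_of` is a hypothetical helper, not part of any library.
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(ch: str) -> int:
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(ch) // PLANE_SIZE


samples = {
    "Latin letter": "\u0041",            # U+0041
    "common CJK": "\u4e2d",              # U+4E2D
    "musical symbol": "\U0001d11e",      # U+1D11E
    "rare CJK": "\U00020000",            # U+20000
    "variation selector": "\U000e0100",  # U+E0100
    "private use": "\U000f0000",         # U+F0000
}
planes = {name: plane_of(ch) for name, ch in samples.items()}

# Seventeen planes of 65,536 code points give the total room in Unicode.
total_room = 17 * PLANE_SIZE

print(planes, total_room)
```

The two CJK samples land in different planes, which matches the split between "normal" and "special" CJK characters in the list above.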
Usually you only need to care about the BMP, and you would use UTF-8 as the standard character encoding throughout the entire application.
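As a rough illustration of that advice, the sketch below checks whether a string stays within the BMP and encodes it as UTF-8 either way. The helper name `is_bmp_only` is hypothetical, and Python is used only as an example language.

```python
# Sketch: check for BMP-only text and always encode as UTF-8.
# `is_bmp_only` is a hypothetical helper for illustration.

def is_bmp_only(text: str) -> bool:
    """True if every character lies in plane 0 (U+0000 through U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


plain = "na\u00efve \u4e2d\u6587"  # Latin with a diacritic plus common CJK
fancy = "clef \U0001d11e"          # contains a plane 1 character

# UTF-8 handles both strings; characters outside the BMP just take 4 bytes.
plain_bytes = plain.encode("utf-8")
fancy_bytes = fancy.encode("utf-8")

print(is_bmp_only(plain), is_bmp_only(fancy))
print(plain_bytes.decode("utf-8") == plain, fancy_bytes.decode("utf-8") == fancy)
```

UTF-8 itself does not care which plane a character comes from, so sticking to one encoding end to end matters more than restricting the character range.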
Balusc