Difficulties inherent in ASCII and extended ASCII, and Unicode compatibility?

What are the difficulties inherent in ASCII and Extended ASCII, and how are these difficulties overcome by Unicode?

Can someone explain Unicode compatibility to me?

And what do Unicode-related terms mean, such as planes, Basic Multilingual Plane (BMP), Supplementary Multilingual Plane (SMP), Supplementary Ideographic Plane (SIP), Supplementary Special-purpose Plane (SSP), and Private Use Planes (PUP)?

I find all these terms very confusing.

unicode character-encoding ascii extended-ascii
1 answer

ASCII

ASCII was more or less the first character encoding. In an age when a byte was very expensive and 1 MHz was extremely fast, it covered only the characters that appeared on those ancient US typewriters: the Latin letters, the digits, punctuation and a handful of symbols and control codes. Together these fill 7 bits (128 characters), which is half of what a byte can hold.

Extended ASCII uses the remaining bit, giving room for 256 characters in total. Most of the extra room is used by special characters, such as letters with diacritics and line-drawing characters. But since everyone used that room in their own way (IBM, Commodore, universities, organizations, etc.), the result was not interchangeable. Characters that were encoded using encoding X show up as mojibake when they are decoded using a different encoding Y. Later, ISO came up with standard definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 family of character encodings built on top of ASCII, such as ISO 8859-1, which made text much more interchangeable.
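The mojibake effect is easy to reproduce. The following is a minimal sketch, not a definitive demonstration: the byte value 0xE9 and the three encodings (ISO 8859-1, the IBM PC code page 437, and KOI8-R) are illustrative choices that do not come from the text above. It decodes the same byte three ways and prints the Unicode name of each result.

```python
import unicodedata

# One byte above 127: plain ASCII does not define it, and every
# 8-bit extension is free to give it a different meaning.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")  # ISO 8859-1 (Western European)
as_cp437 = raw.decode("cp437")        # original IBM PC code page
as_koi8r = raw.decode("koi8_r")       # KOI8-R (Cyrillic)

for label, ch in [("ISO 8859-1", as_latin1), ("CP437", as_cp437), ("KOI8-R", as_koi8r)]:
    print(f"{label:<11} 0xE9 -> U+{ord(ch):04X} {unicodedata.name(ch)}")

# Strict 7-bit ASCII rejects the byte outright.
ascii_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
    print("ASCII cannot decode it:", exc.reason)

# Mojibake: text written with encoding X, read back with encoding Y.
original = "caf\u00e9"  # "cafe" with an acute accent on the e
garbled = original.encode("iso-8859-1").decode("cp437")
print("text survived the round trip:", garbled == original)
```

The ASCII part of the text ("caf") survives because all three encodings agree on the first 128 values; only the byte in the upper half is misread.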

Unicode

8 bits may be enough for languages that use the Latin alphabet, but it is certainly not enough for the rest of the world's languages and scripts, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etc., let alone for fitting them all into the same 8 bits. Their users developed their own non-ISO character encodings, which were again not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etc. Finally, a new character encoding standard was established on top of ISO 8859-1 (its first 256 code points are identical) to cover every character used in the world, so that text is interchangeable everywhere: Unicode. It provides room for more than a million characters (1,114,112 code points), of which only a small fraction, on the order of 10%, is currently assigned. The UTF-8 character encoding is based on Unicode.
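The relationship between ASCII, ISO 8859-1, Unicode and UTF-8 can be checked directly. This is a small sketch under the assumption that Python's built-in codecs are an acceptable reference; the four sample characters are arbitrary picks for illustration. It shows that a Unicode code point is just a number, and that UTF-8 stores it in one to four bytes.

```python
# Four sample characters, written as escapes so the snippet itself is pure ASCII:
# Latin A, e with acute accent, a common CJK ideograph, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8)} byte(s) in UTF-8: {utf8.hex()}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The first 256 Unicode code points coincide with ISO 8859-1 ...
latin1_matches = all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))

# ... and the first 128 are stored by UTF-8 exactly as ASCII stores them.
ascii_matches = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))

print("ISO 8859-1 is a subset of Unicode:", latin1_matches)
print("ASCII text is already valid UTF-8:", ascii_matches)
```

This is why an existing pure-ASCII file can be read as UTF-8 without any conversion, while anything beyond ASCII needs the encoding to be stated explicitly.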

Unicode Planes

Unicode code points are divided into seventeen planes, each of which provides room for 65,536 characters (16 bits).

  • Plane 0: Basic Multilingual Plane (BMP), contains the characters of almost all modern languages in the world.
  • Plane 1: Supplementary Multilingual Plane (SMP), contains historical languages / scripts, as well as musical and mathematical symbols.
  • Plane 2: Supplementary Ideographic Plane (SIP), contains the "special" CJK (Chinese / Japanese / Korean) characters, of which there are quite a few, but which are very rarely used in modern writing. The "normal" CJK characters are already present in the BMP.
  • Planes 3-13: mostly unassigned (plane 3 has since been designated the Tertiary Ideographic Plane for additional rare ideographs).
  • Plane 14: Supplementary Special-purpose Plane (SSP), contains only some tag characters and glyph variation selectors. The tag characters are deprecated for language tagging. Glyph variation selectors act as a kind of metadata that you append to an existing character, which in turn can instruct the reader to render that character with a slightly different glyph.
  • Planes 15-16: Private Use Planes (PUP), these let (large) organizations or user initiatives assign their own special characters or symbols without colliding with the standard ones, so that they are interchangeable among the parties that agree on them. Emoji (Japanese-style emoticons), for example, started out as vendor-specific private assignments.
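Since every plane spans 65,536 code points, the plane of a character is simply its code point divided by 65,536. A minimal sketch of that arithmetic follows; the names `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are hypothetical helpers made up for this example, not part of any library.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16

def plane_of(ch: str) -> int:
    """Return the number of the Unicode plane that a single character lives in."""
    return ord(ch) // PLANE_SIZE

# One sample code point per plane mentioned in the list above.
examples = {
    "Latin A (BMP)": "A",
    "CJK ideograph (BMP)": "\u4e2d",
    "musical G clef (SMP)": "\U0001D11E",
    "rare CJK ideograph (SIP)": "\U00020000",
    "variation selector (SSP)": "\U000E0100",
    "private use, plane 15": "\U000F0000",
    "last code point, plane 16": "\U0010FFFF",
}

for label, ch in examples.items():
    print(f"U+{ord(ch):06X}  plane {plane_of(ch):2d}  {label}")

# 17 planes of 65,536 code points give the "more than a million" total.
total_code_points = PLANE_COUNT * PLANE_SIZE
print("total code points:", total_code_points)
```

The same function can serve as a quick check of whether a string stays within the BMP, for example `all(plane_of(c) == 0 for c in text)`.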

Usually you only need to care about the BMP, and about using UTF-8 as the standard character encoding throughout your application.
