Codepage terminology and concepts

I've been studying code pages and keep coming across conflicting uses of terminology, even among the various Wikipedia entries, and I can't find a single source that explains the whole process of representing characters from beginning to end. Could someone well versed in this area point out where the following understanding is inaccurate or incorrect?

The process of representing a character, as I understand it:

  • We start with character repertoires (I'm not sure of the correct terminology here; possibly "scripts") that are not tied to any particular platform. For example, "the Cyrillic alphabet" refers to the same thing on Windows as it does on Linux.

  • Vendors then select members of these repertoires, usually in bundles, to form a character set for a particular platform. The platform may assign these sets various identifiers, such as the GDI charset values on Windows (for example, 0 for ANSI_CHARSET, and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes ). I can't find much information about these sets, for example whether they are actually coded character sets or whether they are simply unordered and abstract.

  • Separate code pages are developed from these sets, and they appear to map one-to-one to the GDI charset values. Since those GDI values are platform-specific groupings, does this mean that Windows code pages are essentially an encoded version of each individual set?

I'm having trouble reconciling this idea with a link someone showed me earlier (which I have since lost) that showed a one-to-many mapping between these GDI charset values and code pages across different platforms. How accurate is that? Do these GDI values denote character sets from which different code pages can be developed on different platforms?

  • Each code page maps a member of an abstract character set to an integer that represents its position in the set. In the case of the "simpler" code pages mentioned on the page above, could they be described by the arguably more precise term "character map"? Is that a distinction worth keeping in mind, or is it too subtle to matter?

  • A font resolves a code point to a glyph if it contains one for that code point; otherwise it reports an error. I have also read that a font can return its own blank glyph for code points it does not support. Can an application distinguish between that blank glyph and a successful resolution, i.e. does the font return some kind of error code along with the blank glyph? (A sketch of what I would like to be able to do follows this list.)
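
To make that last question concrete, here is roughly what I imagine an application doing. This is only a sketch built on my assumptions: Python with ctypes on Windows, and my guess that GDI's GetGlyphIndicesW with the GGI_MARK_NONEXISTING_GLYPHS flag (missing glyphs reported as 0xFFFF) is the relevant call; "Arial" and the sample text are arbitrary.

    import ctypes
    from ctypes import wintypes

    user32 = ctypes.WinDLL("user32")
    gdi32 = ctypes.WinDLL("gdi32")

    # 64-bit-safe signatures for the handle-returning/consuming calls
    user32.GetDC.restype = wintypes.HDC
    user32.GetDC.argtypes = [wintypes.HWND]
    user32.ReleaseDC.argtypes = [wintypes.HWND, wintypes.HDC]
    gdi32.CreateFontW.restype = wintypes.HFONT
    gdi32.SelectObject.restype = wintypes.HGDIOBJ
    gdi32.SelectObject.argtypes = [wintypes.HDC, wintypes.HGDIOBJ]
    gdi32.DeleteObject.argtypes = [wintypes.HGDIOBJ]
    gdi32.GetGlyphIndicesW.argtypes = [wintypes.HDC, wintypes.LPCWSTR, ctypes.c_int,
                                       ctypes.POINTER(wintypes.WORD), wintypes.DWORD]

    GGI_MARK_NONEXISTING_GLYPHS = 0x0001   # ask GDI to flag unsupported code points
    DEFAULT_CHARSET = 1

    def glyph_coverage(face_name, text):
        """Return (character, has_glyph) pairs for the BMP code points in `text`."""
        hdc = user32.GetDC(None)
        hfont = gdi32.CreateFontW(0, 0, 0, 0, 400, 0, 0, 0, DEFAULT_CHARSET,
                                  0, 0, 0, 0, face_name)
        old = gdi32.SelectObject(hdc, hfont)
        indices = (wintypes.WORD * len(text))()
        gdi32.GetGlyphIndicesW(hdc, text, len(text), indices,
                               GGI_MARK_NONEXISTING_GLYPHS)
        gdi32.SelectObject(hdc, old)
        gdi32.DeleteObject(hfont)
        user32.ReleaseDC(None, hdc)
        # 0xFFFF means the font has no glyph for that code point
        # (it would otherwise render its blank/.notdef glyph)
        return [(ch, idx != 0xFFFF) for ch, idx in zip(text, indices)]

    print(glyph_coverage("Arial", "A\u0436\u2603"))   # Latin A, Cyrillic zhe, snowman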

That should give you an idea of the extent of my confusion. Any clarification would be invaluable. Thanks in advance.


You are essentially right:

  • Start with the universe of known characters (the repertoire).
  • Select a subset of those characters (a character set).
  • Map them to bit patterns (a code page and an encoding).
  • Get them onto an output device by pairing each character with a glyph, i.e. using a font, a bit pattern, and the code page/encoding that maps the bit pattern back to a character (a short Python sketch of these steps follows the list).
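
A minimal Python sketch of the first three steps (the character, code pages, and encodings are just illustrative examples):

    # One abstract character, several different bit-pattern assignments.
    ch = "Я"                      # CYRILLIC CAPITAL LETTER YA
    print(hex(ord(ch)))           # Unicode code point: 0x42f
    print(ch.encode("cp1251"))    # Windows code page 1251: b'\xdf'
    print(ch.encode("koi8_r"))    # KOI8-R: b'\xf1'
    print(ch.encode("utf-8"))     # UTF-8 bytes for the code point: b'\xd0\xaf'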

All platforms have similar code pages, and many of those code pages even assign the same values to the same characters. For example, the basic Latin characters have the same values in Windows-1252, Mac Roman, and Unicode for the first 128 values (the ASCII range). Code pages are also standardized (for example, http://en.wikipedia.org/wiki/Shift_JIS for Japanese) so that machines can interoperate.
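
A quick way to see this (Python codecs here purely as an illustration; the specific code pages are just examples):

    # ASCII characters get the same byte value in all of these;
    # a non-ASCII character differs per code page, or is simply absent
    # (Shift_JIS has no é, so it is replaced here).
    for codec in ("cp1252", "mac_roman", "utf-8", "shift_jis"):
        print(codec, "A".encode(codec), "é".encode(codec, errors="replace"))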

As a rule, for new development you should use Unicode with one of its popular encodings. UTF-8 is the usual choice on most modern systems; UTF-16LE is what the Windows system calls ending in W expect.
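
For example (Python again, just to show the byte-level difference; the MessageBoxW line is only an illustration of a W call):

    s = "Héllo, Я"
    print(s.encode("utf-8"))      # b'H\xc3\xa9llo, \xd0\xaf'
    print(s.encode("utf-16-le"))  # two bytes per BMP character, as the *W APIs expect
    # ctypes, for instance, hands Python strings to W functions as UTF-16:
    # ctypes.windll.user32.MessageBoxW(None, s, "encoding demo", 0)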
