Unicode: English characters above code point 127

I am giving a technical talk about Unicode and encodings at my company, in which I try to show that strings are always encoded, and that developers should never carelessly assume that everything is 0-127 ASCII.

I have numerous examples of problems caused by incorrect text encoding, but I could not find an example of plain English text with characters encoded above Unicode code point 127.

The basic English alphabet is mapped in Unicode to the same numerical values as in plain old ASCII: [A-Z] is mapped to [65-90] (hex [0x41-0x5a]), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
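
The ranges above are easy to check. This is a minimal Python sketch; the variable names are only illustrative.

```python
import string

# Code points of the basic English alphabet.
upper = [ord(c) for c in string.ascii_uppercase]
lower = [ord(c) for c in string.ascii_lowercase]

print(upper[0], upper[-1])  # 65 90
print(lower[0], lower[-1])  # 97 122

# For these characters, ASCII and UTF-8 produce identical bytes,
# which is why ASCII-only assumptions often go unnoticed.
text = "Hello"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)  # True
```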

Does the English alphabet appear anywhere else in the code charts? I do not mean accented letters or other Latin variants, just the plain English alphabet.

+4
3 answers

CJK characters are generally monospaced in all fonts, as that is how these languages are usually written.

When mixing CJK and English characters, however, a problem arises: ASCII characters usually do not have the width of a CJK character. This means that if you use ASCII, you lose the monospace property, which may not always be desirable.

To this end, full-width characters (U+FF00-FFEE; see Wikipedia and the Unicode code chart) can be used in place of the "regular" characters. Each of them has the same width as a single CJK character.

Please note, however, that full-width characters are almost never used outside a CJK context, and even there regular ASCII is often used when monospacing is not considered essential.
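
As a rough illustration of the full-width forms described above, the sketch below shifts printable ASCII into the full-width range. The helper `to_fullwidth` and the constant `FULLWIDTH_OFFSET` are hypothetical names introduced here. The fixed offset relies on the full-width forms of U+0021-U+007E being laid out in the same order at U+FF01-U+FF5E, and the space character is deliberately left untouched.

```python
import unicodedata

# Distance between an ASCII character and its full-width form
# (U+FF01 - U+0021). Hypothetical constant name.
FULLWIDTH_OFFSET = 0xFEE0

def to_fullwidth(text):
    """Replace printable ASCII (except space) with full-width forms."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        else:
            out.append(ch)  # leave everything else unchanged
    return "".join(out)

wide = to_fullwidth("ABC abc")
print([hex(ord(c)) for c in wide])

# 'F' marks a full-width character in the East Asian Width property.
print(unicodedata.east_asian_width(wide[0]))  # F

# NFKC normalization folds full-width letters back to plain ASCII.
back = unicodedata.normalize("NFKC", wide)
print(back)  # ABC abc
```

So the English alphabet does exist above code point 127, but only as compatibility characters that normalization maps back to the ASCII letters.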

+5

Many punctuation marks and symbols have code point values above U+007F:

  • “Hello.”
  • He was provided with a comprehensive Crayola box of sixty-four crayons—including gold and silver crayons—and did not allow me to look.
  • x ≠ y

The above examples use:

  • U+201C and U+201D - smart quotes
  • U+2014 - em-dash
  • U+2260 - not equal

See Unicode charts for more details.
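
To find such characters in ordinary-looking English text, a short Python sketch like the following may help. The helper `non_ascii` and the sample string are made up for illustration. The sample uses escape sequences so that the source itself stays pure ASCII.

```python
import unicodedata

def non_ascii(text):
    """List each character above code point 127 with its code point and name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# Smart quotes, a not-equal sign and an em-dash.
sample = "\u201cHello.\u201d x \u2260 y \u2014 done"
found = non_ascii(sample)
for ch, code_point, name in found:
    print(code_point, name)

# Encoding such text as ASCII fails outright.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```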

+3

Well, if you just mean a-z and A-Z, then no, there are no English characters above 127. But words such as fiancé, resumé, etc. are sometimes spelled that way in English and use code points above 127.

Then there are various punctuation marks, currency symbols, etc. that are above 127. I am not sure whether that counts as plain English text.
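
A word like fiancé is enough to show why the encoding matters. This is a minimal Python sketch; the choice of UTF-8 and Latin-1 is only an example pair of encodings.

```python
word = "fianc\u00e9"  # "fiancé"; U+00E9 is above 127

utf8 = word.encode("utf-8")      # é becomes two bytes
latin1 = word.encode("latin-1")  # é becomes one byte
print(utf8)    # b'fianc\xc3\xa9'
print(latin1)  # b'fianc\xe9'

# Decoding UTF-8 bytes with the wrong encoding gives mojibake
# instead of an error, which is how such bugs slip through.
mojibake = utf8.decode("latin-1")
print(mojibake)
```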

+2