Emacs 23 uses a character set four times the size of Unicode - why?

From the Emacs 23.1 NEWS:

*** The Emacs character set is now a superset of Unicode. (It has about four times the code space, which should be plenty.)

And more details later:

*** In multibyte buffers and strings, characters are represented by UTF-8 byte sequences. The character code space is now 0x0..0x3FFFFF with no gap; codes 0x0..0x10FFFF are Unicode characters of the same code points, while codes 0x3FFF80..0x3FFFFF are raw 8-bit bytes.

According to Wikipedia, the BMP of the UCS has 65,536 code points, the latest version of Unicode contains more than 107,000 characters, and the UCS as a whole has more than a million code points. 0x3FFFFF is over four million.
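To make the sizes concrete, here is a rough sketch of the arithmetic in Python. The three constants come straight from the NEWS entry quoted above; the variable names are my own and purely illustrative.

```python
# Code-space sizes implied by the Emacs 23.1 NEWS entry.
# The constants are from the NEWS text; the names are illustrative only.
EMACS_MAX = 0x3FFFFF       # highest Emacs character code
UNICODE_MAX = 0x10FFFF     # highest Unicode code point
RAW_BYTE_FIRST = 0x3FFF80  # first code reserved for raw 8-bit bytes

emacs_size = EMACS_MAX + 1                      # codes 0x0..0x3FFFFF
unicode_size = UNICODE_MAX + 1                  # codes 0x0..0x10FFFF
raw_byte_size = EMACS_MAX - RAW_BYTE_FIRST + 1  # codes 0x3FFF80..0x3FFFFF
ratio = emacs_size / unicode_size

print(emacs_size, unicode_size, raw_byte_size, round(ratio, 2))
# prints: 4194304 1114112 128 3.76
```

So "about four times" is really a factor of roughly 3.76, and the raw-byte range at the top holds only 128 codes.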

What problems does this solve? Or, put another way, why is it useful to have an internal character set that is a superset of Unicode?

+4
1 answer

Unicode is designed to cover the character sets needed for all human languages, which is certainly useful for globalizing / localizing your code. But since Emacs is a tool of the gods themselves, it must also cover every character that might be used by all deities (including, but not limited to, the runes of the elven Great Ancients), spacefaring races (including, but not limited to, our future alien overlords), superintelligent machine intelligences (including, but not limited to, our future robot masters), and any other being who desires infinite cosmic power. This is potentially a lot of characters!

Or it may simply be that UTF-8, as an encoding scheme, has room for far more codes than Unicode actually occupies, and Emacs just uses more of that room. But I prefer my explanation above.
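For the boring explanation, the headroom is easy to check. The sketch below uses the generic UTF-8 bit pattern as originally designed (sequences of 1 to 6 bytes); it is not a claim about how Emacs lays out its bytes internally, and the function names are made up for illustration.

```python
def utf8_payload_bits(length):
    """Payload bits carried by a UTF-8-style sequence of `length` bytes (1..6)."""
    if length == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends `length` one-bits plus a zero on the length marker,
    # leaving 7 - length payload bits; each continuation byte (10xxxxxx) adds 6.
    return (7 - length) + 6 * (length - 1)


def bytes_needed(code):
    """Shortest generic UTF-8-style sequence length that can hold `code`."""
    for length in range(1, 7):
        if code < (1 << utf8_payload_bits(length)):
            return length
    raise ValueError("code does not fit in 6 bytes")


for length in range(1, 7):
    print(length, "bytes:", utf8_payload_bits(length), "payload bits")

print(bytes_needed(0x10FFFF))  # top of Unicode
print(bytes_needed(0x3FFFFF))  # top of the Emacs code space
```

Under this pattern a 4-byte sequence already reaches 0x1FFFFF, so Unicode's 0x10FFFF fits with room to spare, while 0x3FFFFF needs a 5-byte form.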

+23