Probably time for a re-analysis, this time also taking XML 1.1 into account.
What control code points exist in Unicode?
- U+0000 to U+001F, inherited from ASCII.
- U+007F, inherited from ASCII.
- U+0080 to U+009F, inherited from Latin-1.
- Various special-purpose ranges, standardized explicitly for Unicode, which are mostly useful, especially in non-markup contexts. They are discussed here block by block, including reasons why and how to use them, or not to use them, in XML, and what to do if you come across them anyway.
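To make the three inherited ranges concrete, here is a minimal Python sketch. The function name `inherited_control_range` and its labels are my own, purely for illustration. It also checks the ranges against the Unicode general category `Cc` as reported by the standard `unicodedata` module, which covers the same 65 code points.

```python
import unicodedata
from typing import Optional


def inherited_control_range(cp: int) -> Optional[str]:
    """Return a label for the inherited control range of cp, or None."""
    if 0x00 <= cp <= 0x1F:
        return "C0 (inherited from ASCII)"
    if cp == 0x7F:
        return "DEL (inherited from ASCII)"
    if 0x80 <= cp <= 0x9F:
        return "C1 (inherited from Latin-1)"
    return None


# Every code point whose general category is Cc ("control").
cc_code_points = [
    cp for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Cc"
]

# The inherited ranges and the Cc category describe the same set.
inherited = [cp for cp in range(0x110000) if inherited_control_range(cp)]
print(len(cc_code_points), cc_code_points == inherited)
```

The special-purpose ranges from the last bullet are deliberately not covered: they live in other general categories and need a block-by-block treatment.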
How does XML view these control characters?
This is a different classification.
- Tab and newline (regardless of the platform dependence of the newline). Everyone uses them. Everyone knows what they should stand for. They are allowed in almost all known forms, often even for pretty-printing the markup itself.
- U+0000 is evil. Null character? String terminator? Binary noise? The opposite of both interoperability and markup. It is forbidden in all forms.
- Anything else? Underused and problematic for interoperability, but there are ways to tolerate them without even knowing what they are supposed to "control".
From here on we look only at this last category, the actual control codes. That is, the following summary does NOT apply to tabs and newlines: U+0009, U+000A, U+000D, U+0085, U+2028.
XML 1.0 allows all of the above control ranges, except U+0000 to U+001F, as text (directly included characters). XML 1.1 additionally allows even those (except for evil U+0000) as numeric character references. Allowing U+007F to U+009F directly was clearly an oversight, and this inconsistency was fixed in XML 1.1, but the other way round: those now have to be written as character references too. They even gave a detailed explanation inside the standard:
Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
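The rules can be summed up in a small lookup. This is only a sketch based on my reading of the Char production in XML 1.0 and the Char and RestrictedChar productions in XML 1.1; the function name `xml_status` and its three result strings are made up for illustration.

```python
def _in_char_ranges(cp: int, low: int) -> bool:
    # Shared tail of the Char production; only the lower bound differs
    # between XML 1.0 (0x20) and XML 1.1 (0x1).
    return (low <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def xml_status(cp: int, version: str = "1.0") -> str:
    """Return 'direct', 'reference-only' or 'forbidden' for a code point.

    'direct' means the character may appear literally (and therefore
    also as a character reference).
    """
    if version == "1.0":
        # In XML 1.0 a character reference must match Char as well,
        # so there is no 'reference-only' class.
        allowed = cp in (0x9, 0xA, 0xD) or _in_char_ranges(cp, 0x20)
        return "direct" if allowed else "forbidden"
    if version == "1.1":
        if not _in_char_ranges(cp, 0x1):
            return "forbidden"
        restricted = (0x1 <= cp <= 0x8 or cp in (0xB, 0xC)
                      or 0xE <= cp <= 0x1F
                      or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)
        return "reference-only" if restricted else "direct"
    raise ValueError("unknown XML version: " + version)


for cp in (0x0, 0x1, 0x9, 0x7F, 0x85, 0x9F):
    print(f"U+{cp:04X}", xml_status(cp, "1.0"), xml_status(cp, "1.1"))
```

Note how U+0085 stays directly usable in XML 1.1: it counts as a newline, which is the "whitespace characters are exempt" remark in the quote.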
Why do Unicode and XML freely allow markup-like characters, apart from a few "inherited" ranges? People should use markup for them.
Unicode is also used in non-markup contexts, and it is still an evolving character set. It would be too difficult to implement a conforming XML processor if the set of disallowed characters were a moving target.
OK, what's wrong with the inherited ranges, compared to the Unicode-specific control characters?
Lack of standardization. The Unicode Consortium did not really get to choose which numbers are assigned to these "characters", or what their typical visual presentation or meaning is. Full backward compatibility with ASCII (at the UTF-8 encoded level) and with Latin-1 (at the code point assignment level) forced these code points in, regardless of the various specialized and overloaded meanings often associated with them in different text processing contexts.
Wait, you are saying that XML is not supposed to be fully backward compatible with ASCII, unlike UTF-8?
Yes. That's right. You need a document element. You cannot even include a raw < or &. So why would you ever need to include raw control characters?
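A quick way to see this in practice is to feed a few strings to an XML 1.0 parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `parses` is just an illustrative name, and other parsers may word their errors differently but should agree on what is well-formed.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if doc is accepted as a well-formed XML document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


samples = {
    "plain ASCII, no element": "just some text",
    "raw <": "<a>1 < 2</a>",
    "raw &": "<a>this & that</a>",
    "escaped &": "<a>this &amp; that</a>",
    "raw tab": "<a>\t</a>",
    "raw U+0001": "<a>\x01</a>",
    "reference to U+0001": "<a>&#x1;</a>",
}

for label, doc in samples.items():
    print(f"{label:26} {'ok' if parses(doc) else 'rejected'}")
```

Plain ASCII text is already not an XML document, so a raw control character being rejected is nothing special.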