Why are control characters illegal in XML 1.0?

There are many characters that cannot be encoded in XML 1.0, for example U+0007 ('bell') and U+001B ('escape'). Most of them are non-whitespace control characters.

From (e.g.) this question it is clear that the XML specification is the issue, but can someone tell me why the XML specification forbids these characters?

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B; respectively, but perhaps there is a practical reason why the characters were forbidden rather than required to be escaped?

Answerers have suggested that there is some motivation to avoid transmission control characters, but Unicode includes many other control-type characters (consider U+200C "zero width non-joiner"). I admit that there may be no good reason for this behavior, but I would still like to understand it better.

This is especially annoying because when these character values appear in other encodings or data formats, I end up having to "double-escape" them in the new XML documents that should encode them.

+54
xml unicode history
Dec 31 '09 at 21:48
6 answers

My understanding is that this range is prohibited on the grounds that a markup language should not have any need to support transmission and flow control characters, and that including them would create problems for editors and parsers in binary conversion.

I'm struggling to find anything ex cathedra from Tim Bray et al, though.

edit: here is some discussion of control characters, with a vague admission that the decision was not too thoroughly thought through:

At 09:27 on 17/06/00 -0500, Mark Volkmann wrote:

I have never seen a discussion of the reason why most ASCII control characters, such as form feed, are not allowed in XML documents. Can someone tell me the reason for this decision or point me to a specification that explains it?

I'm not sure that we would do it the same way if we were doing it again. I don't see that they really do any harm. It's clear that if you are optimizing for a highly interoperable content markup language (and XML is), it is legitimate to be suspicious of things like vertical tab and backspace and so on... but then how come we left in \n and DEL and so on? -Tim

+19
Dec 31 '09 at 22:42

It was a long time ago, but my best recollection is that they had no graphical representation, nor did they have consistent semantics. Picking a couple at random, we see U+0006 "Acknowledge" or U+0016 "Synchronous Idle"... what do these mean? Unicode does not say. Even when everyone claimed to support ASCII, there was no interoperability around this junk. XML is supposed to be about interoperability.

My experience is that people who want to use these things really want to jam binary data into their XML elements (and the next thing they want to do is include U+0000 NULL), which has been an explicit non-goal of XML from day one. If you want to represent the numbers 0x6 or 0x16, there are many good ways to do that which do not involve mutating the notion of "character".
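One such way, sketched here as an illustration rather than something the answer prescribes, is to carry the bytes as Base64 text. The element name `blob` and the `encoding` attribute are made up for this example; the payload is arbitrary.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical payload containing byte values that are not legal XML 1.0 characters.
payload = bytes([0x06, 0x16, 0x00, 0x41])

# Store the bytes as Base64 text; the element and attribute names are invented here.
elem = ET.Element("blob", {"encoding": "base64"})
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# Round trip: parse the document and decode the text back into the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

The document only ever contains ordinary ASCII letters, digits, and a few punctuation marks, so the notion of "character" is left alone and even the 0x00 byte survives.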

+14
Feb 02 '09 at 16:52

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B;

You can do just that in XML 1.1, for all but \0.
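For contrast, here is a small sketch of how an XML 1.0 parser reacts to the same escapes. It assumes Python's standard-library `xml.etree.ElementTree`, which is built on expat and, as far as I know, implements XML 1.0 only; the helper name `parses` is invented for this example.

```python
import xml.etree.ElementTree as ET

def parses(document: str) -> bool:
    """Return True if the expat-based XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

bell_reference_ok = parses("<a>&#x7;</a>")  # character reference to U+0007
bell_literal_ok = parses("<a>\x07</a>")     # U+0007 included directly
tab_reference_ok = parses("<a>&#x9;</a>")   # character reference to U+0009 (tab)
```

Under XML 1.0 rules both forms of U+0007 are rejected, while the tab reference is accepted.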

+13
Jan 02 '09 at 13:55

It is probably time for a re-analysis that also takes XML 1.1 into account.

What control character code points exist in Unicode?

  • U+0000 to U+001F, inherited from ASCII
  • U+007F, inherited from ASCII
  • U+0080 to U+009F, inherited from Latin-1
  • various special-purpose ranges, standardized explicitly for Unicode, which are mostly useful, especially in non-markup contexts. They are discussed here block by block, including reasons why and how to use them, or not to use them, in XML, and what to do if you come across them anyway.

How does XML view these control characters?

This is a different classification.

  • Tab and newline (regardless of the platform-dependent form of the newline). Everyone uses them. Everyone knows what they are supposed to mean. They are allowed in almost all forms, often even for pretty-printing the markup itself.
  • U+0000 is evil. Null character? String terminator? Binary noise? The opposite of both interoperability and markup. It is forbidden in all forms.
  • Anything else? Underused, with problematic interoperability, but there are ways to tolerate them without even knowing what they are supposed to "control".

Now we turn our attention only to this last category, the actual control codes. That is, the following summary does NOT apply to tabs and newlines: U+0009, U+000A, U+000D, U+0085, U+2028.

XML 1.0 allows all of the above ranges of control characters, except U+0000 to U+001F, both literally (as directly included characters) and as numeric character references; the U+0000 to U+001F range is forbidden in both forms. Allowing U+007F to U+009F literally was clearly an oversight, and this inconsistency was fixed in XML 1.1, but the other way around: both ranges (except for the evil U+0000) are now allowed, though only as numeric character references. They even gave a detailed explanation inside the standard:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
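The rules described above can be condensed into a small lookup. This is only a sketch based on my reading of the Char and RestrictedChar productions in the two specifications; the function name `classify` and the three labels are invented for illustration.

```python
def classify(cp: int, version: str = "1.0") -> str:
    """Classify a code point for the given XML version.

    'literal'        : may appear directly and as a character reference
    'reference-only' : may appear only as a numeric character reference
    'forbidden'      : may not appear in any form
    """
    # U+0000, surrogates, U+FFFE/U+FFFF and out-of-range values are never allowed.
    if cp <= 0 or 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF) or cp > 0x10FFFF:
        return "forbidden"
    is_c0_control = cp < 0x20 and cp not in (0x9, 0xA, 0xD)
    if version == "1.0":
        # XML 1.0 forbids the C0 controls entirely but lets U+007F-U+009F through.
        return "forbidden" if is_c0_control else "literal"
    if version == "1.1":
        # XML 1.1 restricts C0 and U+007F-U+009F (except U+0085) to references.
        is_c1_control = 0x7F <= cp <= 0x9F and cp != 0x85
        return "reference-only" if (is_c0_control or is_c1_control) else "literal"
    raise ValueError(f"unsupported XML version: {version!r}")
```

For example, `classify(0x07, "1.0")` gives `'forbidden'`, while `classify(0x07, "1.1")` and `classify(0x80, "1.1")` both give `'reference-only'`.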

Why do Unicode and XML allow markup-like characters to be used freely, except for a few "inherited" ranges? People should use markup for them.

Unicode is also used in non-markup contexts, and it is still an evolving character set. It would be too difficult to implement a conforming XML processor if the set of permitted characters were a moving target.

OK, so what is wrong with the inherited ranges compared to Unicode's own control characters?

Lack of standardization. The Unicode Consortium really had no choice about which numbers are assigned to these "characters", or what their typical visual representation or meaning is. Full backward compatibility with ASCII (at the UTF-8 encoded level) and with Latin-1 (at the code point assignment level) forced the inclusion of these code points, regardless of the various specialized and overloaded meanings often associated with them in different text processing contexts.

Wait, are you saying that XML is not fully backward compatible with ASCII, unlike UTF-8?

Yes. That's right. You need a document element. You cannot even include a raw < or &. So why would you ever need to include raw control characters?
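As a small illustration of that point, assuming Python's standard-library `xml.sax.saxutils.escape` (the sample string is arbitrary): plain ASCII text already has to be transformed before it can become XML character data.

```python
from xml.sax.saxutils import escape

# Arbitrary ASCII text that is not valid as raw XML character data.
raw = "if a < b && c > d"
escaped = escape(raw)  # "if a &lt; b &amp;&amp; c &gt; d"

# escape() only rewrites &, < and >; a control character passes through unchanged,
# so it offers no way to make U+0007 acceptable to an XML 1.0 parser.
bell_unchanged = escape("\x07") == "\x07"
```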

+8
Apr 23 '15 at 15:31

XML was designed specifically for Unicode (specifically, UTF-8 and UTF-16) and ISO/IEC 10646, both of which (I'm not quite sure about ISO 10646) contain transmission/flow control characters left over from ASCII and the days of character-based terminals. Although these characters are still in use, they do not belong in a format such as XML.

As for newer encodings that use these codes for something else, well, it looks like the XML specification might need to be adapted.

+1
Dec 31 '09 at 22:48

Why are you double-escaping them? This seems like a good place for &bell; and &escape;. (Undefined, handled by a callback from the parser to your code.)

+1
Jan 09 '09 at 14:53


