XML Spec and UTF-16

Section 4.3.3 and Appendix F of the XML 1.0 specification discuss UTF-16, the byte order mark (BOM) in UTF-16 encoded data streams, and the XML encoding declaration. From the information in these sections, it seems that UTF-16 documents require a byte order mark. But the summary table in Appendix F gives a scenario where the UTF-16 input has no byte order mark, although that scenario does have an XML declaration. According to Section 4.3.3, an encoding declaration is not required in a UTF-16 encoded document (and the XML declaration itself is optional in this case).

Given this information, is a UTF-16 XML document that contains neither a BOM nor an XML declaration, and that has no encoding information provided externally, considered well-formed if the rest of the document is well-formed?

1 answer

From the Unicode 6.2 specification (p. 99):

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Therefore, a BOM is not required by UTF-16 itself. But there may be a “higher-level protocol”, such as the XML specification, that says what to do with UTF-16 XML documents that lack a BOM.
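As a rough illustration of the Unicode rule quoted above, here is a minimal Python sketch. The function name `decode_utf16_scheme` is made up for this example; it only models the Unicode default and says nothing about what XML requires.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes in the UTF-16 encoding scheme (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian, BOM is not part of the text
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian, BOM is not part of the text
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: Unicode says big-endian.
    return data.decode("utf-16-be")
```

The byte order is named explicitly in every branch so that the result does not depend on any platform or codec default.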

Section 4.3.3 of the XML 1.0 specification states:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with a Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).

I will come back to the above. Appendix F describes approaches to detecting character encodings in the absence of a BOM. But I don’t think that appendix is relevant to your question, since you are asking whether a UTF-16 XML document without a BOM and without an XML declaration is “well-formed”, and Appendix F is a non-normative part of the specification.
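To make the Appendix F idea concrete, here is a simplified Python sketch of the no-BOM heuristic. It covers only a few rows of the appendix's table (the UCS-4 and EBCDIC rows are left out), and the function name and return labels are invented for this example.

```python
def guess_encoding_family(data: bytes) -> str:
    """Guess an encoding family from the first four bytes (no BOM present).

    Simplified from the Appendix F table; labels are hypothetical.
    """
    prefix = data[:4]
    if prefix == b"\x00<\x00?":
        # "<?" as 16-bit big-endian code units
        return "utf-16-be"
    if prefix == b"<\x00?\x00":
        # "<?" as 16-bit little-endian code units
        return "utf-16-le"
    if prefix == b"<?xm":
        # ASCII-compatible; the encoding declaration must be read to decide
        return "ascii-compatible"
    # Anything else is treated as UTF-8 without an encoding declaration
    return "utf-8-assumed"
```

Note that the UTF-16 rows depend on the bytes of `<?`, that is, on an XML declaration being present. A BOM-less UTF-16 document that starts directly with `<a/>` falls through to the last branch, which is exactly the case asked about.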

So, back to the specification: a document is well-formed if "Taken as a whole, it matches the production labeled document." (Section 2.1). A review of the document production shows that the XML declaration is optional (this is also mentioned in Section 2.8). Thus, it is possible to have a well-formed document without an XML declaration; this answers half of your question.

The other half is whether a UTF-16 XML document without an XML declaration, and also without a BOM, can be well-formed. Section 4.3.3 says (the final clause is the relevant one):

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

Based on this, a UTF-16 XML document without a BOM and without an encoding declaration (which is part of the XML declaration) is not a well-formed document in the absence of external information, since a fatal error violates well-formedness (see the definition of well-formedness constraint in Section 1.2). This also matches what Section 4.3.3 said earlier about the BOM requirement for UTF-16.
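The rule from Section 4.3.3 quoted above can be written out as a small decision function. This is only a sketch of my reading of that one sentence: the function name and parameters are hypothetical, the encoding that the entity actually uses is assumed to be known to the caller, and encoding names are compared as plain case-insensitive labels.

```python
from typing import Optional


def triggers_fatal_error(actual: str, has_bom: bool,
                         declared: Optional[str],
                         external: Optional[str] = None) -> bool:
    """Return True if the quoted Section 4.3.3 sentence makes this fatal."""
    if external is not None:
        # The sentence only applies when no external protocol gives the encoding.
        return False
    if declared is not None:
        # Encoding declaration present: it must name the encoding actually used.
        return declared.strip().lower() != actual.strip().lower()
    if not has_bom:
        # Neither BOM nor encoding declaration: only UTF-8 is allowed.
        return actual.strip().lower() != "utf-8"
    return False
```

For the document in the question, `triggers_fatal_error("utf-16", has_bom=False, declared=None)` returns `True`, while the same document with a BOM, or with externally supplied encoding information, returns `False`.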
