Handling strings to insert into XElement

We collect many strings and send them to our clients in XML fragments. These strings can contain literally any character. We hit an error caused by an attempt to serialize XElement instances containing invalid characters. Here is an example:

    // Requires: using System.Xml.Linq;
    var message = new XElement("song");
    char c = (char)0x1a; // SUB
    var someData = string.Format("some{0}stuff", c);
    var attr = new XAttribute("someAttr", someData);
    message.Add(attr);
    string msgStr = message.ToString(SaveOptions.DisableFormatting); // exception here

The above code throws an exception on the marked line. Here is the stack trace:

  System.ArgumentException: '', hexadecimal value 0x1A, is an invalid character.
    at System.Xml.XmlEncodedRawTextWriter.InvalidXmlChar (Int32 ch, Char * pDst, Boolean entitize)
    at System.Xml.XmlEncodedRawTextWriter.WriteAttributeTextBlock (Char * pSrc, Char * pSrcEnd)
    at System.Xml.XmlEncodedRawTextWriter.WriteString (String text)
    at System.Xml.XmlWellFormedWriter.WriteString (String text)
    at System.Xml.XmlWriter.WriteAttributeString (String prefix, String localName, String ns, String value)
    at System.Xml.Linq.ElementWriter.WriteStartElement (XElement e)
    at System.Xml.Linq.ElementWriter.WriteElement (XElement e)
    at System.Xml.Linq.XElement.WriteTo (XmlWriter writer)
    at System.Xml.Linq.XNode.GetXmlString (SaveOptions o)

My suspicion is that this is incorrect behavior, and that an invalid character should be escaped in the XML instead. Whether this is desirable or not is a question that I will answer later.

So here is the question:

Is there a way to handle strings so that this error cannot occur, or should I just delete all characters below 0x20 and cross my fingers?

2 answers

This is what I use in my code:

    // Requires: using System; using System.Text.RegularExpressions;

    // Matches every control character below 0x20.
    private static readonly Lazy<Regex> ControlChars =
        new Lazy<Regex>(() => new Regex(@"[\x00-\x1f]", RegexOptions.Compiled));

    // Default replacer: keeps tab, LF and CR; turns anything else into "&#xNNNN;".
    private static string FixData_Replace(Match match)
    {
        if (match.Value.Equals("\t") || match.Value.Equals("\n") || match.Value.Equals("\r"))
            return match.Value;

        // "&#x" marks a hexadecimal character reference, matching the X4 format.
        return "&#x" + ((int)match.Value[0]).ToString("X4") + ";";
    }

    public static string Fix(object data, MatchEvaluator replacer = null)
    {
        if (data == null)
            return null;

        // Use the caller's replacer if one is given, otherwise the default above.
        return ControlChars.Value.Replace(
            data.ToString(), replacer ?? new MatchEvaluator(FixData_Replace));
    }

All characters below 0x20 (except \r, \n and \t) are replaced by a hexadecimal character reference: 0x1f => "&#x001F;". Just use new XAttribute("attribute", Fix(yourString)). Note that XAttribute and XElement escape the ampersand when they serialize (the output contains "&amp;#x001F;"), so an XML parser hands back the literal text "&#x001F;" rather than the character 0x1f; the receiving side has to convert it back itself.

It works for XElement content, and it should probably work for XAttribute values as well.
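
To see how such escaped text travels through an attribute, here is a minimal round-trip sketch. The class and method names (ControlCharCodec, Encode, Decode, RoundTrip) are hypothetical and not part of the answer above, and the sketch assumes the receiving side is willing to decode the "&#xNNNN;" text itself.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class ControlCharCodec
{
    // Control characters below 0x20, excluding tab (0x09), LF (0x0A) and CR (0x0D).
    private static readonly Regex Control = new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F]");

    // The textual form produced by Encode: "&#x", four hex digits, ";".
    private static readonly Regex Escaped = new Regex(@"&#x([0-9A-Fa-f]{4});");

    public static string Encode(string s)
    {
        return Control.Replace(s, m => "&#x" + ((int)m.Value[0]).ToString("X4") + ";");
    }

    public static string Decode(string s)
    {
        return Escaped.Replace(s, m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    // Sends a value through an attribute and back, as sender and receiver would.
    public static string RoundTrip(string raw)
    {
        var element = new XElement("song", new XAttribute("someAttr", Encode(raw)));
        // The ampersand is escaped here, so no invalid character reaches the writer.
        string xml = element.ToString(SaveOptions.DisableFormatting);
        string received = XElement.Parse(xml).Attribute("someAttr").Value;
        return Decode(received);
    }
}
```

One caveat of this design: a string that already contains literal text of the form "&#xNNNN;" would be altered by Decode, so the scheme is only safe if that pattern cannot occur in the data or is escaped separately.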


A little digging with ILSpy showed that you can use the CheckCharacters property of XmlWriterSettings / XmlReaderSettings to control whether an exception is thrown for invalid characters. Borrowing from the XNode.ToString and XDocument.Parse methods, I came up with the following examples:

To serialize an XLinq object that contains invalid (control) characters:

    // Requires: using System.IO; using System.Xml; using System.Xml.Linq;
    XDocument xdoc = XDocument.Parse("<root>foo</root>");
    using (StringWriter stringWriter = new StringWriter())
    {
        XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            CheckCharacters = false // do not throw on invalid characters
        };
        using (XmlWriter xmlWriter = XmlWriter.Create(stringWriter, xmlWriterSettings))
        {
            xdoc.WriteTo(xmlWriter);
        }
        return stringWriter.ToString();
    }

To parse text containing invalid characters into an XLinq object:

    // Requires: using System.IO; using System.Xml; using System.Xml.Linq;
    // "text" holds the XML string to parse.
    XDocument xdoc;
    using (StringReader stringReader = new StringReader(text))
    {
        XmlReaderSettings xmlReaderSettings = new XmlReaderSettings
        {
            CheckCharacters = false, // accept invalid characters
            DtdProcessing = DtdProcessing.Parse,
            MaxCharactersFromEntities = 10000000L,
            XmlResolver = null
        };
        using (XmlReader xmlReader = XmlReader.Create(stringReader, xmlReaderSettings))
        {
            xdoc = XDocument.Load(xmlReader);
        }
    }
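
Applied to the element from the question, the two fragments might be wrapped as in the sketch below. The LenientXml class and its method names are hypothetical, and the sketch assumes both the sending and the receiving side can set CheckCharacters to false.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical wrapper; the class and method names are illustrative only.
public static class LenientXml
{
    // Serializes a node without validating its characters.
    public static string Serialize(XNode node)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            CheckCharacters = false
        };
        using (var stringWriter = new StringWriter())
        {
            using (var xmlWriter = XmlWriter.Create(stringWriter, settings))
            {
                node.WriteTo(xmlWriter);
            }
            // Read the text only after the XmlWriter has been flushed and disposed.
            return stringWriter.ToString();
        }
    }

    // Parses text without validating its characters.
    public static XElement Parse(string text)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var stringReader = new StringReader(text))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return XElement.Load(xmlReader);
        }
    }
}
```

A call such as LenientXml.Parse(LenientXml.Serialize(message)) should then give back the attribute value including the 0x1A character. The serialized text is not well-formed XML 1.0, so a strict parser on the client side will still reject it; this approach only helps when you control both ends.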