Best way to encode text data for XML

I was looking for a generic method in .Net to encode a string for use in an XML element or attribute, and was surprised when I didn't find one right away. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted all of this can contain bad characters like &, <, ", etc. It can also occasionally contain properly escaped entities: &amp;, &lt;, and &quot;, which means that just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

I've used a regular expression in the past to catch just the bad ampersands, and I'm thinking of using it here as the first step, then doing a simple replace for the other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing?

    Function EncodeForXml(ByVal data As String) As String
        ' Matches an ampersand that does not start a named or numeric entity.
        Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

        data = badAmpersand.Replace(data, "&amp;")

        Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
    End Function

Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that would be pretty cool too.

Update: The first few answers indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that performs better than calling .Replace() on immutable strings in series.
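One way to avoid the chained .Replace() calls mentioned in the update is a single pass over the input with a StringBuilder. The following is only a sketch under stated assumptions: the class name XmlEscapeSketch is hypothetical, the input is assumed to be raw (not already escaped) text, and illegal XML characters are not filtered. It is written without C# 3.0 features so it should compile against .Net 2.0.

```csharp
using System.Text;

// Hypothetical helper, not part of the original question: escapes the five
// predefined XML entities in one pass instead of several Replace() calls.
public static class XmlEscapeSketch
{
    public static string EncodeForXml(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        // Reserve a little extra room for the longer entity strings.
        StringBuilder sb = new StringBuilder(data.Length + 16);
        foreach (char c in data)
        {
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

Note that this escapes every ampersand, so input that already contains entities such as &amp; would be double-encoded; for mixed input, the regex step from the question would still be needed.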

+64
xml encoding
Oct 01 '08 at 13:39
12 answers

System.Xml handles the encoding for you, so you don't need a method like this.

+10
01 Oct '08 at 13:46

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.

Edit 2011/8/14: Seeing that at least a few people have consulted this answer over the last couple of years, I decided to completely rewrite the original code, which had numerous issues, including badly mishandling UTF-16.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    /// <summary>
    /// Encodes data so that it can be safely embedded as text in XML documents.
    /// </summary>
    public class XmlTextEncoder : TextReader {
        public static string Encode(string s) {
            using (var stream = new StringReader(s))
            using (var encoder = new XmlTextEncoder(stream)) {
                return encoder.ReadToEnd();
            }
        }

        /// <param name="source">The data to be encoded in UTF-16 format.</param>
        /// <param name="filterIllegalChars">It is illegal to encode certain
        /// characters in XML. If true, silently omit these characters from the
        /// output; if false, throw an error when encountered.</param>
        public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
            _source = source;
            _filterIllegalChars = filterIllegalChars;
        }

        readonly Queue<char> _buf = new Queue<char>();
        readonly bool _filterIllegalChars;
        readonly TextReader _source;

        public override int Peek() {
            PopulateBuffer();
            if (_buf.Count == 0) return -1;
            return _buf.Peek();
        }

        public override int Read() {
            PopulateBuffer();
            if (_buf.Count == 0) return -1;
            return _buf.Dequeue();
        }

        void PopulateBuffer() {
            const int endSentinel = -1;
            while (_buf.Count == 0 && _source.Peek() != endSentinel) {
                // Strings in .NET are assumed to be UTF-16 encoded [1].
                var c = (char) _source.Read();
                if (Entities.ContainsKey(c)) {
                    // Encode all entities defined in the XML spec [2].
                    foreach (var i in Entities[c]) _buf.Enqueue(i);
                }
                else if (!(0x0 <= c && c <= 0x8) &&
                         !new[] { 0xB, 0xC }.Contains(c) &&
                         !(0xE <= c && c <= 0x1F) &&
                         !(0x7F <= c && c <= 0x84) &&
                         !(0x86 <= c && c <= 0x9F) &&
                         !(0xD800 <= c && c <= 0xDFFF) &&
                         !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                    // Allow if the Unicode codepoint is legal in XML [3].
                    _buf.Enqueue(c);
                }
                else if (char.IsHighSurrogate(c) &&
                         _source.Peek() != endSentinel &&
                         char.IsLowSurrogate((char) _source.Peek())) {
                    // Allow well-formed surrogate pairs [1].
                    _buf.Enqueue(c);
                    _buf.Enqueue((char) _source.Read());
                }
                else if (!_filterIllegalChars) {
                    // Note that we cannot encode illegal characters as entity
                    // references due to the "Legal Character" constraint of
                    // XML [4]. Nor are they allowed in CDATA sections [5].
                    throw new ArgumentException(
                        String.Format("Illegal character: '{0:X}'", (int) c));
                }
            }
        }

        static readonly Dictionary<char,string> Entities =
            new Dictionary<char,string> {
                { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
                { '<', "&lt;" }, { '>', "&gt;" },
            };

        // References:
        // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
        // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
        // [3] http://www.w3.org/TR/xml11/#charsets
        // [4] http://www.w3.org/TR/xml11/#sec-references
        // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
    }

Unit tests and full code can be found here.

+71
Apr 08 '09 at 22:27

SecurityElement.Escape

documented here
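For readers who want to see what that call does, here is a minimal sketch. The wrapper class SecurityElementEscapeDemo and its method name are hypothetical; only SecurityElement.Escape itself comes from the framework. It replaces <, >, ", ' and & with the corresponding entities and, as noted in another answer, does not filter characters that are illegal in XML.

```csharp
using System.Security;

// Hypothetical wrapper for illustration; the real work is done by
// System.Security.SecurityElement.Escape.
public static class SecurityElementEscapeDemo
{
    public static string EscapeForXml(string raw)
    {
        // Returns null when raw is null; otherwise replaces
        // <, >, ", ' and & with their predefined XML entities.
        return SecurityElement.Escape(raw);
    }
}
```

For example, passing `<a href="x">Tom & 'Jerry'</a>` yields `&lt;a href=&quot;x&quot;&gt;Tom &amp; &apos;Jerry&apos;&lt;/a&gt;`.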

+30
01 Oct '08 at 13:47

In the past I have used HttpUtility.HtmlEncode to encode text for XML. It performs essentially the same task. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You may have already read it, but here is an article about encoding and decoding XML.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, and then return the string (.ToString()) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method, but I haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0.

+24
01 Oct '08 at 13:45

The Microsoft AntiXss library (now the AntiXssEncoder class in System.Web.dll) has methods for this:

    AntiXss.XmlEncode(string s)
    AntiXss.XmlAttributeEncode(string s)

It also has HTML equivalents:

    AntiXss.HtmlEncode(string s)
    AntiXss.HtmlAttributeEncode(string s)
+15
Aug 29 '09 at 14:36

In .net 3.5+

 new XText("I <want> to & encode this for XML").ToString(); 

Gives you:

    I &lt;want&gt; to &amp; encode this for XML

It turns out that this method doesn't encode some things that it should (quotation marks, for example).

SecurityElement.Escape (workmad3's answer) seems to do a better job of this, and it's included in earlier versions of .net.

If you don't mind third-party code and want to make sure no illegal characters end up in your XML, I would recommend Michael Kropat's answer.

+12
Feb 22 2018-12-12T00:

XmlTextWriter.WriteString() does the escaping.
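A small sketch of what that looks like in practice, writing to a StringWriter so the result can be inspected. The class WriterEscapeSketch and its method are hypothetical names for illustration; XmlTextWriter, WriteStartElement, WriteString and WriteEndElement are the framework calls, and all of them are available in .Net 2.0.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper for illustration: lets XmlTextWriter escape the text.
public static class WriterEscapeSketch
{
    public static string WriteElement(string name, string text)
    {
        StringWriter sw = new StringWriter();
        using (XmlTextWriter writer = new XmlTextWriter(sw))
        {
            writer.WriteStartElement(name);
            writer.WriteString(text);   // markup characters in text are escaped here
            writer.WriteEndElement();
        }
        return sw.ToString();
    }
}
```

Calling `WriterEscapeSketch.WriteElement("name", "a < b & c")` should produce `<name>a &lt; b &amp; c</name>`. The result includes the element tags, so this suits code that builds the document with the writer rather than code that only needs the escaped string.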

+5
01 Oct '08 at 13:48

If this is an ASP.NET application, why not use Server.HtmlEncode()?

+3
01 Oct '08 at 13:46

This might be a case where you could benefit from using the WriteCData method.

    public override void WriteCData(string text)
        Member of System.Xml.XmlTextWriter

    Summary:
    Writes out a <![CDATA[...]]> block containing the specified text.

    Parameters:
    text: Text to place inside the CDATA block.

A simple example would look like this:

    writer.WriteStartElement("name");
    writer.WriteCData("<unsafe characters>");
    writer.WriteFullEndElement();

The result looks like this:

    <name><![CDATA[<unsafe characters>]]></name>

When reading the node values, the XmlReader automatically strips the CDATA wrapper from the inner text, so you don't have to worry about it. The only catch is that you have to store the data as the innerText value of an XML node. In other words, you can't put CDATA content into an attribute value.

+3
Jan 07 '09 at 20:30

Brilliant! That's all I can say.

Here is a VB variant of the updated code (not in a class, just a function) that will clean up and also sanitize the XML:

    Function cXML(ByVal _buf As String) As String
        Dim textOut As New StringBuilder
        Dim c As Char
        ' Guard against Nothing as well as empty input.
        If String.IsNullOrEmpty(_buf) Then Return String.Empty
        For i As Integer = 0 To _buf.Length - 1
            c = _buf(i)
            If Entities.ContainsKey(c) Then
                textOut.Append(Entities.Item(c))
            ElseIf (AscW(c) = &H9 OrElse AscW(c) = &HA OrElse AscW(c) = &HD) OrElse ((AscW(c) >= &H20) AndAlso (AscW(c) <= &HD7FF)) _
                OrElse ((AscW(c) >= &HE000) AndAlso (AscW(c) <= &HFFFD)) OrElse ((AscW(c) >= &H10000) AndAlso (AscW(c) <= &H10FFFF)) Then
                ' Note: a single Char never reaches &H10000, so surrogate pairs are dropped.
                textOut.Append(c)
            End If
        Next
        Return textOut.ToString
    End Function

    Shared ReadOnly Entities As New Dictionary(Of Char, String)() From {{""""c, "&quot;"}, {"&"c, "&amp;"}, {"'"c, "&apos;"}, {"<"c, "&lt;"}, {">"c, "&gt;"}}
0
Nov 18 '11 at 6:25

You can use the built-in XAttribute class, which handles the encoding automatically:

    using System.Xml.Linq;

    XDocument doc = new XDocument();

    List<XAttribute> attributes = new List<XAttribute>();
    attributes.Add(new XAttribute("key1", "val1&val11"));
    attributes.Add(new XAttribute("key2", "val2"));

    XElement elem = new XElement("test", attributes.ToArray());

    doc.Add(elem);

    string xmlStr = doc.ToString();
0
Apr 23 '15 at 11:04

Here is a single-line solution using XElements. I use it in a very small tool. I don't need it a second time, so I keep it this way. (It's a weird hack.)

    StrVal = (<x a=<%= StrVal %>>END</x>).ToString().Replace("<x a=""", "").Replace(""">END</x>", "")

Oh, and it only works in VB, not in C#.

0
Mar 30 '17 at 9:55