Escape invalid XML characters in C#

I have a string that contains invalid XML characters. How can I escape (or remove) the invalid XML characters before I parse the string?

+72
c# xml escaping
Nov 30 '11 at 18:39
7 answers

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It has been available since .NET Framework 4 and is also present in Silverlight. Here is a small sample:

    // Requires: using System; using System.Linq; using System.Xml;
    void Main()
    {
        string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    // Keeps only the chars that are valid XML characters on their own.
    static string RemoveInvalidXmlChars(string text)
    {
        var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return new string(validXmlChars);
    }

    // VerifyXmlChars throws if the string contains an invalid XML character.
    static bool IsValidXmlString(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch
        {
            return false;
        }
    }

To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

    // Requires: using System; using System.Xml;
    void Main()
    {
        const string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        string encoded = XmlConvert.EncodeName(content);
        Console.WriteLine(IsValidXmlString(encoded)); // True

        string decoded = XmlConvert.DecodeName(encoded);
        Console.WriteLine(content == decoded); // True
    }

    static bool IsValidXmlString(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch
        {
            return false;
        }
    }

Update: note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store the encoded string in a database column with a length restriction but validate the length of the source string in your application against that restriction.
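A minimal sketch of the length issue, assuming a hypothetical column limit of 15 characters. The helper name FitsColumn and the limit are made up for illustration; only XmlConvert.EncodeName comes from the answer above. It checks the encoded length rather than the source length:

```csharp
using System;
using System.Xml;

public static class EncodedLengthDemo
{
    // Hypothetical helper: true if the *encoded* form fits the column.
    public static bool FitsColumn(string source, int maxColumnLength)
    {
        string encoded = XmlConvert.EncodeName(source) ?? string.Empty;
        return encoded.Length <= maxColumnLength;
    }

    public static void Main()
    {
        string source = "Order Details";
        string encoded = XmlConvert.EncodeName(source);

        Console.WriteLine(encoded);                                 // Order_x0020_Details
        Console.WriteLine($"{source.Length} -> {encoded.Length}");  // 13 -> 19
        Console.WriteLine(FitsColumn(source, 15));                  // False
    }
}
```

The source string would pass a 15-character check, but the encoded string would not, so the check belongs after encoding.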

+98
Feb 16 '13 at 16:49

Use SecurityElement.Escape

    using System;
    using System.Security;

    class Sample
    {
        static void Main()
        {
            string text = "Escape characters : < > & \" \'";
            string xmlText = SecurityElement.Escape(text);

            // Output:
            // Escape characters : &lt; &gt; &amp; &quot; &apos;
            Console.WriteLine(xmlText);
        }
    }

+61
Dec 01 '11 at 13:44

If you are writing XML, just use the classes provided by the framework to create it. You won't have to worry about escaping or anything else.

 Console.Write(new XElement("Data", "< > &")); 

Will output

 <Data>&lt; &gt; &amp;</Data> 

If you need to read an XML file that is malformed, do not use regular expressions. Use the Html Agility Pack instead.

+19
Nov 30 '11 at 18:46

The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:

    static void Main()
    {
        const string content = "\v\U00010330";
        string newContent = RemoveInvalidXmlChars(content);
        Console.WriteLine(newContent);
    }

This returns an empty string, but it should not! It should return "\U00010330", because the character U+10330 is a valid XML character.
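A short sketch of why the per-character filter drops U+10330: in a .NET string the code point is stored as two UTF-16 chars (a high and a low surrogate), and each half is rejected by XmlConvert.IsXmlChar when tested alone. The class name SurrogateDemo is only for illustration.

```csharp
using System;
using System.Xml;

public static class SurrogateDemo
{
    public static void Main()
    {
        string s = "\U00010330"; // one code point, stored as two UTF-16 chars

        Console.WriteLine(s.Length);                     // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));   // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));    // True

        // Each half on its own is not a valid XML character...
        Console.WriteLine(XmlConvert.IsXmlChar(s[0]));   // False
        Console.WriteLine(XmlConvert.IsXmlChar(s[1]));   // False

        // ...but together they form a valid pair (low char first, then high char).
        Console.WriteLine(XmlConvert.IsXmlSurrogatePair(s[1], s[0])); // True
    }
}
```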

To support surrogate characters, I suggest using the following method:

    // Requires: using System.Text; using System.Xml;
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        int length = text.Length;
        StringBuilder stringBuilder = new StringBuilder(length);

        for (int i = 0; i < length; ++i)
        {
            if (XmlConvert.IsXmlChar(text[i]))
            {
                stringBuilder.Append(text[i]);
            }
            // IsXmlSurrogatePair takes the low surrogate first, then the high one.
            else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
            {
                stringBuilder.Append(text[i]);
                stringBuilder.Append(text[i + 1]);
                ++i; // skip the low surrogate that was just appended
            }
        }

        return stringBuilder.ToString();
    }

+4
Jul 18 '13 at 23:23

Here is an optimized version of the aforementioned RemoveInvalidXmlChars method. It does not create a new array on every call, which would put unnecessary pressure on the GC:

    // Requires: using System.Text; using System.Xml;
    public static string RemoveInvalidXmlChars(string text)
    {
        if (text == null) return text;
        if (text.Length == 0) return text;

        // A bit complicated, but avoids allocating memory unless it is necessary.
        StringBuilder result = null;
        for (int i = 0; i < text.Length; i++)
        {
            var ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                // Only copy once an invalid char has been seen.
                result?.Append(ch);
            }
            else if (result == null)
            {
                // First invalid char: copy everything before it.
                result = new StringBuilder();
                result.Append(text, 0, i);
            }
        }

        if (result == null)
            return text; // no invalid XML chars detected: return the original text
        else
            return result.ToString();
    }

+3
Apr 27 '16 at 9:19
    // Replace invalid characters with empty strings.
    string cleaned = Regex.Replace(inputString, @"[^\w\.@-]", "");

The regular expression pattern [^\w\.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced with String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add them to the character class in the regular expression pattern. For example, the pattern [^\w\.@-\\%] also allows a percent sign and a backslash in the input string.

    string cleaned = Regex.Replace(inputString, @"[!@#$%_]", "");
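As a small illustration of the first pattern above, here is a sketch with a made-up input string. The class and method names (RegexCleanDemo, Clean) are hypothetical. Note that Regex.Replace returns a new string and leaves the input unchanged, so the result has to be captured.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleanDemo
{
    // Removes every character that is not a word character, '.', '@' or '-'.
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^\w\.@-]", "");
    }

    public static void Main()
    {
        Console.WriteLine(Clean("user<name>@example.com!")); // username@example.com
        Console.WriteLine(Clean("a b"));                     // ab
    }
}
```

As the second call shows, this whitelist also removes spaces, so it is stricter than a check for valid XML characters.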

See also:

Removing Invalid Characters from XML Name Tag - RegEx C#

Here is a function to remove characters from a specified XML string:

    using System;
    using System.Text.RegularExpressions;

    namespace XMLUtils
    {
        class Standards
        {
            /// <summary>
            /// Strips non-printable ascii characters
            /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
            /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
            /// </summary>
            /// <param name="tmpContents">contents</param>
            /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
            /// <returns>The contents with all pattern matches removed</returns>
            public static string StripIllegalXMLChars(string tmpContents, string XMLVersion)
            {
                // Patterns are kept as posted; each matches the text "#x" followed by hex digits.
                string pattern;
                switch (XMLVersion)
                {
                    case "1.0":
                        pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
                        break;
                    case "1.1":
                        pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                        break;
                    default:
                        throw new ArgumentException("Error: Invalid XML Version!");
                }

                // Return the result; the version as posted discarded it.
                Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
                return regex.Replace(tmpContents, String.Empty);
            }
        }
    }

0
Nov 30 '11 at 19:29
    // Requires: using System.Xml;
    string XMLWriteStringWithoutIllegalCharacters(string UnfilteredString)
    {
        if (UnfilteredString == null)
            return string.Empty;

        return XmlConvert.EncodeName(UnfilteredString);
    }

    string XMLReadStringWithoutIllegalCharacters(string FilteredString)
    {
        if (FilteredString == null)
            return string.Empty;

        return XmlConvert.DecodeName(FilteredString);
    }

This simple approach replaces each invalid character with an encoded equivalent that is accepted in XML.

To write a string, use XMLWriteStringWithoutIllegalCharacters(string UnfilteredString).
To read a string, use XMLReadStringWithoutIllegalCharacters(string FilteredString).

0
May 16 '19 at 16:48