Conditionally escape special XML characters

I looked around a lot, but could not find a built-in .NET method that escapes the special XML characters < , > , & , ' and " only when they are not part of a tag.

For example, take the following text:

 Test& <b>bold</b> <i>italic</i> <<Tag index="0" /> 

I want it to be converted to:

 Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" /> 

Please note that the tags are not escaped. I need to assign this value to the InnerXml property of an XmlElement, so the tags must be preserved.

I have looked into implementing my own parser, using StringBuilder to optimize it as much as possible, but that could get quite messy.

I also know the acceptable tags, which may simplify things (only: br, b, i, u, blink, flash, Tag). In addition, these tags may be self-closing tags

 (eg <u />) 

or container tags

 (eg <u>...</u>) 
+6
3 answers

NOTE: This can probably be optimized; it is just something I knocked together quickly. Also note that I do not validate the tags themselves. The code simply looks for content enclosed in angle brackets. It will also fail if an angle bracket appears inside a tag (for example, <sometag label="I put an > here"> ). Other than that, I think it should do what you ask.

    namespace ConsoleApplication1
    {
        using System;
        using System.Text.RegularExpressions;

        class Program
        {
            static void Main(string[] args)
            {
                // This is the test string.
                const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

                // Do a regular expression search and replace. We're looking for a complete tag
                // (which will be ignored) or a character that needs escaping.
                string result = Regex.Replace(
                    testString,
                    @"(?'Tag'\<{1}[^\>\<]*[\>]{1})|(?'Ampy'\&[A-Za-z0-9]+;)|(?'Special'[\<\>\""\'\&])",
                    (match) =>
                    {
                        // If a special (escapable) character was found, replace it.
                        if (match.Groups["Special"].Success)
                        {
                            switch (match.Groups["Special"].Value)
                            {
                                case "<": return "&lt;";
                                case ">": return "&gt;";
                                case "\"": return "&quot;";
                                case "\'": return "&apos;";
                                case "&": return "&amp;";
                                default: return match.Groups["Special"].Value;
                            }
                        }

                        // Otherwise, just return what was found.
                        return match.Value;
                    });

                // Show the result.
                Console.WriteLine("Test String: " + testString);
                Console.WriteLine("Result : " + result);
                Console.ReadKey();
            }
        }
    }
+2

I personally do not think this is possible, because you are really trying to repair malformed HTML, and so there are no rules you can use to determine what should be encoded and what should not.

However you look at it, something like <<Tag index="0" /> is not valid HTML.

If you know the actual tags, you can create a whitelist, which could simplify things, but you will need to attack your problem more specifically. I don't think you can solve it for every scenario.

In practice, you most likely do not have a stray < or > lying around in your text, which (possibly) simplifies the problem a great deal, but if you are really trying to come up with a general solution ... I wish you good luck.
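The whitelist idea could be sketched roughly as follows, assuming the tag names listed in the question (br, b, i, u, blink, flash, Tag). The class name XmlTextEscaper and the method name Escape are made up for illustration. The sketch leaves whitelisted tags and existing entities untouched and escapes every other special character. It does not check that tags are balanced or that attributes are well formed, so assigning the result to InnerXml can still throw an XmlException on bad input.

```csharp
using System;
using System.Text.RegularExpressions;

static class XmlTextEscaper
{
    // Assumption: the only tags that may pass through are the ones from the question.
    // XML names are case-sensitive, so no RegexOptions.IgnoreCase here.
    private const string AllowedNames = "br|b|i|u|blink|flash|Tag";

    // Alternative 1: a complete whitelisted tag (opening, closing, or self-closing).
    // Alternative 2: an entity that is already escaped, e.g. &amp; or &#38;.
    // Alternative 3: any single special character that still needs escaping.
    private static readonly Regex Pattern = new Regex(
        @"(?<tag></?(?:" + AllowedNames + @")\b[^<>]*>)" +
        @"|(?<entity>&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)" +
        @"|(?<special>[<>&'""])");

    public static string Escape(string input)
    {
        return Pattern.Replace(input, m =>
        {
            // Tags and existing entities are returned unchanged.
            if (!m.Groups["special"].Success)
            {
                return m.Value;
            }

            switch (m.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "&": return "&amp;";
                case "'": return "&apos;";
                default: return "&quot;"; // the only remaining character in the class is "
            }
        });
    }
}
```

Because the tag alternative requires a word boundary after the name, something like <bold> is not mistaken for <b> and gets escaped. Like the regex answer above, this still breaks if a > appears inside an attribute value.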

+2

Here is a regex you can use that will match any invalid < or > .

 (\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>) 

I suggest putting the valid tag expression in a variable and then building the rest around it.

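    var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
    // A '<' that is not followed by one of the valid tags.
    var startTag = @"\<(?! ?/?(?:" + validTags + "))";
    // A '>' that is not preceded by one of the valid tags.
    var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";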

Then just call Regex.Replace with each of them.
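Put together, the two patterns could be applied like this. This is only a sketch: the class name BracketEscaper and the ampersand pattern are assumptions added for illustration, since the patterns above deal with angle brackets only. Quotes and apostrophes are left alone, which is fine for text content but not inside attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

static class BracketEscaper
{
    private const string ValidTags = "b|i|br|u|blink|flash|Tag[^>]*";

    // A '<' that does not start one of the valid tags.
    private static readonly Regex StrayLt =
        new Regex(@"<(?! ?/?(?:" + ValidTags + "))");

    // A '>' that does not end one of the valid tags.
    // .NET regular expressions allow variable-length lookbehind.
    private static readonly Regex StrayGt =
        new Regex(@"(?<! ?/?(?:" + ValidTags + "))>");

    // Assumption, not part of the answer: an '&' that does not already start an entity.
    private static readonly Regex StrayAmp =
        new Regex(@"&(?![A-Za-z0-9#]+;)");

    public static string Escape(string input)
    {
        // Ampersands go first, then the stray angle brackets.
        string result = StrayAmp.Replace(input, "&amp;");
        result = StrayLt.Replace(result, "&lt;");
        return StrayGt.Replace(result, "&gt;");
    }
}
```

Note that these lookarounds only inspect the characters next to the bracket, so text such as "a b> c" is still treated as the end of a valid tag and left unescaped.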

+1
