How to strip HTML tags that are not in a safe list

Is there a method that removes all HTML tags that are not in a list of safe tags? If not, what regex would achieve this?

I am looking for something like PHP's strip_tags.

+1
3 answers

NullUserException's answer is perfect. I wrote a small extension method that does this, and I am posting it here in case anyone else needs it.

    using System;
    using System.Linq;
    using System.Xml;

    namespace Extenders
    {
        public static class StringExtender
        {
            internal static void ParseHtmlDocument(XmlDocument doc, XmlNode root, string[] allowedTags, string[] allowedAttributes, string[] allowedStyleKeys)
            {
                if (root == null)
                    root = doc.DocumentElement;

                // Snapshot the children first: ReplaceChild below mutates the live
                // ChildNodes list, which would otherwise cut the enumeration short.
                var nodes = root.ChildNodes.Cast<XmlNode>().ToList();

                foreach (XmlNode node in nodes)
                {
                    if (!allowedTags.Any(x => string.Equals(x, node.Name, StringComparison.OrdinalIgnoreCase)))
                    {
                        // Disallowed: keep only the text content, dropping the tag
                        // and every tag nested inside it.
                        var safeNode = doc.CreateTextNode(node.InnerText);
                        root.ReplaceChild(safeNode, node);
                        continue;
                    }

                    if (node.Attributes != null)
                    {
                        var attrList = node.Attributes.OfType<XmlAttribute>().ToList();
                        foreach (XmlAttribute attr in attrList)
                        {
                            if (!allowedAttributes.Any(x => string.Equals(x, attr.Name, StringComparison.OrdinalIgnoreCase)))
                                node.Attributes.Remove(attr);
                            // TODO: if style is allowed, check the allowed keys: values
                        }
                    }

                    if (node.HasChildNodes)
                        ParseHtmlDocument(doc, node, allowedTags, allowedAttributes, allowedStyleKeys);
                }
            }

            public static string ParseSafeHtml(this string input, string[] allowedTags, string[] allowedAttributes, string[] allowedStyleKeys)
            {
                // Wrap the fragment in a single root element so it loads as XML.
                // LoadXml throws XmlException if the input is not well-formed.
                var xmlDoc = new XmlDocument();
                xmlDoc.LoadXml("<span>" + input + "</span>");
                ParseHtmlDocument(xmlDoc, null, allowedTags, allowedAttributes, allowedStyleKeys);

                // InnerXml serializes only the children, so the wrapper needs no trimming.
                return xmlDoc.DocumentElement.InnerXml;
            }
        }
    }

Usage:

    var x = "<b>allowed</b><b class='text'>allowed attr</b><b id='5'>not allowed attr</b><i>not all<b>o</b>wed tag</i>"
        .ParseSafeHtml(new string[] { "b", "#text" }, new string[] { "class" }, new string[] { });

Which outputs:

    <b>allowed</b><b class="text">allowed attr</b><b>not allowed attr</b>not allowed tag

If an element is not allowed, it is replaced by its InnerText: the tag is dropped, along with any tags nested inside it.
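To illustrate that last point, here is a minimal sketch of how XmlNode.InnerText flattens nested markup. The InnerTextDemo class and its Flatten helper are hypothetical names used only for this example, and the input is assumed to be well-formed XML.

```csharp
using System;
using System.Xml;

public static class InnerTextDemo
{
    // Loads a well-formed fragment and returns the concatenated text
    // of the root element and all of its descendants.
    public static string Flatten(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // The nested <b> disappears together with the outer <i>.
        Console.WriteLine(Flatten("<i>not all<b>o</b>wed tag</i>")); // prints: not allowed tag
    }
}
```

This is why the nested `<b>` in the disallowed `<i>` does not survive in the output above, even though `b` is in the allowed list.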

+2

Do. Not. Use. Regex. To. Parse. HTML.

Use an XML parser:
MSDN Link
Simple tutorial
HTMLAgilityPack
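As a rough illustration of why, here is a sketch that compares a naive tag-stripping regex with XmlDocument from System.Xml. The pattern `<[^>]+>` and the RegexVsParserDemo class are assumptions made up for this example, not something taken from the question, and the parser half assumes the input is well-formed XML.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class RegexVsParserDemo
{
    // Naive approach: delete anything that looks like <...>.
    public static string StripWithRegex(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }

    // Parser approach: load the markup and read its text content.
    // Assumes well-formed XML with a single root element.
    public static string StripWithParser(string html)
    {
        var doc = new XmlDocument();
        doc.LoadXml(html);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // A '>' inside an attribute value is legal, but the regex
        // mistakes it for the end of the tag.
        string input = "<a title=\"a > b\">link</a>";
        Console.WriteLine("[" + StripWithRegex(input) + "]");  // prints: [ b">link]
        Console.WriteLine("[" + StripWithParser(input) + "]"); // prints: [link]
    }
}
```

The regex leaks part of the attribute into the result, while the parser returns only the text. Real-world HTML is often not well-formed XML, which is where a dedicated HTML parser such as HTMLAgilityPack comes in.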

+7

You can use the MS AntiXSS library to sanitize potentially executable HTML. Take a look here:

http://msdn.microsoft.com/en-us/security/aa973814.aspx

http://wpl.codeplex.com/

0