.NET: Remove / Strip JavaScript and CSS Code Blocks from an HTML Page

Question

I have an HTML string with blocks of JavaScript and CSS code:

    <script type="text/javascript">
        alert('hello world');
    </script>
    <style type="text/css">
        A:link {text-decoration: none}
        A:visited {text-decoration: none}
        A:active {text-decoration: none}
        A:hover {text-decoration: underline; color: red;}
    </style>

How can I strip these blocks? Any suggestions for a regular expression that can be used to remove them?

+8

html c# regex .net

Ievgen Naida Jun 17 '11 at 8:34

4 answers

Use HtmlAgilityPack for better results, or try this function:

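    // Requires: using System.Text.RegularExpressions;
    public string RemoveScriptAndStyle(string HTML)
    {
        // Match an opening <script> or <style> tag, its contents, and the matching closing tag.
        string Pat = "<(script|style)\\b[^>]*?>.*?</\\1>";
        return Regex.Replace(HTML, Pat, "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }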

+2

Rajeev Jun 17 '11 at 10:47

Just look for the opening <script tag, then remove everything from there through the closing </script> tag.

Do the same for style. Search Google for tips on string manipulation.
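
The find-and-remove idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `TagBlockStripper` and the method `RemoveBlocks` are hypothetical and do not come from the answer. The search is deliberately naive, so it would also match a tag such as `<scripts>` and can be fooled by a `</script` sequence inside a JavaScript string literal.

```csharp
using System;
using System.Text;

public static class TagBlockStripper
{
    // Hypothetical helper (not from the original answer): removes every
    // <tagName ...> ... </tagName> block from html, ignoring case.
    public static string RemoveBlocks(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName;
        var result = new StringBuilder(html.Length);
        int pos = 0;

        while (pos < html.Length)
        {
            // Find the next opening tag at or after the current position.
            int start = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            // Find the closing tag that follows it; if there is none, stop
            // and keep the remaining text unchanged.
            int closeStart = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
            if (closeStart < 0) break;

            int closeEnd = html.IndexOf('>', closeStart);
            if (closeEnd < 0) break;

            // Keep the text before the block, then skip past the closing tag.
            result.Append(html, pos, start - pos);
            pos = closeEnd + 1;
        }

        // Keep whatever follows the last removed block.
        result.Append(html, pos, html.Length - pos);
        return result.ToString();
    }
}
```

A call such as `TagBlockStripper.RemoveBlocks(TagBlockStripper.RemoveBlocks(html, "script"), "style")` would then strip both kinds of block in turn.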

+1

cusimar9 Jun 17 '11 at 8:38

I reinvented the wheel :). It may not be as correct as HtmlAgilityPack, but it is about 5-6 times faster on a 400 KB page. It also optionally converts characters to lowercase and drops digits (it was written for a tokenizer).

    // Requires: using System; using System.Collections.Generic; using System.Text;
    // NOTE: CheckSpecialTags is called below, but its definition was not included in this answer.

    private static readonly List<byte[]> SPECIAL_TAGS = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("script"),
        Encoding.ASCII.GetBytes("style"),
        Encoding.ASCII.GetBytes("noscript")
    };

    private static readonly List<byte[]> SPECIAL_TAGS_CLOSE = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("/script"),
        Encoding.ASCII.GetBytes("/style"),
        Encoding.ASCII.GetBytes("/noscript")
    };

    public static string StripTagsCharArray(string source, bool toLowerCase)
    {
        var array = new char[source.Length];
        var arrayIndex = 0;
        var inside = false;
        var haveSpecialTags = false;
        var compareIndex = -1;
        var singleQouteMode = false;
        var doubleQouteMode = false;
        var matchMemory = SetDefaultMemory(SPECIAL_TAGS);

        for (int i = 0; i < source.Length; i++)
        {
            var let = source[i];

            // Inside a tag and outside quotes: compare the tag name character by character.
            if (inside && !singleQouteMode && !doubleQouteMode)
            {
                compareIndex++;
                if (haveSpecialTags)
                {
                    var endTag = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS_CLOSE, ref matchMemory);
                    if (endTag) haveSpecialTags = false;
                }
                if (!haveSpecialTags)
                {
                    haveSpecialTags = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS, ref matchMemory);
                }
            }

            // Track quoted strings while inside a special block.
            if (haveSpecialTags && let == '"') { doubleQouteMode = !doubleQouteMode; }
            if (haveSpecialTags && let == '\'') { singleQouteMode = !singleQouteMode; }

            if (let == '<')
            {
                matchMemory = SetDefaultMemory(SPECIAL_TAGS);
                compareIndex = -1;
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }

            if (inside) continue;            // skip tag contents
            if (char.IsDigit(let)) continue; // drop digits (tokenizer requirement)
            if (haveSpecialTags) continue;   // skip script/style/noscript bodies

            array[arrayIndex] = toLowerCase ? Char.ToLowerInvariant(let) : let;
            arrayIndex++;
        }

        return new string(array, 0, arrayIndex);
    }

    private static bool[] SetDefaultMemory(List<byte[]> specialTags)
    {
        var memory = new bool[specialTags.Count];
        for (int i = 0; i < memory.Length; i++)
        {
            memory[i] = true;
        }
        return memory;
    }

+1

Suhan Jul 03 '13 at 9:05

Elian ebbing · Accepted Answer · 2011-06-17T09:20:46+0000

The quick 'n' dirty method would be a regular expression:

    var regex = new Regex(
        "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase
    );

    string output = regex.Replace(input, "");

A better* (but possibly slower) option would be to use HtmlAgilityPack:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlInput);

    var nodes = doc.DocumentNode.SelectNodes("//script|//style");

    // SelectNodes returns null when nothing matches, so guard before iterating.
    if (nodes != null)
    {
        foreach (var node in nodes)
            node.ParentNode.RemoveChild(node);
    }

    string htmlOutput = doc.DocumentNode.OuterHtml;

*) For a discussion of why this is better, see this thread.
