How to extract only the <body> contents of a site
I am working on a web scraper. At the moment I download the whole page, then use regular expressions to delete <meta>, <script>, <style> and other tags, and end up with the contents of the body.
However, I am trying to optimize performance, and I was wondering if there is a way to retrieve only the <body> of the page?
namespace WebScrapper
{
    public static class KrioScraper
    {
        public static string scrapeIt(string siteToScrape)
        {
            string HTML = getHTML(siteToScrape);
            string text = stripCode(HTML);
            return text;
        }

        public static string getHTML(string siteToScrape)
        {
            string response = "";
            HttpWebResponse objResponse;
            HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(siteToScrape);
            objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
                "Windows NT 5.1; .NET CLR 1.0.3705)";
            objResponse = (HttpWebResponse)objRequest.GetResponse();
            using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
            {
                response = sr.ReadToEnd();
                sr.Close();
            }
            return response;
        }

        public static string stripCode(string the_html)
        {
            // Remove Google Analytics code and other JS
            the_html = Regex.Replace(the_html, "<script.*?</script>", "",
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove inline stylesheets
            the_html = Regex.Replace(the_html, "<style.*?</style>", "",
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove HTML tags
            the_html = Regex.Replace(the_html, "</?[a-z][a-z0-9]*[^<>]*>", "",
                RegexOptions.IgnoreCase);
            // Remove HTML comments
            the_html = Regex.Replace(the_html, "<!--(.|\\s)*?-->", "");
            // Remove doctype
            the_html = Regex.Replace(the_html, "<!(.|\\s)*?>", "");
            // Collapse excessive whitespace
            the_html = Regex.Replace(the_html, "[\t\r\n]", " ");
            return the_html;
        }
    }
}

From Page_Load I call the scrapeIt() method, passing it the string that I get from the text box on the page.
I think your best option is to use a lightweight HTML parser (something like Majestic-12, which, based on my tests, is roughly 50-100% faster than the HTML Agility Pack) and only process the nodes that interest you (anything between <body> and </body>). Majestic-12 is a bit harder to use than the HTML Agility Pack, but if you're looking for performance, it will definitely help you!
This gets you close to what you ask for, but you will still have to download the entire page; I don't think there is a way around that. What you save is generating DOM nodes for all the other content (everything outside the body). You will still have to parse the tokens, but you can skip the content of any node you are not interested in processing.
Here is a good example of using the M12 parser.
I do not have a ready-made example of how to capture the body, but I have one that only captures links, and with small changes it will get there. Here is a rough version:
GrabBody(ParserTools.OpenM12Parser(_response.BodyBytes));

First you need to open the M12 parser (the example project that comes with M12 contains comments describing in detail how all of these parameters affect performance, and they do!):
public static HTMLparser OpenM12Parser(byte[] buffer)
{
    HTMLparser parser = new HTMLparser();
    parser.SetChunkHashMode(false);
    parser.bKeepRawHTML = false;
    parser.bDecodeEntities = true;
    parser.bDecodeMiniEntities = true;

    if (!parser.bDecodeEntities && parser.bDecodeMiniEntities)
        parser.InitMiniEntities();

    parser.bAutoExtractBetweenTagsOnly = true;
    parser.bAutoKeepScripts = true;
    parser.bAutoMarkClosedTagsWithParamsAsOpen = true;

    parser.CleanUp();
    parser.Init(buffer);
    return parser;
}

Then parse the body:
public void GrabBody(HTMLparser parser)
{
    // The parser returns tokens called HTMLchunk -- warning: DO NOT destroy
    // the chunk until the end of parsing, because HTMLparser re-uses the object
    HTMLchunk chunk = null;

    // we parse until the returned chunk is null, indicating the end of parsing
    while ((chunk = parser.ParseNext()) != null)
    {
        switch (chunk.oType)
        {
            // matched open tag, i.e. <a href="">
            case HTMLchunkType.OpenTag:
                if (chunk.sTag == "body")
                {
                    // Start generating the DOM node (as shown in the example linked above)
                }
                break;

            // matched close tag, i.e. </a>
            case HTMLchunkType.CloseTag:
                break;

            // matched normal text
            case HTMLchunkType.Text:
                break;

            // matched HTML comment, the stuff between <!-- and -->
            case HTMLchunkType.Comment:
                break;
        }
    }
}

Creating the DOM nodes is the tricky part, but the Majestic12ToXml class will help you with that. As I said, this is in no way equivalent to the 3-liner you saw with the HTML Agility Pack, but once you get the hang of the tools, you can get exactly what you need at a fraction of the performance cost, and probably in just as few lines of code.
I would suggest using the HTML Agility Pack to process / manipulate HTML.
You can easily select a body as follows:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
document.DocumentNode.SelectSingleNode("//body")

Alternatively, here is the simplest/fastest (and least accurate) method:
int start = response.IndexOf("<body", StringComparison.CurrentCultureIgnoreCase);
int end = response.LastIndexOf("</body>", StringComparison.CurrentCultureIgnoreCase);
return response.Substring(start, end - start + "</body>".Length);

Obviously, if there is JavaScript in the HEAD tag, such as...

document.write("<body>");

...then you will end up with a little more than you wanted.
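A slightly more defensive sketch of the same substring idea, under my own assumptions (the `BodyExtractor` class and its regex are mine, not from the answer above): match the opening <body ...> tag with a regex so that attributes on the tag are handled, return only the inner HTML, and fall back to the whole input when no body tags are found.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Hypothetical helper, not part of the original answer: returns the inner
    // HTML of the <body> element, or the whole input if no <body> is present.
    public static string ExtractBody(string html)
    {
        // Match the opening tag even when it carries attributes, e.g. <body class="x">
        Match open = Regex.Match(html, @"<body\b[^>]*>", RegexOptions.IgnoreCase);
        if (!open.Success)
            return html;

        int start = open.Index + open.Length;
        int end = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (end < start)
            return html;

        return html.Substring(start, end - start);
    }
}
```

This still inherits the LastIndexOf pitfall if a script emits "</body>" as a string, but it at least skips over attributes on the opening tag and excludes the tags themselves from the result.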