HtmlAgilityPack and dynamic content

Question

Hello to all.

I want to create a web scraper application, and I want to build it with the WebBrowser control, HtmlAgilityPack and XPath.

Right now I have managed to create an XPath generator (using the WebBrowser control) that works fine, but sometimes I can't capture content that is generated dynamically (via JavaScript or Ajax). I also found that the WebBrowser control (actually the IE browser) generates some additional tags, such as "tbody", and again HtmlAgilityPack does not see them when I load the page with `htmlWeb.Load (webBrowser.DocumentStream);`.

Another note: I found that the following code does capture the current source of the web page, but I could not feed it to HtmlAgilityPack: `(mshtml.IHTMLDocument3) webBrowser.Document.DomDocument;`

Could you help me? Thanks.

c# html-agility-pack dynamic-content

Chyngyz sydykov Apr 16 '12 at 6:17

3 answers

Nick · Answer 1 · 2014-02-22T14:58:58+0000

I just spent hours trying to get HtmlAgilityPack to read some dynamic Ajax content from a web page, and I went from one useless post to another until I found this one.

The answer is hidden in a comment under the first post, so I thought I should spell it out.

This is the method I used initially, which did not work:

    private void LoadTraditionalWay(String url)
    {
        WebRequest myWebRequest = WebRequest.Create(url);
        WebResponse myWebResponse = myWebRequest.GetResponse();
        Stream ReceiveStream = myWebResponse.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        TextReader reader = new StreamReader(ReceiveStream, encode);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(reader);
        reader.Close();
    }

WebRequest does not run JavaScript, so the Ajax requests that fill in the missing content are never made.

This is the solution that worked:

    // Requires a reference to Microsoft.mshtml.
    private void LoadHtmlWithBrowser(String url)
    {
        webBrowser1.ScriptErrorsSuppressed = true;
        webBrowser1.Navigate(url);
        waitTillLoad(this.webBrowser1);

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        // Read the live DOM (after scripts have run), not the downloaded source.
        var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
        StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
        doc.Load(sr);
    }

    private void waitTillLoad(WebBrowser webBrControl)
    {
        WebBrowserReadyState loadStatus;
        int waittime = 100000;
        int counter = 0;

        // First loop: wait (bounded by waittime) until navigation has started.
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if ((counter > waittime) ||
                (loadStatus == WebBrowserReadyState.Uninitialized) ||
                (loadStatus == WebBrowserReadyState.Loading) ||
                (loadStatus == WebBrowserReadyState.Interactive))
            {
                break;
            }
            counter++;
        }

        // Second loop: wait until the document is complete and the control is idle.
        counter = 0;
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if (loadStatus == WebBrowserReadyState.Complete && webBrControl.IsBusy != true)
            {
                break;
            }
            counter++;
        }
    }

The idea is to load the page with the WebBrowser control, which is able to run the Ajax content, wait until the page is fully rendered, and then use the Microsoft.mshtml library to hand the rendered HTML to HtmlAgilityPack for parsing.

This was the only way I found to access the dynamic data.
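
The same re-parse also addresses the "tbody" problem from the question. Here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name, helper name and both markup strings are made up for illustration. HtmlAgilityPack parses markup as written and does not add an implied tbody, so an XPath generated from the browser's DOM only matches once the browser's own serialized HTML is loaded.

```csharp
using System;
using HtmlAgilityPack;

public static class TbodyDemo
{
    // Hypothetical helper: counts the nodes matching an XPath expression.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Illustrative markup as a server might send it (no tbody).
        string raw = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>";
        // Illustrative outerHTML as a browser might serialize it (tbody added).
        string rendered = "<table><tbody><tr><td>a</td></tr><tr><td>b</td></tr></tbody></table>";

        // XPath generated from the browser DOM, tried against both inputs.
        Console.WriteLine(CountMatches(raw, "//table/tbody/tr"));
        Console.WriteLine(CountMatches(rendered, "//table/tbody/tr"));
        // A descendant-axis XPath that does not depend on tbody being present.
        Console.WriteLine(CountMatches(raw, "//table//tr"));
    }
}
```

If the raw source has to be parsed anyway, an XPath such as //table//tr avoids depending on the tags the browser inserts.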

Hope this helps someone.

Lee Englestone · Answer 2 · 2015-08-06T17:39:27+0000

Selenium should do the trick. As far as I know, it launches real browser instances. That should let the JavaScript run and let you read the resulting, manipulated DOM.
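
A rough sketch of what that could look like in C#, assuming the Selenium.WebDriver and HtmlAgilityPack NuGet packages plus a local Chrome with a matching driver; the class and method names are hypothetical. PageSource is used here on the assumption that the driver returns the current DOM rather than the originally downloaded source, which can vary between drivers.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class SeleniumSketch
{
    // Hypothetical helper: opens the URL in a real browser so its scripts run,
    // then hands the resulting markup to HtmlAgilityPack.
    public static HtmlDocument LoadRendered(string url)
    {
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl(url);

            var doc = new HtmlDocument();
            // Assumption: PageSource reflects the DOM after scripts have run.
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
        finally
        {
            // Always close the browser, even if navigation or parsing fails.
            driver.Quit();
        }
    }
}
```

Content that arrives some time after the initial page load would probably still need an explicit wait before PageSource is read.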

dev · Answer 3 · 2013-03-12T08:48:33+0000

Load the HTML Agility Pack document in one of the following ways.

 htmlAgilityPackDocument.LoadHtml(this.browser.DocumentText);

OR

 if (this.browser.Document.GetElementsByTagName("html")[0] != null)
     _htmlAgilityPackDocument.LoadHtml(this.browser.Document.GetElementsByTagName("html")[0].OuterHtml);
