HtmlAgilityPack and dynamic content

Hello to all.

I want to create a web scraper application, and I want to do this using the WebBrowser control, HtmlAgilityPack and XPath.

So far I have managed to create an XPath generator (built on the WebBrowser control) that works fine, but sometimes I can’t capture content that is generated dynamically (via JavaScript or AJAX). I also learned that the WebBrowser control (really the IE engine) generates some additional tags, such as "tbody", and HtmlAgilityPack, loaded via `htmlWeb.Load (webBrowser.DocumentStream);`, does not see those either.
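The "tbody" mismatch can be reproduced without a browser. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name `TbodyXPathDemo` and the helpers `Matches` and `RelaxTbody` are hypothetical names introduced here for illustration. It shows that an XPath copied from the browser's DOM (which contains an implied `tbody`) finds nothing in the raw markup, while a version with the `tbody` step relaxed to a descendant step does.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class TbodyXPathDemo
{
    // Returns true if the XPath selects at least one node in the given HTML.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches.
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    // Replaces "/tbody/" or "/tbody[n]/" steps with a descendant step ("//"),
    // so the expression works whether or not the markup contains <tbody>.
    // Assumption: this is acceptable for simple tables; with nested tables
    // the relaxed expression may match more rows than the original.
    public static string RelaxTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?/", "//");
    }
}
```

For example, with the markup `<table><tr><td>42</td></tr></table>`, `Matches(html, "//table/tbody/tr/td")` is false because HtmlAgilityPack does not add the implied `tbody`, whereas `Matches(html, RelaxTbody("//table/tbody/tr/td"))` is true.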

Another note: I found that the following code does capture the current (live) source of the web page, but I could not feed it to HtmlAgilityPack: `(Mshtml.IHTMLDocument3) webBrowser.Document.DomDocument;`

Could you help me? Thanks

+8
c# html-agility-pack dynamic-content
3 answers

I just spent hours trying to get HtmlAgilityPack to pick up some dynamic AJAX content from a web page, going from one useless post to another until I found this one.

The answer is hidden in a comment under the first post, and I thought I should spell it out.

This is the method that I used initially, and it did not work:

    // Downloads the raw HTML over HTTP and parses it. No scripts are executed,
    // so content injected by JavaScript/AJAX is absent from the result.
    private void LoadTraditionalWay(String url)
    {
        WebRequest myWebRequest = WebRequest.Create(url);
        using (WebResponse myWebResponse = myWebRequest.GetResponse())
        using (Stream receiveStream = myWebResponse.GetResponseStream())
        using (TextReader reader = new StreamReader(receiveStream, System.Text.Encoding.GetEncoding("utf-8")))
        {
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(reader);
        }
    }

WebRequest does not run JavaScript, so the AJAX requests that produce the missing content are never issued.

This is the solution that worked:

    private void LoadHtmlWithBrowser(String url)
    {
        webBrowser1.ScriptErrorsSuppressed = true;
        webBrowser1.Navigate(url);
        waitTillLoad(this.webBrowser1);

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        // Take the live DOM (after scripts have run), not the downloaded source.
        var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
        StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
        doc.Load(sr);
    }

    private void waitTillLoad(WebBrowser webBrControl)
    {
        WebBrowserReadyState loadStatus;
        int waittime = 100000;
        int counter = 0;

        // Phase 1: pump messages until navigation has started (the state
        // leaves Complete) or the iteration limit is reached.
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if ((counter > waittime)
                || (loadStatus == WebBrowserReadyState.Uninitialized)
                || (loadStatus == WebBrowserReadyState.Loading)
                || (loadStatus == WebBrowserReadyState.Interactive))
            {
                break;
            }
            counter++;
        }

        // Phase 2: pump messages until the control reports Complete and is idle.
        // There is no timeout here, so if the page never finishes loading,
        // this method never returns.
        while (true)
        {
            loadStatus = webBrControl.ReadyState;
            Application.DoEvents();
            if (loadStatus == WebBrowserReadyState.Complete && !webBrControl.IsBusy)
            {
                break;
            }
        }
    }

The idea is to load the page with the WebBrowser control, which can render AJAX content, wait until the page has fully loaded, and then use the Microsoft.mshtml library to hand the resulting HTML to HtmlAgilityPack for re-parsing.

This was the only way I found to access the dynamic data.

Hope this helps someone

+18

Would Selenium do the trick? As far as I know, it launches real browser instances. That should allow the JavaScript to run and let you get the resulting, manipulated DOM.
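A minimal sketch of that approach, assuming the Selenium.WebDriver and HtmlAgilityPack NuGet packages plus a local Chrome and matching driver are available. The class `RenderedPageLoader`, its methods, and the `readyXPath` parameter are hypothetical names introduced for illustration; the XPath you wait for depends on the page being scraped.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedPageLoader
{
    // Parses an HTML string (for example, a snapshot of the rendered DOM).
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Opens the page in a real browser, waits until an element matching
    // readyXPath exists (assumed to signal that the AJAX content has
    // arrived), then hands the rendered markup to HtmlAgilityPack.
    public static HtmlDocument Load(string url, string readyXPath, TimeSpan timeout)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);

            // Throws WebDriverTimeoutException if the element never appears.
            var wait = new WebDriverWait(driver, timeout);
            wait.Until(d => d.FindElements(By.XPath(readyXPath)).Count > 0);

            // PageSource reflects the DOM after scripts have run.
            return Parse(driver.PageSource);
        }
    }
}
```

Waiting for a specific element, rather than for a generic "page loaded" state, is a design choice here: AJAX content can arrive after the browser already reports the document as complete.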

+1

Load the HTML Agility Pack document in one of the following ways:

    htmlAgilityPackDocument.LoadHtml(this.browser.DocumentText);

OR

    if (this.browser.Document.GetElementsByTagName("html")[0] != null)
        _htmlAgilityPackDocument.LoadHtml(this.browser.Document.GetElementsByTagName("html")[0].OuterHtml);

-4