How to retrieve dynamically generated HTML using .NET WebBrowser or mshtml.HTMLDocument?

Most of the answers I read regarding this topic point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library.

The WebBrowser class has not gotten me anywhere. The following code fails to retrieve the HTML as rendered by my web browser:

    using System;
    using System.Windows.Forms;
    using mshtml;

    static class Program
    {
        [STAThread]
        public static void Main()
        {
            WebBrowser wb = new WebBrowser();
            wb.Navigate("https://www.google.com/#q=where+am+i");

            wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
            {
                // dump the outer HTML of every element in the loaded document
                mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
                foreach (IHTMLElement element in doc.all)
                {
                    System.Diagnostics.Debug.WriteLine(element.outerHTML);
                }
            };

            Form f = new Form();
            f.Controls.Add(wb);
            Application.Run(f);
        }
    }

The above is just an example. I'm not really interested in a workaround for finding out the name of the city where I am located. I just need to understand how to retrieve this kind of dynamically generated data programmatically.

(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search it for the name of the city where you are currently located, and let me know if you were able to find it.)

But when I access https://www.google.com/#q=where+am+i from my web browser (e.g. Firefox), I see the name of my city written on the web page. In Firefox, if I right-click the city name and select "Inspect Element (Q)", I can clearly see the city name in the HTML, which turns out to be quite different from the raw HTML returned by WebClient.

After I got tired of playing around with WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:

    using mshtml;

    static class Program
    {
        public static void Main()
        {
            // feed the raw downloaded HTML into an in-memory HTMLDocument
            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
            doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));

            foreach (IHTMLElement e in doc.all)
            {
                System.Diagnostics.Debug.WriteLine(e.outerHTML);
            }
        }
    }

I suppose there should be an elegant way to get this kind of information. Right now, all I can think of is to add a WebBrowser control to a form, have it navigate to the URL in question, send the keys "CTRL+A", copy whatever is displayed on the page to the clipboard, and try to parse it. That is a terrible solution, though.

+11
javascript html c# webbrowser-control
Jan 05 '14 at 5:23
2 answers

I would like to contribute some code to Alexey's answer. A few points:

  • Strictly speaking, it is not always possible to determine with 100% certainty when a page has finished rendering. Some pages are quite complex and use continuous AJAX updates. But we can get pretty close by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That is what LoadDynamicPage does below.

  • Some timeout logic should be layered on top of that, in case the page rendering never ends (note the CancellationTokenSource).

  • Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.

  • It is important to enable HTML5 rendering using Browser Feature Control, since WebBrowser runs in IE7 emulation mode by default. That is what SetFeatureBrowserEmulation does below.

  • This is a WinForms application, but the concept can easily be converted into a console application.

  • This logic works well with the URL you specifically mentioned: https://www.google.com/#q=where+am+i.

    using Microsoft.Win32;
    using System;
    using System.ComponentModel;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Windows.Forms;

    namespace WbFetchPage
    {
        public partial class MainForm : Form
        {
            public MainForm()
            {
                SetFeatureBrowserEmulation();
                InitializeComponent();
                this.Load += MainForm_Load;
            }

            // start the task
            async void MainForm_Load(object sender, EventArgs e)
            {
                try
                {
                    var cts = new CancellationTokenSource(10000); // cancel in 10s
                    var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                    // show only the beginning, the full HTML is too long
                    MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message);
                }
            }

            // navigate and download
            async Task<string> LoadDynamicPage(string url, CancellationToken token)
            {
                // navigate and await DocumentCompleted
                var tcs = new TaskCompletionSource<bool>();
                WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                    tcs.TrySetResult(true);

                using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
                {
                    this.webBrowser.DocumentCompleted += handler;
                    try
                    {
                        this.webBrowser.Navigate(url);
                        await tcs.Task; // wait for DocumentCompleted
                    }
                    finally
                    {
                        this.webBrowser.DocumentCompleted -= handler;
                    }
                }

                // get the root element
                var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

                // poll the current HTML for changes asynchronously
                var html = documentElement.OuterHtml;
                while (true)
                {
                    // wait asynchronously, this will throw if cancellation requested
                    await Task.Delay(500, token);

                    // continue polling if the WebBrowser is still busy
                    if (this.webBrowser.IsBusy)
                        continue;

                    var htmlNow = documentElement.OuterHtml;
                    if (html == htmlNow)
                        break; // no changes detected, end the poll loop

                    html = htmlNow;
                }

                // consider the page fully rendered
                token.ThrowIfCancellationRequested();
                return html;
            }

            // enable HTML5 (assuming we're running IE10+)
            // more info: https://stackoverflow.com/a/18333982/1768303
            static void SetFeatureBrowserEmulation()
            {
                if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                    return;
                var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
                Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                    appName, 10000, RegistryValueKind.DWord);
            }
        }
    }
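
As for the console application mentioned in the list above, here is a rough sketch of one way it might look, not a tested port of the form code. The WebBrowser control needs an STA thread with a message loop, so the sketch creates a dedicated thread and runs Application.Run() on it. The names ConsoleFetch and FetchOnStaThread are hypothetical. For brevity it grabs the HTML as soon as DocumentCompleted fires; the polling and timeout logic of LoadDynamicPage would still be needed at that point to capture content generated later by script.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the WinForms sample above.
static class ConsoleFetch
{
    // Hosts a WebBrowser on its own STA thread and returns the HTML
    // available when the first DocumentCompleted event fires.
    public static string FetchOnStaThread(string url)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (s, e) =>
                {
                    var roots = browser.Document.GetElementsByTagName("html");
                    if (roots.Count > 0)
                        html = roots[0].OuterHtml;

                    // Ends the message loop started by Application.Run() below.
                    Application.ExitThread();
                };

                browser.Navigate(url);

                // The control only raises its events while a message loop is running.
                Application.Run();
            }
        });

        // WebBrowser wraps an ActiveX control, which requires an STA thread.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        return html;
    }
}
```

From Main, this could be called as Console.WriteLine(ConsoleFetch.FetchOnStaThread("https://www.google.com/#q=where+am+i")); the project would need a reference to System.Windows.Forms.
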
+17
Jan 05 '14 at 14:10

Your WebBrowser code looks reasonable: wait for something, then grab the current content. Unfortunately, there is no official "I'm done executing JavaScript, feel free to steal the content" notification from either the browser or JavaScript.

Some kind of active wait (not Sleep, but a Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS), you will have the same problem.
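
To illustrate what a Timer-based wait could look like, here is a minimal sketch under stated assumptions rather than a definitive implementation. It uses System.Windows.Forms.Timer so the ticks arrive on the UI thread, and it treats the page as settled once two consecutive snapshots of the root element are identical. SnapshotTracker, TimerPolling and WhenHtmlSettles are hypothetical names, and the sketch has no timeout, so a page that never stops changing would keep the timer running.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper: remembers the previous snapshot and reports
// whether the newest one is identical to it.
sealed class SnapshotTracker
{
    private string last;

    public bool IsUnchanged(string snapshot)
    {
        bool unchanged = last != null && snapshot == last;
        last = snapshot;
        return unchanged;
    }
}

// Hypothetical helper: polls a WebBrowser from a WinForms Timer.
static class TimerPolling
{
    public static void WhenHtmlSettles(WebBrowser browser, int intervalMs, Action<string> onSettled)
    {
        var tracker = new SnapshotTracker();
        var timer = new Timer { Interval = intervalMs };

        timer.Tick += (s, e) =>
        {
            // Skip this tick while the control is still loading.
            if (browser.IsBusy || browser.Document == null)
                return;

            var roots = browser.Document.GetElementsByTagName("html");
            if (roots.Count == 0)
                return;

            string html = roots[0].OuterHtml;
            if (!tracker.IsUnchanged(html))
                return; // first snapshot, or the page changed since the last tick

            // Two identical snapshots in a row: stop polling and report.
            timer.Stop();
            timer.Dispose();
            onSettled(html);
        };

        timer.Start();
    }
}
```

A call such as TimerPolling.WhenHtmlSettles(wb, 500, html => System.Diagnostics.Debug.WriteLine(html)); could go inside a DocumentCompleted handler. How long to wait between ticks, and how many unchanged snapshots to require, depends on the page.
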

+6
Jan 05 '14 at 5:33


