Scraping data dynamically generated by JavaScript in an HTML document using C#

Question


How can I scrape data that is dynamically generated by JavaScript in an HTML document using C#?

Using WebRequest and HttpWebResponse from the .NET class library, I can get the entire HTML source as a string, but the data I want is not in that source; it is generated dynamically by JavaScript.

If the data I wanted were already in the source, I could easily extract it with regular expressions.

I downloaded the HtmlAgilityPack, but I don't know whether it handles elements that are generated dynamically by JavaScript...

Many thanks!

+8

javascript dom html http c#

user3213711 Jun 09 '14 at 23:31


2 answers

You could take a look at a tool like Selenium for scraping pages that rely on JavaScript.

http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
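
As a rough illustration of the Selenium route, here is a minimal sketch. It assumes the Selenium.WebDriver NuGet package and a chromedriver matching the installed Chrome; the linked article uses PhantomJS instead, so headless Chrome is a substitution made here. The URL is a placeholder and the class name is hypothetical.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical example class; not taken from the linked article.
class RenderedPageScraper
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            // Placeholder URL; replace with the page you want to scrape.
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // Unlike a plain WebRequest, the browser has run the page's scripts,
            // so the serialized page normally includes script-generated elements.
            string renderedHtml = driver.PageSource;
            Console.WriteLine("Rendered HTML length: " + renderedHtml.Length);

            // Query the live DOM directly instead of parsing the string.
            foreach (IWebElement div in driver.FindElements(By.TagName("div")))
            {
                Console.WriteLine(div.Text);
            }
        }
    }
}
```

If the page fills in its data asynchronously after loading, the elements may not exist yet when FindElements runs, so some form of waiting is usually needed before querying.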

+4

vikramsk Jun 10 '14 at 4:48


Pandepic · Accepted Answer · 2014-06-10T04:26:38+0000

When you make a WebRequest, you are asking the server for the page file. That file has not yet been parsed or executed by a web browser, so the JavaScript on it has not done anything yet.

You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx

The WebBrowser control can navigate to and load the page, and then you can query its DOM, which will have been modified by the JavaScript on the page.

EDIT (example):

    Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

    webBrowserControl.AllowNavigation = true;
    // optional, but I use this because it stops JavaScript errors from breaking your scraper
    webBrowserControl.ScriptErrorsSuppressed = true;
    // you want to start scraping after the document has finished loading,
    // so do it in the function you pass to this handler
    webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
    webBrowserControl.Navigate(uri);

    private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");

        foreach (HtmlElement div in divs)
        {
            // do something
        }
    }
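
Since the question mentions HtmlAgilityPack, one possible way to combine the two is to hand the rendered markup from the WebBrowser control to HtmlAgilityPack once loading has finished. The following is only a sketch under that assumption: the helper class and method names are hypothetical, and it requires the HtmlAgilityPack package plus a Windows Forms project.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper; call it from the DocumentCompleted handler,
// e.g. RenderedDomHelper.PrintDivText(webBrowserControl);
static class RenderedDomHelper
{
    public static void PrintDivText(WebBrowser browser)
    {
        // Document or Body can be null if nothing has loaded yet.
        if (browser.Document == null || browser.Document.Body == null)
        {
            return;
        }

        // OuterHtml serializes the DOM as it currently stands,
        // i.e. after the page's scripts have modified it.
        string renderedHtml = browser.Document.Body.OuterHtml;

        // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(renderedHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div");
        if (divs == null)
        {
            return;
        }

        foreach (HtmlAgilityPack.HtmlNode div in divs)
        {
            Console.WriteLine(div.InnerText.Trim());
        }
    }
}
```

HtmlAgilityPack only parses the string it is given and never runs scripts itself, which is why the markup is taken from the browser control rather than from a WebRequest.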
