How to scrape a page containing data updated with JavaScript after the page loads?

I am trying to scrape a page. Everything works, but when the values are updated, the page source stays unchanged for about a minute. Even when I refresh the page on a slow Internet connection, I first see the old data, and only after the page has fully loaded do I see the current values. I think JavaScript is updating them, but I still need to download them.

How to get current values?

I am writing my program in C#, but if you have ideas, tips, or examples, the language really doesn't matter.

Thanks.

+4
3 answers

You're right - JavaScript probably updates the data after the page loads.

I can think of three ways to handle this:

  • Use a WebBrowser control - I assume you are using an HttpWebRequest object to retrieve values from the site. That will not work when you need JavaScript to run. You can use the WebBrowser control to let the JavaScript execute and then read the values from the DOM (see the first sketch after this list). The only thing I don't like about this approach is that it feels like a hack and is probably too awkward for production applications. You also need to know when to read the contents of the DOM, since the update can happen in the background. Google "C# WebBrowser Control Read DOM Programmatically" or you can find out more about it here.

  • I personally prefer this over the previous option, but it doesn't always work. First inspect the site with Firebug or a similar tool and see which URLs are being called in the background. Say, for example, a site updates stock quotes with JavaScript; most likely it uses an asynchronous request to fetch the updated information from a web service. In Firebug you can see this under Net > XHR. Now comes the hard part: look at the request and the values it returns. The idea is that you can fetch those values yourself and parse the response, which can be a lot easier than scraping the page (see the second sketch after this list). The drawback is that you need to do a bit of reverse engineering to get everything right, and you may also run into problems with authentication and/or encryption.

  • Finally, my most preferred solution: ask the owner of the site you are scraping for the data directly.
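
A minimal sketch of the first approach (the WebBrowser control), assuming a Windows Forms project. The URL and the element id are placeholders, and the DocumentCompleted event is used to decide when the JavaScript-rendered DOM can be read:

    using System;
    using System.Windows.Forms;

    class DomScraper
    {
        [STAThread] // the WebBrowser control requires an STA thread
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (sender, e) =>
            {
                // Placeholder id - replace with the element that actually holds the data.
                // If the site keeps updating values in the background, you may still
                // need a delay or a timer before reading.
                var element = browser.Document.GetElementById("quote-value");
                if (element != null)
                {
                    Console.WriteLine(element.InnerText);
                }
                Application.ExitThread();
            };
            browser.Navigate("http://example.com/quotes"); // placeholder URL
            Application.Run(); // keep the message loop alive so the event can fire
        }
    }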
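
And a sketch of the second approach, calling the background endpoint directly. The URL, query string, and headers below are invented for illustration; the real ones come from whatever Firebug shows under Net > XHR:

    using System;
    using System.IO;
    using System.Net;

    class XhrScraper
    {
        static void Main()
        {
            // Hypothetical endpoint discovered in Firebug's Net > XHR panel.
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/api/quotes?symbol=MSFT");
            request.Method = "GET";
            // Some services check these headers, so mimic the browser's XHR call.
            request.Accept = "application/json";
            request.Headers["X-Requested-With"] = "XMLHttpRequest";

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                // The raw response (often JSON or XML) is usually far easier to parse
                // than the rendered HTML page.
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }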

+2

There are tools for this that automate the web browser through C#: iMacros Scripting Edition or WatiN. iMacros is easier to use, but WatiN is free. Both have a large community of users.
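
A rough sketch of what driving the browser with WatiN might look like; it assumes the WatiN.Core assembly is referenced and uses a placeholder URL (check the documentation of the version you install, since the API may differ):

    using System;
    using WatiN.Core;

    class WatinExample
    {
        [STAThread] // WatiN's IE automation requires an STA thread
        static void Main()
        {
            using (var browser = new IE("http://example.com/quotes")) // placeholder URL
            {
                browser.WaitForComplete();        // wait for the page and its scripts to finish
                Console.WriteLine(browser.Html);  // rendered HTML, including JavaScript-updated values
            }
        }
    }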

0

I think the WebBrowser control approach is probably good and doesn't depend on third-party libraries. Here is what I intend to use; it solves the problem of waiting for the page to load:

    private string ReadPage(string link)
    {
        // wbrwPages is a WebBrowser control on the form.
        this.wbrwPages.Navigate(link);

        // Pump the message loop until the browser reports the document is fully loaded.
        while (this.wbrwPages.ReadyState != WebBrowserReadyState.Complete)
        {
            Application.DoEvents();
        }

        return this.wbrwPages.DocumentText;
    }

I will get the information from the HTML via the DOM or XPath. I am curious whether others have comments about the "while" loop and about relying on the Complete ready state to exit it. I may also add a timer there, just to be safe.
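
For what it's worth, a hedged alternative that avoids the busy-wait is to subscribe to the control's DocumentCompleted event and read the document there. This assumes wbrwPages is the same WebBrowser control as above, declared in the form's designer file:

    using System.Windows.Forms;

    public partial class PagesForm : Form
    {
        private void StartRead(string link)
        {
            // Fires once the document has finished loading; no DoEvents loop is needed.
            this.wbrwPages.DocumentCompleted += (sender, e) =>
            {
                string html = this.wbrwPages.DocumentText;
                // ... parse html or walk this.wbrwPages.Document here ...
            };
            this.wbrwPages.Navigate(link);
        }
    }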

0
