I am trying to save a couple of web pages using a web crawler. I usually prefer to do this with the Perl WWW::Mechanize module. However, as far as I can tell, the site I am trying to work with relies heavily on JavaScript, which seems difficult to avoid. So I reviewed the following Perl modules:
The Firefox extension MozRepl works fine: in theory, I can use the terminal to navigate the website as shown in the developer's tutorial. However, I know nothing about JavaScript, which makes it hard for me to use these modules correctly.
So here is the page I would like to start with: Morgan Stanley
For each of the companies listed there under "Companies - as of 10/14/2011", I would like to save the corresponding page. For instance, clicking on the first listed company ("1-800-Flowers.com, Inc") calls a JavaScript function with two arguments, dtxt('FLWS.O','2011-10-14'), which generates the desired new page. That page is what I want to save locally.
With the Perl module MozRepl, I thought of something like this:
    use strict;
    use warnings;
    use MozRepl;

    my $repl = MozRepl->new;
    $repl->setup;
    $repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
    $repl->repl_enter({ source => "content" });
    $repl->execute('dtxt("FLWS.O", "2011-10-14")');
Now I would like to save the HTML page this creates.
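One way I imagine this could work (an untested sketch; it assumes the repl is already in the "content" context as above, and that execute() returns the value of the last JavaScript expression as a string) is to read back document.documentElement.innerHTML and write it to a file:

    # Sketch only: read the rendered DOM back through MozRepl and save it.
    # Assumes $repl is in the "content" context and execute() returns the
    # evaluated expression as a string.
    my $html = $repl->execute('document.documentElement.innerHTML');
    open my $fh, '>:encoding(UTF-8)', 'FLWS.O.html'
        or die "Cannot open FLWS.O.html: $!";
    print $fh $html;
    close $fh;

But I am not sure this is the right (or a reliable) way to get at the generated page.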
So, the code I would like to end up with visits each of several firms on the HTML page and simply saves the resulting web page. (For example, three firms: MMM.N, FLWS.O, SSRX.O.)
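Putting it together, this is roughly what I have in mind (again untested; the sleep-based waits are a crude assumption about how long loading and rendering take, and the file-name cleanup is just illustrative):

    use strict;
    use warnings;
    use MozRepl;

    # Sketch: loop over the ticker symbols, call the page's dtxt()
    # function for each, wait for the JavaScript to render, then save
    # the resulting DOM to a local file.
    my @tickers = ('MMM.N', 'FLWS.O', 'SSRX.O');
    my $date    = '2011-10-14';

    my $repl = MozRepl->new;
    $repl->setup;
    $repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
    sleep 10;                                 # crude wait for the page to load
    $repl->repl_enter({ source => 'content' });

    for my $ticker (@tickers) {
        $repl->execute(qq{dtxt("$ticker", "$date")});
        sleep 5;                              # crude wait for the JavaScript to finish
        my $html = $repl->execute('document.documentElement.innerHTML');
        (my $file = $ticker) =~ s/[^\w.]/_/g; # make a safe file name
        open my $fh, '>:encoding(UTF-8)', "$file.html"
            or die "Cannot open $file.html: $!";
        print $fh $html;
        close $fh;
    }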
- Am I right that I cannot get around the page's JavaScript, and therefore cannot use WWW::Mechanize?
- Next question: are the Perl modules mentioned above a plausible approach?
- And finally, if the answer to the first two questions is yes, it would be very nice if you could help me with the actual code. For instance, in the code above, the essential missing part is the 'save' command. (Perhaps using Firefox's saveDocument function?)