Web crawler: using the Perl MozRepl module to work with JavaScript

I am trying to save a couple of web pages using a web crawler. I usually prefer to do this with the Perl WWW::Mechanize module. However, as far as I can tell, the site I am working with relies heavily on JavaScript, which seems difficult to avoid. So I reviewed the following Perl modules.

The Firefox extension MozRepl works fine: I can use the terminal to navigate the website as shown in the developer's tutorial, at least in theory. However, I know nothing about JavaScript, so it is difficult for me to use the modules correctly.

So here is the source that I would like to start with: Morgan Stanley

For the companies listed under "Companies - as of 10/14/2011", I would like to save their respective pages. For instance, clicking on the first listed company ("1-800-Flowers.com, Inc") calls the JavaScript function dtxt('FLWS.O','2011-10-14') with two arguments, which creates the desired new page. This is the page I now want to save locally.

With the Perl MozRepl module, I thought of something like this:

    use strict;
    use warnings;
    use MozRepl;

    my $repl = MozRepl->new;
    $repl->setup;
    $repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
    $repl->repl_enter({ source => "content" });
    $repl->execute('dtxt("FLWS.O", "2011-10-14")');

Now I would like to save the generated HTML page.
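One way the missing save step might look: pull the serialized DOM back into Perl and write it to a file. This is only a sketch under several assumptions: Firefox must be running with the MozRepl extension listening, the `sleep` is a crude stand-in for waiting until the page has loaded, and whether `execute` returns the expression's value as a plain string this way is something to verify against the MozRepl documentation.

```perl
use strict;
use warnings;
use MozRepl;

# Assumes Firefox is running with the MozRepl extension on its default port.
my $repl = MozRepl->new;
$repl->setup;

$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
$repl->repl_enter({ source => "content" });
$repl->execute('dtxt("FLWS.O", "2011-10-14")');

sleep 5;    # crude wait for the new page to finish loading

# Serialize the current document back into Perl as a string.
# (That execute() returns the expression's value is an assumption here.)
my $html = $repl->execute('document.documentElement.innerHTML');

# Write the page to a local file; the file name is my own choice.
open my $fh, '>', 'FLWS.O.html' or die "cannot write FLWS.O.html: $!";
print {$fh} $html;
close $fh;
```

For the three firms, the `execute('dtxt(...)')`, wait, and save steps could then be wrapped in a loop over the ticker symbols.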

So what I would like my code to do is visit a couple of the firms on the page and simply save each resulting web page. (For example, take these three firms: MMM.N, FLWS.O, SSRX.O.)

  • Is it right that I cannot get around the page's JavaScript function and therefore cannot use WWW::Mechanize?
  • Are the mentioned Perl modules a plausible approach?
  • And finally, if the answer to the first two questions is yes, it would be very nice if you could help me with the actual code. For instance, in the above code the essential part that is missing is the 'save' command. (Perhaps using Firefox's saveDocument function?)
1 answer

The web works through HTTP requests and responses.

If you can figure out the right request to send, you will get the right response back.

If the target site uses JS to build the request, you can either run the JS, or analyze what it does so that you can do the same thing in the language you are using.

It is even easier to use a tool that captures the resulting request for you, whether or not it was created by JS; then you can write your scraping code to produce the same request.

The AT&T Web Scraping Proxy is such a tool.

You set it up, then browse the site as usual until you reach the page you want to scrape, and the WSP will log all requests and responses for you.

It writes them out in the form of Perl code, which you can then modify to suit your needs.
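Once the proxy has shown you the request, replaying it no longer needs a browser at all. A hedged sketch using LWP::UserAgent: the URL is the one from the question, but the endpoint and parameter names here are hypothetical placeholders; the real ones must be copied from what the proxy actually logged.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

# Hypothetical request: the real endpoint, method, and parameter names
# must be taken from the requests the proxy recorded.
my $response = $ua->post(
    'http://www.morganstanley.com/eqr/disclosures/webapp/coverage',
    { symbol => 'FLWS.O', date => '2011-10-14' },
);

die 'Request failed: ', $response->status_line
    unless $response->is_success;

# Save the response body, just as the browser's "save page" would.
open my $fh, '>', 'FLWS.O.html' or die "cannot write FLWS.O.html: $!";
print {$fh} $response->decoded_content;
close $fh;
```

The advantage of this approach over driving Firefox is that it is trivial to loop over the three ticker symbols and save one file per firm.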

