CasperJS: iterating over a list of links using casper.each

I am trying to use CasperJS to get a list of links from a page, then open each of those links and add a particular type of data from each page to an array object.

The problem I am facing is with the loop that runs over each of the list items.

First I get listOfLinks from the source page. This part works: by checking its length, I can confirm that the list is populated.

However, when I use the this.each loop shown below, none of the console statements ever appear, and CasperJS skips the block entirely.

If I replace this.each with a standard for loop, execution only gets partway through the first link: the "Creating new array in object for x.html" statement appears once, and then the code stops executing. Wrapping the loop body in an IIFE does not change this.

Edit: in verbose debug mode:

    Creating new array object for https://example.com
    [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true

For some reason, the URL that is passed to the thenOpen function gets changed to about:blank...

I feel that there is something about CasperJS's asynchronous nature that I don't understand here, and I would be grateful to be pointed towards a working example.

    casper.then(function () {
        var date = Date.now();
        console.log(date);

        var object = {};
        object[date] = {}; // new object for date

        var listOfLinks = this.evaluate(function(){
            console.log("getting links");
            return document.getElementsByClassName('importantLink');
        });
        console.log(listOfLinks.length);

        this.each(listOfLinks, function(self, link) {
            var eachPageHref = link.href;
            console.log("Creating new array in object for " + eachPageHref);
            object[date][eachPageHref] = []; // array for page to store names

            self.thenOpen(eachPageHref, function () {
                var listOfItems = this.evaluate(function() {
                    var items = [];
                    // Perform DOM manipulation to get items
                    return items;
                });
            });

            object[date][eachPageHref] = items;
        });

        console.log(JSON.stringify(object));
    });
javascript phantomjs casperjs
3 answers

I decided to use Stack Overflow itself as a demo site to run your script against. A few minor things in your code have been fixed, and the result is this exercise in collecting comments from PhantomJS bounty questions.

    var casper = require('casper').create();

    casper
        .start()
        .open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
        .then(function () {
            var date = Date.now(), object = {};
            object[date] = {};

            var listOfLinks = this.evaluate(function(){
                // Getting links to other pages to scrape, this will be
                // a primitive array that will be easily returned from page.evaluate
                var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
                    return link.href;
                });
                return links;
            });

            // Now to iterate over that array of links
            this.each(listOfLinks, function(self, eachPageHref) {
                object[date][eachPageHref] = []; // array for page to store names

                self.thenOpen(eachPageHref, function () {
                    // Getting comments from each page, also as an array
                    var listOfItems = this.evaluate(function() {
                        var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                            return comment.innerText;
                        });
                        return items;
                    });
                    object[date][eachPageHref] = listOfItems;
                });
            });

            // After each link has been scraped, output the resulting object
            this.then(function(){
                console.log(JSON.stringify(object));
            });
        });

    casper.run();

What has changed: page.evaluate now returns plain arrays, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away, inside page.evaluate. There is also this correction:

    object[date][eachPageHref] = listOfItems; // previously assigned `items`, which was undefined in this scope

The result of running the script is:

 {"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}} 
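The other difference in the script, wrapping the final console.log in this.then(), matters because then() and thenOpen() only schedule steps; the scheduled steps run after the current step has finished. The sketch below is a simplified model of that scheduling in plain JavaScript, not CasperJS's actual implementation. StepQueue and the page names a.html and b.html are made up for illustration.

```javascript
// Simplified model of a step queue. This is NOT CasperJS's real code;
// StepQueue is a hypothetical name used only for this illustration.
function StepQueue() {
    this.steps = [];
    this.log = [];
}

// then() only schedules a step; nothing runs until run() reaches it.
StepQueue.prototype.then = function (step) {
    this.steps.push(step);
    return this;
};

// run() executes the scheduled steps one by one, in order.
StepQueue.prototype.run = function () {
    while (this.steps.length > 0) {
        var step = this.steps.shift();
        step.call(this);
    }
};

var queue = new StepQueue();
var results = {};

queue.then(function () {
    ['a.html', 'b.html'].forEach(function (href) {
        results[href] = [];
        // Stand-in for thenOpen(): the work is deferred to a later step
        this.then(function () {
            results[href] = ['item from ' + href];
        });
    }, this);

    // Runs immediately, before any of the deferred steps
    this.log.push('inline: ' + JSON.stringify(results));

    // Runs after the deferred steps, because it is scheduled behind them
    this.then(function () {
        this.log.push('scheduled: ' + JSON.stringify(results));
    });
});

queue.run();
console.log(queue.log.join('\n'));
```

In this model the inline entry still shows empty arrays, while the scheduled entry shows the collected items, which is the same reason the original console.log printed an unfilled object.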

You are returning DOM nodes in the evaluate() function, which is not valid. Instead, you can return the actual URLs.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Link: PhantomJS#evaluate
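To make the rule of thumb concrete, here is a rough sketch in plain JavaScript. It models the evaluate() boundary as a JSON round trip, which is an assumption for illustration and not necessarily PhantomJS's exact mechanism. FakeLink is a hypothetical stand-in for a DOM node, and the example.com URLs are placeholders.

```javascript
// Simplified model of the evaluate() boundary: values are copied, not shared.
// The JSON round trip is an assumption made for this illustration.
function crossBoundary(value) {
    return JSON.parse(JSON.stringify(value));
}

// Hypothetical stand-in for a DOM link: `href` is exposed through an
// accessor, so it is not plain data that can be copied.
function FakeLink(url) {
    Object.defineProperty(this, 'url_', { value: url, enumerable: false });
}
Object.defineProperty(FakeLink.prototype, 'href', {
    get: function () { return this.url_; }
});

// Array-like collection, similar in shape to a getElementsByClassName() result
var nodes = {
    0: new FakeLink('https://example.com/a'),
    1: new FakeLink('https://example.com/b'),
    length: 2
};

// Returning the nodes themselves: the length survives, the hrefs do not
var copiedNodes = crossBoundary(nodes);

// Returning plain strings extracted before crossing the boundary
var hrefs = crossBoundary([].map.call(nodes, function (link) {
    return link.href;
}));

console.log(copiedNodes.length, copiedNodes[0].href); // 2 undefined
console.log(hrefs);
```

In this model the copied collection still reports a length of 2 even though every href is gone, which may be why checking the length alone looked fine in the question.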


If I understand your problem correctly, you can solve it by giving the items array a wider (global) scope. In your code, I would do the following:

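    var items = []; // declared outside the loop so every step can reach it

    this.each(listOfLinks, function(self, link) {
        var eachPageHref = link.href;
        console.log("Creating new array in object for " + eachPageHref);
        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {
            // evaluate() runs in the page context and cannot see the outer
            // `items`, so build the list there and return it
            var pageItems = this.evaluate(function() {
                var found = [];
                // Perform DOM manipulation to get items
                // found.push(whateverThisItemIs);
                return found;
            });
            items = items.concat(pageItems);
        });
    });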
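One caveat when sharing an outer variable like this: the function passed to evaluate() is executed in the page context, so it cannot reach variables from the script's scope. The sketch below imitates that in plain JavaScript by rebuilding the function from its source text. runInSandbox and collect are hypothetical helpers, and this is a simplified model rather than PhantomJS's actual mechanism.

```javascript
// Simplified model: the function is rebuilt from its source text, so it loses
// access to variables from the scope where it was written.
// runInSandbox is a hypothetical helper, not a PhantomJS or CasperJS API.
function runInSandbox(fn) {
    return new Function('return (' + fn.toString() + ')();')();
}

function collect() {
    var items = [];
    var error = null;

    try {
        // Fails: `items` does not exist inside the sandbox
        runInSandbox(function () { items.push('first'); });
    } catch (e) {
        error = e.name;
    }

    // Works: build the list inside, return it, and merge it outside
    var returned = runInSandbox(function () {
        var found = [];
        found.push('first');
        return found;
    });
    items = items.concat(returned);

    return { error: error, items: items };
}

var outcome = collect();
console.log(outcome.error, outcome.items);
```

The same pattern applies to the loop above: keep the shared array outside, and let each evaluate() call return its own plain array to be merged in.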

Hope this helps.
