Crawling links on a page, and then viewing and checking each link using node and zombie.js

Question

I am trying to write a simple utility in Node with zombie.js that visits a page, finds and opens every link on it, and checks that each child page returns a 200 status.

Here is an example (written in CoffeeScript) that crawls the stackoverflow.com homepage:

    Browser = require('zombie')

    browserOpts =
      runScripts: false
      site: 'http://www.stackoverflow.com'

    home = new Browser browserOpts

    home.visit '/', (e, browser) ->
      questions = browser.queryAll '#question-mini-list .summary h3 a'
      for q in questions
        qUrl = q.getAttribute 'href'
        page = new Browser browserOpts
        page.visit qUrl, (e, browser, statusCode, errors) ->
          console.log "Arrived at page #{browser.window.location} and found " + browser.html().length + " bytes"
          console.log statusCode
          browser.dump()
          return
      return

If you run this code, you will see that the first few links load correctly and the byte count of each page is printed.

However, after the first batch of successful page loads (its size seems random), every subsequent page load appears to fire the visit callback prematurely. The document is empty (just <html><head></head><body></body></html>), and the statusCode argument passed to the callback is undefined.

I cannot explain or understand why this is happening. Any advice would be greatly appreciated.

node.js zombie.js

Tharsan Mar 18 '13 at 21:04

1 answer

generalhenry · Answer 1 · 2013-04-12T21:12:39+0000

Sorry for answering a CoffeeScript question in JavaScript.

    var async = require('async');
    var Browser = require('zombie');

    var browserOpts = {
      runScripts: false,
      site: 'http://www.stackoverflow.com'
    };

    var home = new Browser(browserOpts);

    home.visit('/', function (e, browser) {
      var questions = browser.queryAll('#question-mini-list .summary h3 a');

      // Visit at most 3 question pages at a time.
      async.eachLimit(questions, 3, function (question, cb) {
        var qUrl = question.getAttribute('href');
        var page = new Browser(browserOpts);

        page.visit(qUrl, function (e, browser, statusCode, errors) {
          console.log("Arrived at page " + browser.window.location + " and found " + browser.html().length + " bytes");
          console.log(statusCode);
          browser.dump();
          cb(e);
        });
      }, function (err) {
        // Runs once after all pages finish, or as soon as one fails.
        if (err) {
          console.error('OOPS', err);
        }
      });
    });

try here: http://runnable.com/UWh05t96qlJ8AAAC

You are making too many requests at the same time, so Stack Overflow drops the connections. As far as I can tell, the cutoff is 4 concurrent requests.
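To show the idea of capping concurrent requests without any dependencies, here is a minimal sketch in plain Node. The helper `runLimited` and the stand-in `fakeVisit` are hypothetical names made up for this illustration; they are not part of zombie.js or async, and `fakeVisit` only simulates a page load with a timer instead of making a real request. The URLs are placeholders as well.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// `runLimited` is a hypothetical helper, not part of zombie.js or async.
function runLimited(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane keeps pulling the next unclaimed item until none remain.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    lanes.push(lane());
  }
  return Promise.all(lanes).then(() => results);
}

// Stand-in for page.visit (assumption): resolves after a short delay and
// records how many calls overlapped. A real worker would load the URL.
let inFlight = 0;
let maxInFlight = 0;
function fakeVisit(url) {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  return new Promise((resolve) => {
    setTimeout(() => {
      inFlight--;
      resolve({ url: url, statusCode: 200 });
    }, 10);
  });
}

// Placeholder URLs, for illustration only.
const urls = ['/q/1', '/q/2', '/q/3', '/q/4', '/q/5', '/q/6', '/q/7'];

const done = runLimited(urls, 3, fakeVisit).then((results) => {
  console.log('checked', results.length, 'pages; peak concurrency:', maxInFlight);
  return results;
});
```

With a limit of 3, no more than three simulated visits overlap, which keeps the crawl under the apparent cutoff. Swapping `fakeVisit` for a function that wraps a real page load would be the next step, under the same assumption about the limit.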

If you really need data from Stack Overflow, use the API: https://api.stackexchange.com/docs
