Save and render a web page using PhantomJS and node.js

I am looking for an example of requesting a web page, waiting for its JavaScript to run (the JavaScript modifies the DOM), and then capturing the resulting HTML of the page.

This should be a simple example, since it is an obvious use case for PhantomJS. I can't find a decent example; the documentation seems to be all about command-line use.

+58
javascript html web-scraping phantomjs
Apr 01 '12 at 18:01
6 answers

From your comments, I think you have 2 options:

  1. Try a phantomjs node module - https://github.com/amir20/phantomjs-node
  2. Run phantomjs as a child process from within node - http://nodejs.org/api/child_process.html
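The second option can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes a `phantomjs` binary is on the PATH and that a script (hypothetically named `render.js`) prints the page HTML to stdout. The helper name `runAndCapture` is made up for illustration.

```javascript
var spawnSync = require('child_process').spawnSync;

// Runs a command synchronously and returns everything it wrote to stdout.
// Throws if the process could not be started or exited with a non-zero code.
function runAndCapture(command, args) {
    var result = spawnSync(command, args, {
        encoding: 'utf8',
        maxBuffer: 50 * 1024 * 1024 // rendered pages can be large
    });
    if (result.error) {
        throw result.error;
    }
    if (result.status !== 0) {
        throw new Error(command + ' exited with code ' + result.status + ': ' + result.stderr);
    }
    return result.stdout;
}

// Hypothetical usage (needs phantomjs installed and a render.js script):
// var html = runAndCapture('phantomjs', ['render.js', 'http://example.com']);
```

The synchronous call keeps the sketch short, but it blocks the node event loop until PhantomJS exits, so it is better suited to one-off scripts than to a server.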

Edit:

It seems that phantomjs itself suggests a child process as the way to interact with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ

Edit:

Example PhantomJS script to get the HTML markup of a page:

    var page = require('webpage').create();
    page.open('http://www.google.com', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // Runs inside the page, after its scripts have executed
            var p = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            console.log(p);
        }
        phantom.exit();
    });
+42
Apr 02 2018-12-12T00:

With v2 of phantomjs-node, it is pretty easy to print the HTML after it has been processed.

    var phantom = require('phantom');

    phantom.create().then(function (ph) {
        ph.createPage().then(function (page) {
            page.open('https://stackoverflow.com/').then(function (status) {
                console.log(status);
                page.property('content').then(function (content) {
                    console.log(content);
                    page.close();
                    ph.exit();
                });
            });
        });
    });

This will show the output as it would be rendered by a browser.

Edit 2019:

You can use async/await:

    const phantom = require('phantom');

    (async function () {
        const instance = await phantom.create();
        const page = await instance.createPage();
        await page.on('onResourceRequested', function (requestData) {
            console.info('Requesting', requestData.url);
        });

        const status = await page.open('https://stackoverflow.com/');
        const content = await page.property('content');
        console.log(content);

        await instance.exit();
    })();

Or, if you just want to try it out, you can use npx:

 npx phantom@latest https://stackoverflow.com/ 
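Since the question title also asks about saving the page, here is a minimal sketch of writing the captured markup to disk. It assumes the HTML is already available as a string, for example the `content` value from the examples above. The helper name `saveHtml` and the output path are made up for illustration, and the `recursive` option needs Node 10.12 or later.

```javascript
var fs = require('fs');
var path = require('path');

// Writes an HTML string to disk as UTF-8, creating the target directory if needed.
// Returns the path that was written.
function saveHtml(filePath, html) {
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    fs.writeFileSync(filePath, html, 'utf8');
    return filePath;
}

// Hypothetical usage with the content captured by PhantomJS:
// saveHtml(path.join('out', 'stackoverflow.html'), content);
```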
+8
Mar 15 '16 at 18:26

In the past I have used two different approaches, including the page.evaluate() method that queries the DOM, which Declan mentions. The other way I have passed information out of the web page is to write it to console.log() from there, and in the PhantomJS script use:

    page.onConsoleMessage = function (msg, line, source) {
        console.log('console [' + source + ':' + line + ']> ' + msg);
    };

I might also catch the msg variable in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then, in the Node.js script, you need to scan the output of the PhantomJS script:

    var spawn = require('child_process').spawn;

    // ...params... and ...args are placeholders for your own values
    var yourfunc = function(...params...) {
        var phantom = spawn('phantomjs', [...args]);
        phantom.stdout.setEncoding('utf8');
        phantom.stdout.on('data', function(data) {
            // parse or echo data
            var str_phantom_output = data.toString();
            // The above will get triggered one or more times, so you'll need to
            // add code to parse for whatever info you're expecting from the browser
        });
        phantom.stderr.on('data', function(data) {
            // do something with error data
        });
        phantom.on('exit', function(code) {
            if (code !== 0) {
                // console.log('phantomjs exited with code ' +code);
            } else {
                // clean exit: do something else such as a passed-in callback
            }
        });
    }
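Because the stdout 'data' handler can fire several times with partial output, one way to recover the encapsulated data is to buffer the chunks and cut out whatever sits between two marker strings. The following is only a sketch under that assumption: the markers `<<DATA>>` and `<<END>>` and the name `createMarkerCollector` are invented here, and the PhantomJS side would have to print the same markers around its payload.

```javascript
// Collects stdout chunks and extracts payloads wrapped in marker strings.
function createMarkerCollector(startMarker, endMarker) {
    var buffer = '';
    return {
        // Feed one chunk; returns an array of payloads completed by this chunk.
        push: function (chunk) {
            buffer += chunk;
            var payloads = [];
            while (true) {
                var start = buffer.indexOf(startMarker);
                if (start === -1) {
                    // Keep only a tail that could still be the beginning of a marker.
                    var keep = startMarker.length - 1;
                    buffer = keep > 0 ? buffer.slice(-keep) : '';
                    break;
                }
                var end = buffer.indexOf(endMarker, start + startMarker.length);
                if (end === -1) {
                    // Payload is incomplete: drop the noise before it and wait for more.
                    buffer = buffer.slice(start);
                    break;
                }
                payloads.push(buffer.slice(start + startMarker.length, end));
                buffer = buffer.slice(end + endMarker.length);
            }
            return payloads;
        }
    };
}

// Hypothetical usage inside the stdout handler:
// var collector = createMarkerCollector('<<DATA>>', '<<END>>');
// phantom.stdout.on('data', function (data) {
//     collector.push(data).forEach(function (payload) { /* handle payload */ });
// });
```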

Hope this helps.

+4
May 31 '12 at 20:21

Why not just use this?

    var page = require('webpage').create();
    page.open("http://example.com", function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            console.log('Success in fetching the page');
            console.log(page.content);
        }
        phantom.exit();
    });
+3
Dec 18 '13 at 16:07

Later update in case someone stumbles upon this question:

A GitHub project developed by a colleague of mine aims to help with exactly this: https://github.com/vmeurisse/phantomCrawl.

It is still quite young and certainly lacks documentation, but the example it provides should be enough to do basic crawling.

+1
Jun 26 '13 at 16:10

Here is an old version that I use, running node, express and phantomjs, which saves the page as a .png. You could tweak it pretty quickly to get the HTML instead.

https://github.com/wehrhaus/sitescrape.git

+1
Apr 26 '14 at 3:18


