How to manage the "pool" of PhantomJS instances

I am planning a web service for my own internal use, which takes a single argument, a URL, and returns HTML representing the resolved DOM from that URL. By "resolved", I mean the web service will first fetch the page at that URL, then use PhantomJS to render the page, and then return the resulting source after all DHTML, AJAX calls, etc. have completed. However, spinning up PhantomJS for each request (which I am doing now) is far too slow. I would rather keep a pool of PhantomJS instances, one of which is always available to serve the next call to my web service.

Has any work been done along these lines before? I would rather base this web service on others' work than write a pool manager / HTTP proxy myself from scratch.

For more context: I have listed two similar projects I have looked at so far below, and why I passed on each of them, which is what raises this question of managing a pool of PhantomJS instances.

jsdom - from what I've seen, it has great functionality for executing scripts against a page, but it does not try to replicate browser behavior, so if I used it as a general-purpose "DOM resolver" I would end up doing a lot of extra coding to handle edge cases, fire events, and so on. In the first example I saw, you had to manually invoke the body tag's onload() handler in the test app I set up using Node. It looked like the start of a deep rabbit hole.

Selenium - it simply has many more moving parts, so setting up a pool to manage long-lived browser instances would be harder than with PhantomJS. I don't need any of its macro/script recording features. I just want a web service that fetches a web page and resolves its DOM as faithfully as if I were viewing that URL in a browser (or even faster, if I can make it ignore images, etc.).

+65
web-scraping phantomjs jsdom
6 answers

I set up the PhantomJs Cloud service, and it does pretty much exactly what you are asking for. It took me about 5 weeks of work to get it going.

The biggest problem you'll encounter is the well-known memory leak in PhantomJs. The way I worked around it was to recycle my instances every 50 or so calls.
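The recycling idea can be sketched like this. This is a minimal illustration, not PhantomJsCloud's actual code: `spawnPhantom` is a hypothetical factory standing in for whatever interop layer you use to start a PhantomJS child process, and the threshold of 50 comes from the answer above.

```javascript
// Sketch: recycle a PhantomJS child process every N calls to work around
// the memory leak. `spawnPhantom` is a hypothetical factory that starts
// one PhantomJS instance and returns a handle with a kill() method.
function RecyclingInstance(spawnPhantom, maxCalls) {
  this.spawn = spawnPhantom;
  this.maxCalls = maxCalls;
  this.calls = 0;
  this.proc = spawnPhantom(); // start the first instance eagerly
}

RecyclingInstance.prototype.acquire = function () {
  if (this.calls >= this.maxCalls) { // hit the limit: kill and respawn
    if (this.proc.kill) this.proc.kill();
    this.proc = this.spawn();
    this.calls = 0;
  }
  this.calls++;
  return this.proc;
};
```

Each pool slot would wrap one `RecyclingInstance`, so a leaky process never lives longer than `maxCalls` requests.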

The second biggest problem you'll encounter is that per-page processing is very CPU- and memory-intensive, so you can only run about 4 instances per CPU.

The third biggest issue you'll encounter is that PhantomJs is pretty awful with events and page redirects. You will be told that your page has finished rendering before it actually has. There are several ways to deal with this, but nothing "standard", unfortunately.
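One common workaround for the "rendered too early" problem is plain timer-based polling: after the load event fires, keep re-checking some readiness condition (e.g. a selector appearing, or no pending AJAX calls) until it holds or a timeout elapses. A minimal sketch of such a helper, assuming you supply the `check` predicate yourself:

```javascript
// Sketch: poll `check()` until it returns true, then call onReady(null);
// give up with an error after timeoutMs. Runs in both Node and PhantomJS
// scripts since it only uses standard timers.
function waitFor(check, onReady, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady(null); // condition met: page can be treated as rendered
    } else if (Date.now() - start > timeoutMs) {
      clearInterval(timer);
      onReady(new Error('timed out waiting for page'));
    }
  }, intervalMs || 50);
}
```

Inside a PhantomJS script, `check` would typically call `page.evaluate` to inspect the DOM; what counts as "ready" is necessarily page-specific, which is why no standard solution exists.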

The fourth biggest problem you'll have to deal with is the interop between Node.js and PhantomJs; fortunately there are plenty of npm packages that tackle this problem to choose from.

So, I know I am biased (since I wrote the solution I am about to suggest), but I suggest you check out PhantomJsCloud.com, which is free for light usage.

January 2015 update: another (5th) issue I encountered is how to pass the request/response between the manager/load-balancer and the PhantomJs instances. I initially used PhantomJs's built-in HTTP server, but kept running into its limitations, especially regarding maximum response size. I ended up writing the request/response to the local filesystem as the communication channel. Total time spent implementing the service is perhaps 20 person-weeks, possibly 1000 hours of work. And FYI, I am doing a full rewrite for the next version... (in progress).

+61
Oct 28 '13 at 4:30

The async JavaScript library runs in Node and has a queue function, which is very convenient for this kind of thing:

queue(worker, concurrency)

Creates a queue object with the specified concurrency. Tasks added to the queue are processed in parallel (up to the concurrency limit). If all workers are busy, the task is queued until one becomes available. Once a worker completes a task, that task's callback is called.

Some pseudocode:

    var async = require('async');        // npm install async
    var express = require('express');    // assumed: the route below uses Express
    var app = express();

    function getSourceViaPhantomJs(url, callback) {
      // placeholder: run PhantomJS against `url` and collect the rendered HTML
      var resultingHtml = someMagicPhantomJsStuff(url);
      callback(null, resultingHtml);
    }

    var q = async.queue(function (task, callback) {
      // delegate to a function that calls callback(err, resultingHtml) when done
      getSourceViaPhantomJs(task.url, callback);
    }, 5); // up to 5 PhantomJS calls at a time

    app.get('/some/url', function (req, res) {
      q.push({url: req.query.url_to_scrape}, function (err, results) {
        res.end(results);
      });
    });

See the full documentation for queue in the project's README.

+17
Apr 18 '12 at 5:21

For my master's thesis, I developed the phantomjs-pool library, which does exactly that. It lets you define jobs that are then mapped onto PhantomJS workers. The library handles job distribution, communication, error handling, logging, restarting, and a few other things. It has been used successfully to crawl over a million pages.

Example:

The following code searches Google for the numbers 0 through 9 and saves a screenshot of each page as googleX.png. Four pages are crawled in parallel (because four workers are created). The script is started with node master.js.

master.js (works in Node.js environment)

    var Pool = require('phantomjs-pool').Pool;

    var pool = new Pool({                         // create a pool
        numWorkers : 4,                           // with 4 workers
        jobCallback : jobCallback,
        workerFile : __dirname + '/worker.js',    // location of the worker file
        // either provide the location of the binary or install phantomjs or phantomjs2 (via npm)
        phantomjsBinary : __dirname + '/path/to/phantomjs_binary'
    });

    pool.start();

    function jobCallback(job, worker, index) {    // called to create a single job
        if (index < 10) {                         // index is counted up for each job automatically
            job(index, function(err) {            // create the job with index as data
                console.log('DONE: ' + index);    // log that the job was done
            });
        } else {
            job(null);                            // no more jobs
        }
    }

worker.js (works in PhantomJS environment)

    var webpage = require('webpage');

    module.exports = function(data, done, worker) { // data provided by the master
        var page = webpage.create();

        // search for the given data (which contains the index number) and save a screenshot
        page.open('https://www.google.com/search?q=' + data, function() {
            page.render('google' + data + '.png');
            done(); // signal that the job was executed
        });
    };
+14
Dec 02 '15 at 19:07

As an alternative to @JasonS's excellent answer, you can try PhearJS, which I created. PhearJS is a supervisor for PhantomJS instances written in Node.js that provides an API over HTTP. It is available as open source on GitHub.

+5
Mar 25 '15 at 11:22

If you use Node.js, why not use selenium-webdriver?

  • run several PhantomJS instances as WebDriver servers: phantomjs --webdriver=port_number
  • for each PhantomJS instance, create a PhantomInstance:

     function PhantomInstance(port) {
         this.port = port;
     }

     PhantomInstance.prototype.getDriver = function() {
         var self = this;
         var driver = new webdriver.Builder()
             .forBrowser('phantomjs')
             .usingServer('http://localhost:' + self.port)
             .build();
         return driver;
     };

    and put all of them into one array: [phantomInstance1, phantomInstance2]

  • create a dispatcher.js that takes a free phantomInstance from the array and calls:

     var driver = phantomInstance.getDriver(); 
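The dispatcher step in the list above can be sketched as a tiny checkout/check-in pool. This is a hypothetical illustration of the idea, not a published library: free instances sit in an array, callers either get one immediately or wait in a queue until one is released.

```javascript
// Sketch: hand out free PhantomInstance objects, queueing callers when
// none are available and serving them again on release().
function Dispatcher(instances) {
  this.free = instances.slice(); // e.g. [phantomInstance1, phantomInstance2]
  this.waiting = [];             // callbacks parked until an instance frees up
}

Dispatcher.prototype.acquire = function (cb) {
  if (this.free.length > 0) cb(this.free.shift());
  else this.waiting.push(cb); // queue until an instance is released
};

Dispatcher.prototype.release = function (instance) {
  var next = this.waiting.shift();
  if (next) next(instance);   // hand straight to the next waiter
  else this.free.push(instance);
};
```

A caller would then do `dispatcher.acquire(function (inst) { var driver = inst.getDriver(); ... dispatcher.release(inst); })`, which keeps the number of live WebDriver sessions bounded by the array size.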
+1
Nov 10 '15 at 9:13

If you use Node.js, you can use https://github.com/sgentle/phantomjs-node , which lets you spawn an arbitrary number of PhantomJS processes from your main Node.js process, so you can use async.js and all the other Node goodies.

0
Oct 28


