Quick answer: can you run headless Chrome or Chromium in Firebase Cloud Functions? The answer is no! The Node.js project will not have access to any Chrome/Chromium executables and will therefore fail (TRUST ME - I TRIED!).
The best solution is to use the Phantom npm package, which uses PhantomJS under the hood: https://www.npmjs.com/package/phantom
Documentation and further information can be found here:
http://amirraminfar.com/phantomjs-node/#/
or
https://github.com/amir20/phantomjs-node
The site I was trying to crawl had anti-scraping software in place. The trick is to wait for the page to load by searching for an expected string or matching a regular expression (in my case I matched a regular expression). If you have any difficulties with regular expressions, contact https://AppLogics.uk/ (starting at £5 GBP).
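As a minimal sketch of that check, assuming a hypothetical helper called pageContains (it is not part of the phantom package), the "has the page loaded yet?" test can accept either a plain string or a regular expression:

```typescript
// Hypothetical helper (not part of the phantom package): returns true when
// the HTML contains the expected marker, given as a string or a RegExp.
function pageContains(html: string, marker: string | RegExp): boolean {
    if (typeof marker === 'string') {
        return html.includes(marker);
    }
    // Reset lastIndex so a RegExp with the g or y flag gives a stable answer
    // when the same object is reused across several polls.
    marker.lastIndex = 0;
    return marker.test(html);
}

// Made-up sample markup, purely for illustration.
const sample = '<div class="greeting">Welcome to somewebsite</div>';
console.log(pageContains(sample, 'Welcome to somewebsite')); // true
console.log(pageContains(sample, /<div[^>]*>\s*Welcome to \w+\s*<\/div>/)); // true
```

A regular expression is handy when the marker carries attributes or whitespace that vary between loads; a plain string is enough when the markup is stable.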
Here is a TypeScript snippet for opening an http or https URL:

    // Run inside an async function.
    const phantom = require('phantom');
    const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page: any = await instance.createPage();
    const status = await page.open('https://somewebsite.co.uk/');
    // The rendered HTML of the page, as a string.
    const content = await page.property('content');
And again in JavaScript:

    // Run inside an async function.
    const phantom = require('phantom');
    const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page = await instance.createPage();
    const status = await page.open('https://somewebsite.co.uk/');
    // The rendered HTML of the page, as a string.
    const content = await page.property('content');
That was the easy bit! If it is a static page you are pretty much done, and you can parse the HTML with something like the cheerio npm package: https://github.com/cheeriojs/cheerio, an implementation of core jQuery designed for servers!
However, if it is a dynamically loading page (lazy loading, or even anti-scraping methods), you will need to wait for the page to update by looping, calling page.property('content'), and running a text search or regex to find out whether the page has finished loading.
I created a generic asynchronous function that returns the page content (as a string) on success and throws an exception on failure or timeout. Its parameters are page, text (a search string indicating success), error (a string indicating failure, or null to skip the error check), and timeout (a number of milliseconds, self-explanatory):
TypeScript:

    async function waitForPageToLoadStr(page: any, text: string, error: string | null, timeout: number): Promise<string> {
        // Absolute deadline in milliseconds; null means no timeout.
        const maxTime = timeout ? Date.now() + timeout : null;
        let html: string = await page.property('content');
        async function loop(): Promise<string> {
            async function checkSuccess(): Promise<boolean> {
                html = await page.property('content');
                // Fail fast if the page shows the known error marker.
                if (error != null && html.includes(error)) {
                    throw new Error(`Error string found: ${error}`);
                }
                if (maxTime && Date.now() >= maxTime) {
                    throw new Error(`Timed out waiting for string: ${text}`);
                }
                return html.includes(text);
            }
            // Keep re-reading the content until the success string appears.
            if (await checkSuccess()) {
                return html;
            } else {
                return loop();
            }
        }
        return await loop();
    }
JavaScript:

    async function waitForPageToLoadStr(page, text, error, timeout) {
        // Absolute deadline in milliseconds; null means no timeout.
        const maxTime = timeout ? Date.now() + timeout : null;
        let html = await page.property('content');
        async function loop() {
            async function checkSuccess() {
                html = await page.property('content');
                // Fail fast if the page shows the known error marker.
                if (error != null && html.includes(error)) {
                    throw new Error(`Error string found: ${error}`);
                }
                if (maxTime && Date.now() >= maxTime) {
                    throw new Error(`Timed out waiting for string: ${text}`);
                }
                return html.includes(text);
            }
            // Keep re-reading the content until the success string appears.
            if (await checkSuccess()) {
                return html;
            } else {
                return loop();
            }
        }
        return await loop();
    }
I personally used this function as follows:
TypeScript:

    try {
        const phantom = require('phantom');
        const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page: any = await instance.createPage();
        const status = await page.open('https://somewebsite.co.uk/');
        await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
    } catch (error) {
        console.error(error);
    }
JavaScript:

    try {
        const phantom = require('phantom');
        const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page = await instance.createPage();
        await page.open('https://vehicleenquiry.service.gov.uk/');
        await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
    } catch (error) {
        console.error(error);
    }
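waitForPageToLoadStr only matches plain strings. As a rough sketch of the regex variant mentioned earlier, the function below polls the same page.property('content') call but tests a RegExp and pauses between polls. The names waitForPageToLoadRegex, ContentPage, sleep, makeFakePage and the pollInterval parameter are all assumptions made for this illustration, not part of the phantom package, and the fake page only stands in for a real PhantomJS page so the sketch can be tried without one:

```typescript
// Minimal shape of the page object this sketch relies on: only the
// property('content') call used throughout this answer.
interface ContentPage {
    property(name: string): Promise<string>;
}

// Resolve after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical variant of waitForPageToLoadStr: resolves with the HTML once
// the pattern matches, and rejects once the timeout (in ms) has elapsed.
async function waitForPageToLoadRegex(
    page: ContentPage,
    pattern: RegExp,
    timeout: number,
    pollInterval: number = 100,
): Promise<string> {
    const deadline = Date.now() + timeout;
    for (;;) {
        const html = await page.property('content');
        pattern.lastIndex = 0; // keep g/y-flagged patterns stable across polls
        if (pattern.test(html)) {
            return html;
        }
        if (Date.now() >= deadline) {
            throw new Error(`Timed out waiting for pattern: ${pattern}`);
        }
        await sleep(pollInterval); // pause before reading the content again
    }
}

// Stand-in page for trying the function without PhantomJS: it returns a
// placeholder for the first few reads, then the finished HTML.
function makeFakePage(readsBeforeReady: number): ContentPage {
    let reads = 0;
    return {
        property: async () =>
            ++reads > readsBeforeReady
                ? '<div>Welcome to somewebsite</div>'
                : '<p>Loading</p>',
    };
}

waitForPageToLoadRegex(makeFakePage(2), /<div>Welcome to \w+<\/div>/, 1000, 10)
    .then((html) => console.log('Loaded:', html))
    .catch((err) => console.error(err));
```

With a real page you would pass the object returned by instance.createPage() instead of the fake one. The pause between polls is a design choice of this sketch, so the loop does not re-read the content as fast as it can.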
Happy crawling!