Scraping real-time web pages with Node.js

What are the options for scraping website content with Node.js? I would like to build something very, very fast that can do a kayak.com-style search, where one query is dispatched to several different sites, and the results are scraped and returned to the client as they become available.

Suppose this script just needs to provide the results in JSON format, and we can process them either directly in the browser or in another web application.
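The fan-out part of such a search can be sketched with plain promises. This is a minimal sketch only, and the names (`searchAll`, `fetchOne`, `onResult`) are illustrative, not an existing API:

```javascript
// Sketch: send one query to several sites in parallel and hand each
// result to the caller as soon as it arrives. `fetchOne` is a stand-in
// for whatever per-site scraper you plug in.
function searchAll(sites, fetchOne, onResult) {
  // Fire all requests at once; report each success (or failure) as it lands,
  // and resolve once every site has been tried.
  return Promise.allSettled(
    sites.map((site) =>
      fetchOne(site)
        .then((data) => onResult({ site, ok: true, data }))
        .catch((err) => onResult({ site, ok: false, error: String(err) }))
    )
  );
}
```

Each `onResult` call could then be flushed to the browser as a chunk of JSON, so results stream in as they appear rather than waiting for the slowest site.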

A few starting points:

Using Node.js and jQuery to scrape sites

Does anyone have any ideas?

+63
javascript jquery web-scraping screen-scraping
Mar 06
8 answers

Node.io seems to be taking the cake :-)

+24
Mar 12

All of the above solutions assume running the scraper locally. This means you will be severely limited in performance (because they run sequentially or on a limited set of threads). A better approach, imho, is to rely on an existing, albeit commercial, scraping service.

Here is an example:

 var bobik = new Bobik("YOUR_AUTH_TOKEN");
 bobik.scrape({
   urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
   queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
 }, function (scraped_data) {
   if (!scraped_data) {
     console.log("Data is unavailable");
     return;
   }
   var scraped_urls = Object.keys(scraped_data);
   for (var i = 0; i < scraped_urls.length; i++)
     console.log("Results from " + scraped_urls[i] + ": " + scraped_data[scraped_urls[i]]);
 });

Here, the scraping is performed remotely, and your code gets a callback only when the results are ready (there is also an option to collect results as they become available).

You can download the Bobik JavaScript client SDK at https://github.com/emirkin/bobik_javascript_sdk

+5
Jul 14 '12 at 15:44

I did some research myself, and https://npmjs.org/package/wscraper describes itself as a

web scraper agent based on cheerio.js, a fast, flexible and lean implementation of core jQuery; built on top of request.js; inspired by http-agent.js

Very low usage so far (according to npmjs.org), but worth a look for anyone interested.

+2
Jun 03 '13 at 23:49

You don't always need jQuery. If you play with the DOM returned by jsdom, for example, you can easily take what you need yourself (and you also don't have to worry about cross-browser issues). See: https://gist.github.com/1335009 . That's not taking anything away from node.io at all, just saying you can roll your own depending on ...

+1
Apr 24

The new way, using ES7 / promises

Usually when you scrape, you want some method to

  • Get the resource from a web server (usually an HTML document)
  • Read that resource and work with it as either
    • a DOM / tree structure, which you can then navigate, or
    • a token stream, parsed with something like SAX.

Both tree and token parsing have advantages, but tree parsing is usually much simpler, so that is what we will do. Check out request-promise ; here is how it works:

 const rp = require('request-promise');
 const cheerio = require('cheerio'); // Basically jQuery for node.js

 const options = {
   uri: 'http://www.google.com',
   transform: function (body) {
     return cheerio.load(body);
   }
 };

 rp(options)
   .then(function ($) {
     // Process html like you would with jQuery...
   })
   .catch(function (err) {
     // Crawling failed or Cheerio choked...
   });

This uses cheerio , which is essentially a lightweight server-side jQuery-esque library (that doesn't need a window object or jsdom).

Since you are using promises, you can also write this as an asynchronous function. It will look synchronous, but run asynchronously, with ES7:

 async function parseDocument() {
   let $;
   try {
     $ = await rp(options);
   } catch (err) {
     console.error(err);
   }

   console.log( $('title').text() ); // prints just the text in the <title>
 }
+1
May 31 '16 at 2:17

This is my easy-to-use, general-purpose scraper, https://github.com/harish2704/html-scrapper , written for Node.js. It can extract information based on predefined schemas. A schema definition consists of a css selector and a data-extraction function. It currently uses cheerio for parsing under the hood.
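The schema idea (a selector paired with an extraction function per field) can be sketched generically. This is an illustrative sketch only, not the actual html-scrapper API; `select(selector)` stands in for whatever DOM lookup you have (cheerio, jsdom, ...):

```javascript
// Sketch of schema-driven extraction (names are illustrative, not the
// html-scrapper API). Each field pairs a selector with an extractor.
function applySchema(schema, select) {
  const out = {};
  for (const [field, { selector, extract }] of Object.entries(schema)) {
    // Look the node up, then let the field's extractor shape the value.
    out[field] = extract(select(selector));
  }
  return out;
}

// Hypothetical schema: field name -> { selector, extract }
const productSchema = {
  title: { selector: 'h1.title', extract: (text) => text.trim() },
  price: { selector: '.price', extract: (text) => parseFloat(text.replace(/[^0-9.]/g, '')) }
};
```

The appeal of this layout is that the scraping logic stays constant while each target site only needs its own schema object.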

0
May 19 '14 at 5:25

Check out https://github.com/rc0x03/node-promise-parser

  • Fast: uses libxml C bindings
  • Lightweight: no dependencies like jQuery, cheerio, or jsdom
  • Clean: promise based interface - no more nested callbacks
  • Flexible: supports both CSS and XPath selectors
0
Jun 09 '14 at 18:20

I see most answers are on the right track with cheerio and so on; however, once you get to the point where you need to parse and execute JavaScript (à la SPAs and the like), I'd check out https://github.com/joelgriffith/navalia (I'm the author). Navalia is built to support scraping in a headless-browser context, and it's pretty quick. Thanks!

0
Jul 01 '17 at 17:34
