How to parse an HTML page using Node.js

I need to parse (server side) a large number of HTML pages.
We all agree that regular expressions are not the way to go here. It seems to me that JavaScript is the natural way to parse an HTML page, but that assumption relies on server-side code having all the DOM capabilities that JavaScript has inside a browser.

Does Node.js have that ability built in? Is there a better approach to parsing HTML on the server side?

+69
html-parsing server-side
Sep 10 '11 at 16:18
6 answers

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options:

  • BeautifulSoup for Python
  • you can convert your HTML to XHTML and use XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This is useful only if you convert your HTML to XHTML.

Of all these options, I prefer the Node.js one because it uses the standard W3C DOM accessor methods, and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more like the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.

+66
Sep 10 '11 at 16:24

Use Cheerio. It is not as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

❀ Familiar syntax: Cheerio implements a subset of core jQuery. It removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8 times faster than jsdom.

❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.

+52
Nov 12 '13 at 16:36

Use htmlparser2; it is much faster and quite straightforward. See this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And a live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

+7
Nov 28 '14 at 12:04

htmlparser2 by FB55 seems to be a good alternative.

+4
Apr 20 '13 at 18:09

jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.

node-soupselect is a port of Python's BeautifulSoup to Node.js, and it works great.

+1
Aug 24 '13 at 11:40

.NET has the HTML Agility Pack, which is an extremely robust HTML parsing library.

0
Sep 10 2018-11-11T00: