How to parse an HTML page using Node.js

Question

I need to parse (server side) a large number of HTML pages.
We all agree that regexp is not the way to go here. It seems to me that JavaScript is the native way to parse an HTML page, but that assumption relies on the server-side code having all the DOM abilities that JavaScript has inside a browser.

Does Node.js have that ability built in? Is there a better approach to this problem of parsing HTML on the server side?

+69

node.js html-parsing server-side

Itay Moav -Malimovka Sep 10 '11 at 16:18

6 answers

Use Cheerio. It is not as strict as jsdom, and it is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8 times faster than jsdom.
❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.

+52

Meekohi Nov 12 '13 at 16:36

Use htmlparser2. It is way faster and pretty straightforward. Refer to this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And a live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

+7

Anderson Madeira Nov 28 '14 at 12:04

Htmlparser2 from FB55 seems like a good alternative.

+4

esp Apr 20 '13 at 18:09

jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.

node-soupselect is a port of Python's BeautifulSoup to Node.js, and it works great.

+1

Yarek T Aug 24 '13 at 11:40

.NET has the HTML Agility Pack , which is an extremely robust HTML parsing library.

0

josh3736 Sep 10 2018-11-11T00:

kzh · Accepted Answer · 2011-09-10 16:24

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options:

BeautifulSoup for Python
Converting the HTML to XHTML and using XSLT
HTMLAgilityPack for .NET
CsQuery for .NET (my new favorite)
The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML in order to write XSLT is just plain sadistic.
