Nokogiri is based on libxml2, which is one of the fastest XML / HTML parsers in any language. It is written in C, but there are bindings in many languages.
The problem is that the more complex the file, the longer it takes to build the complete DOM structure in memory. Building a DOM is slower and more memory-hungry than other parsing methods (typically, the entire DOM has to fit into memory). XPath relies on this DOM.
People often turn to SAX for speed, or for large documents that do not fit into memory. It is event-driven: it notifies you of a start element, an end element, and so on, and you write handlers that respond to those events. This is a bit of a pain because you end up tracking state yourself (for example, which elements you are "inside").
There is a middle ground: some parsers support "pull parsing", where you navigate the document with a cursor. You still visit each node sequentially, but you can "fast forward" to the end of an element you are not interested in. It has the speed of SAX but a better interface for many applications. I don't know if Nokogiri can do this for HTML, but I would look into its Reader API if you're interested.
Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML), and this alone makes it a very good choice for parsing HTML.
Mark Thomas