What are fast XML parsers for Ruby?

I use Nokogiri, which works well for small documents. But for a 180KB HTML file, I have to increase the process stack size via ulimit -s, and the parsing and XPath queries take a long time.

Are there any faster methods available in the stock Ruby distribution?

I'm used to XPath, but the solution does not have to support XPath.

Criteria:

  • Fast to write.
  • Fast execution.
  • Robust final parser.
+6
5 answers

Nokogiri is based on libxml2, which is one of the fastest XML / HTML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build the complete DOM structure in memory. Building a DOM is slower and more memory-hungry than other parsing methods (typically, the entire DOM has to fit into memory). XPath relies on this DOM.

SAX is often used for speed, or for large documents that do not fit into memory. It is event-driven: the parser notifies you of a start element, an end element, and so on, and you write handlers to respond to those events. This is a bit of a pain because you end up tracking state yourself (for example, which elements you are currently "inside").
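To make the event-driven style concrete, here is a minimal sketch using REXML's stream listener. REXML is chosen only because it needs no extra gem, not for speed; the `TitleListener` class and the sample XML are made up for illustration. Note the `@inside_title` flag, which is exactly the kind of state you have to track yourself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects the text of every <title> element
# without ever building a DOM.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @inside_title = false # state we must track ourselves
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    return unless name == 'title'

    @inside_title = true
    @titles << String.new
  end

  # Called for every run of character data.
  def text(data)
    @titles.last << data if @inside_title
  end

  # Called for every closing tag.
  def tag_end(name)
    @inside_title = false if name == 'title'
  end
end

xml = <<~XML
  <library>
    <book><title>Dune</title><author>Herbert</author></book>
    <book><title>Emma</title><author>Austen</author></book>
  </library>
XML

listener = TitleListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.titles.inspect
```

The handler only ever sees one event at a time, so memory use stays flat regardless of document size, at the cost of the bookkeeping shown above.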

There is a middle ground: some parsers support "pull parsing", where you navigate the document with a cursor. You still visit each node sequentially, but you can "fast forward" to the end of an element you are not interested in. It has the speed of SAX but a nicer interface for many uses. I don't know whether Nokogiri can do this for HTML, but I would look into its Reader API if you're interested.
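As a rough sketch of the pull style, the loop below uses REXML's pull parser from the standard distribution (again chosen because it needs no extra gem, not for speed). The sample XML and the idea of skipping a `<notes>` subtree are assumptions for illustration. REXML has no built-in "skip" call as far as I know, so the fast-forward is emulated by discarding events until the matching end tag.

```ruby
require 'rexml/parsers/pullparser'

xml = <<~XML
  <library>
    <book><title>Dune</title><notes><title>ignored</title></notes></book>
    <book><title>Emma</title></book>
  </library>
XML

parser = REXML::Parsers::PullParser.new(xml)
titles = []
skip_depth = 0   # > 0 while inside a subtree we want to skip
in_title = false

while parser.has_next?
  event = parser.pull

  # "Fast forward": discard events until the skipped element is closed.
  if skip_depth > 0
    skip_depth += 1 if event.start_element?
    skip_depth -= 1 if event.end_element?
    next
  end

  if event.start_element?
    case event[0] # element name
    when 'notes' then skip_depth = 1
    when 'title' then in_title = true
    end
  elsif event.end_element?
    in_title = false if event[0] == 'title'
  elsif event.text? && in_title
    titles << event[0] # text content
  end
end

puts titles.inspect
```

The caller drives the loop here, rather than the parser calling back into handlers, so control flow such as skipping or stopping early is ordinary Ruby code.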

Note that Nokogiri is also very lenient with incorrect markup (e.g. real HTML), and this alone makes it a very good choice for parsing HTML.

+6

Check out the Ox gem. It is faster than LibXML and Nokogiri, and it supports in-memory parsing as well as SAX callback parsing. Full disclosure: I wrote it.

The performance comparison at http://www.ohler.com/software/thoughts/Blog/Entries/2011/9/21_XML_with_Ruby.html covers both DOM (in-memory) and SAX (callback) parsing.

+15

You may find that for large XML documents, DOM parsing is not very efficient. This is because the parser must build an in-memory map of the structure of the XML document.

Another approach, which usually requires less memory, is to use an event-driven SAX parser.

Nokogiri has full SAX support.

0

Depending on your environment, Oga may be a better fit: it is a reasonably fast XML parser for Ruby with a much friendlier interface and a faster installation.

0