Is there an XmlReader equivalent for HTML in .NET?

Question

I have used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like the fact that it only offers a DOM.

With large documents and/or deeply nested markup, you can run into stack overflow or out-of-memory exceptions. In addition, in the general case a DOM-based parsing model uses significantly more memory than a streaming approach, because a process that consumes HTML usually needs only a few elements to be available at any one time.

Does anyone know of a decent HTML parser for .NET that lets you parse HTML the way the XmlReader class parses XML, i.e. in a forward-only manner?

+4

html .net parsing xmlreader html-agility-pack

Robv Jun 23 '11 at 10:15

2 answers

Mike Mooney · Answer 1 · 2011-06-23T10:24:37+0000

I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader

As other users have said, the problem is that HTML does not follow the same well-formedness rules as XML, so it is hard to parse, but SgmlReader usually does a pretty good job.
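As a rough illustration of that approach, here is a minimal sketch that walks sloppy HTML in a single forward pass. It assumes the SGMLReader package is referenced and that its `Sgml.SgmlReader` class (an `XmlReader` subclass) is configured through the `DocType`, `CaseFolding` and `InputStream` properties; the `SgmlForwardOnly` class and `ReadElementNames` method are hypothetical names made up for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using Sgml; // assumption: provided by the SGMLReader package

public static class SgmlForwardOnly
{
    // Hypothetical helper: returns the element names in document order,
    // reading the input once and never building a DOM.
    public static List<string> ReadElementNames(TextReader html)
    {
        var names = new List<string>();
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                  // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower; // normalize tag names to lower case
            reader.InputStream = html;

            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    names.Add(reader.LocalName);
                }
            }
        }
        return names;
    }
}
```

Because the reader derives from `XmlReader`, code written against `XmlReader` should be able to consume it unchanged. How missing end tags are repaired is up to the library, so the exact node sequence for a badly broken page may differ from what a browser would build.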

jgauffin · Answer 2 · 2011-06-23T10:20:40+0000

The problem is that HTML may be malformed. And you cannot know which tag is missing its end tag (or which tags are in the wrong order) until you have parsed most of the document.

If the documents you are parsing are well-formed, why not use XmlReader?
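For the well-formed case, a minimal sketch of forward-only reading with the standard `System.Xml.XmlReader` might look like this. The `ForwardOnlyHtml` class and `ReadLinks` method are hypothetical names for illustration, and the input is assumed to be well-formed XHTML.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ForwardOnlyHtml
{
    // Hypothetical helper: collects the href of every <a> element in one
    // forward pass, holding only the current node in memory.
    public static List<string> ReadLinks(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // Skip any DOCTYPE instead of trying to resolve it.
            DtdProcessing = DtdProcessing.Ignore,
            IgnoreWhitespace = true
        };

        var links = new List<string>();
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "a")
                {
                    string href = reader.GetAttribute("href");
                    if (href != null)
                    {
                        links.Add(href);
                    }
                }
            }
        }
        return links;
    }
}
```

Calling `ForwardOnlyHtml.ReadLinks(new StringReader(markup))` throws an `XmlException` as soon as the markup stops being well-formed, for example on an unclosed `<br>`. With the DTD ignored, HTML entities such as `&nbsp;` are also rejected as undeclared, which is why this only works for documents known to be clean.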
