Can the Html Agility Pack be used to parse HTML snippets?

I need to get the LINK and META elements from ASP.NET pages, user controls, and master pages, grab their contents, and then write the updated values back to these files in the utility I'm working on.

I could try using regular expressions to capture only these elements, but there are a few problems with this approach:

  • I expect that many input files will contain broken HTML (missing or out-of-sequence elements, etc.).
  • SCRIPT elements may contain comments and/or VBScript/JavaScript that looks like valid markup.
  • I have to handle IE conditional comments, including META and LINK elements that appear inside those conditional comments.
  • Not to mention that HTML is not a regular language.

I did some research on HTML parsers for .NET, and many SO posts and blogs recommend the Html Agility Pack. I have never used it before, and I don't know whether it can parse broken HTML and HTML snippets. (For example, imagine a user control that contains only a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation, but it would save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)

2 answers

Absolutely, this is exactly what it excels at.

In fact, many of the web pages you'll find in the wild could be described as HTML snippets, because they lack <html> tags or contain incorrectly closed tags.

HtmlAgilityPack mimics what a browser has to do: try to make sense of what is sometimes a mess of mismatched tags. It's not an exact science, but HtmlAgilityPack does it very well.
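For instance, a minimal sketch of parsing a HEAD-only snippet like the one the question describes (assumes the HtmlAgilityPack NuGet package; the snippet string and variable names are illustrative):

```csharp
using System;
using HtmlAgilityPack;

class SnippetDemo
{
    static void Main()
    {
        // A fragment with no <html> or <body> wrapper, as in a user control
        var snippet = "<head><meta name=\"description\" content=\"demo\">" +
                      "<link rel=\"stylesheet\" href=\"site.css\"></head>";

        var doc = new HtmlDocument();
        doc.LoadHtml(snippet); // LoadHtml accepts fragments as-is, no wrapping needed

        // XPath runs against whatever structure was actually parsed
        var nodes = doc.DocumentNode.SelectNodes("//meta|//link");
        foreach (var node in nodes)
        {
            // For META read the content attribute, for LINK read href
            Console.WriteLine("{0}: {1}", node.Name,
                node.GetAttributeValue("href",
                    node.GetAttributeValue("content", "")));
        }
    }
}
```

Note that SelectNodes returns null (not an empty collection) when nothing matches, so production code should check for that before iterating.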


An alternative to the Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which many people find easier than XPATH. Additionally, its HTML parser was designed with several use cases in mind, and there are several options for parsing HTML: as a full document (html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still required in the DOM are added, just as browsers do); and as a true fragment, where no tags are created (e.g., if you're just working with building blocks).
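A sketch of the three parsing modes just described (assumes the CsQuery NuGet package; the snippet string is illustrative):

```csharp
using System;
using CsQuery;

class ParseModes
{
    static void Main()
    {
        var snippet = "<td>cell</td>";

        // Full document: html/body are added, stray content moved into body
        CQ asDocument = CQ.CreateDocument(snippet);

        // Content block: not wrapped as a document, but required optional
        // tags are generated the way a browser would generate them
        CQ asContent = CQ.Create(snippet);

        // True fragment: parsed exactly as-is, no extra tags created
        CQ asFragment = CQ.CreateFragment(snippet);

        Console.WriteLine(asDocument.Render());
        Console.WriteLine(asFragment.Render());
    }
}
```

The fragment mode is the closest match to the original question's HEAD-only user control, since nothing is wrapped or moved.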

See creating a new DOM for details.

Additionally, the CsQuery HTML parser was designed around the HTML5 specification's rules for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when a block should be closed. To produce the same DOM a browser would, the parser must implement those same rules. CsQuery does this to provide a high degree of browser-DOM compatibility for a given source.

Using CsQuery is very straightforward; for example:

    CQ docFromString = CQ.Create(htmlString);
    CQ docFromWeb = CQ.CreateFromUrl(someUrl);

    // there are other methods for asynchronous web gets, creating from files, streams, etc.

    // CSS selector: the indexer [] is like jQuery $(..)
    CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];

    // Text() is a jQuery method returning the text contents of the selection
    string textOfCell = lastCellInFirstRow.Text();

Finally, CsQuery indexes documents on class, id, attribute, and tag, making selectors very fast compared to the Html Agility Pack.

