XML parsing

I am trying to load a fragment of (possibly) malformed HTML into an XmlDocument object, but it fails with XmlExceptions because there are extra opening/closing tags and tags that are invalid in XML, such as <img> instead of <img />.

How can I get the XML to parse despite the errors in the data? Is there an XML validator I can apply before parsing to correct these errors? Or would handling the exception let me parse whatever can be parsed?

+6
c# xml parsing xml-parsing xmldocument
6 answers

The HTML Agility Pack parses HTML, not just XHTML, and is quite forgiving. Its object model will feel familiar if you have used XmlDocument.
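As a rough sketch of what that looks like, the snippet below loads a malformed fragment and reads the src of each <img>. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (AgilityPackSketch, ImageSources) are made up for illustration.

```csharp
// Assumes the HtmlAgilityPack NuGet package (an external dependency).
using System;
using HtmlAgilityPack;

public static class AgilityPackSketch
{
    // Returns the src attribute of every <img> in a possibly malformed fragment.
    public static string[] ImageSources(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment); // tolerant parser: unclosed or stray tags are accepted

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//img");
        if (nodes == null) return Array.Empty<string>();

        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            // Second argument is the default used when the attribute is missing.
            result[i] = nodes[i].GetAttributeValue("src", "");
        }
        return result;
    }
}
```

A call such as AgilityPackSketch.ImageSources("<p>Hello<img src=\"a.png\"><b>bold</p>") should work even though neither <img> nor <b> is closed, which is exactly the input XmlDocument rejects.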

+14

You might want to check the answer to this question.

Basically, somewhere between a .NET port of BeautifulSoup and the HTML Agility Pack, there is a way.

+2

It is unlikely that you can build an XmlDocument from input that is this malformed. XmlDocument (as far as I know) requires the XML content to be properly nested and closed.

However, I suspect you may be able to parse this with an XmlReader instead. It may still throw exceptions when it hits a glaring error, but according to the MSDN docs it can at least report the location of the errors.
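A minimal sketch of that idea, using only System.Xml: read the input to the end and, if the reader throws, report where. The names XmlErrorLocator and FindFirstError are hypothetical. Note that the reader stops at the first error, so this locates one problem at a time rather than listing all of them.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlErrorLocator
{
    // Reads the input to the end with XmlReader. Returns null if it is
    // well-formed, otherwise a message giving the first error's position.
    public static string FindFirstError(string text)
    {
        var settings = new XmlReaderSettings
        {
            // Fragment mode permits several top-level nodes, as in an HTML snippet.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(text), settings))
            {
                while (reader.Read()) { } // just walk every node
            }
            return null;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition are 1-based positions in the input.
            return $"line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

This only tells you where the markup breaks; it does not repair anything, so it is more useful for diagnostics than for loading real-world HTML.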

If you're just working with HTML, the HTML Agility Pack may serve your purpose.

+1

Depending on your specific needs, you can use HTML Tidy to clean up the document and then load the result into an XmlDocument object.

+1

What you are trying to do is very difficult. HTML cannot be parsed with an XML parser because XML is strict and HTML is not. If the HTML were valid XHTML (HTML as XML), an XML parser would parse it without problems.

You might want to look for an HTML-to-XHTML converter if you really want to use an XML parser for HTML.
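One possible converter is the HTML Agility Pack mentioned in the other answers, which can write out what it parsed as XML. The sketch below assumes the HtmlAgilityPack NuGet package is referenced; HtmlToXmlSketch and Load are illustrative names. I have not checked the output for every kind of broken input, so treat the final LoadXml call as something that can still throw.

```csharp
// Assumes the HtmlAgilityPack NuGet package (an external dependency).
using System.IO;
using System.Xml;
using HtmlAgilityPack;

public static class HtmlToXmlSketch
{
    // Parses tolerant HTML, re-serializes it as XML, then loads that into XmlDocument.
    public static XmlDocument Load(string fragment)
    {
        var html = new HtmlDocument
        {
            OptionOutputAsXml = true // ask the library to emit XML-style markup on Save
        };
        html.LoadHtml(fragment);

        using (var writer = new StringWriter())
        {
            html.Save(writer);

            var xml = new XmlDocument();
            xml.LoadXml(writer.ToString()); // may still throw XmlException on odd input
            return xml;
        }
    }
}
```

If you only need to query the markup, working on the HtmlDocument directly avoids the round trip; the conversion is mainly useful when downstream code insists on an XmlDocument.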

That said, I have yet to come across an XML parser that handles malformed XML. They are not designed to accept loose markup such as HTML (for good reason, too :))

0

You cannot load malformed XML into an XmlDocument.

Check out the HTML Agility Pack on CodePlex.

0