Parsing an HTML document using an XML parser

Is it possible to parse an HTML file using an XML parser?

Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.

The intended use is to create an HTML parser that is part of a web crawler application.

+5
3 answers

You can try parsing an HTML file with an XML parser, but it will most likely fail. The reason is that HTML documents may use the following HTML features, which XML parsers do not understand:

  • elements that never have end tags and that do not use XML's so-called "self-closing tag syntax"; e.g. <br>, <meta>, <link> and <img> (also known as void elements)
  • elements that do not require end tags; e.g. <p>, <dt>, <li> (their end tags may be implied)
  • elements that may contain unescaped "<" markup characters; e.g. style, textarea, title, script: <script> if (a < b) … </script>, <title>Using the "<" operator</title>
  • attributes with unquoted values; e.g. <meta charset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g. <input disabled>

An XML parser will fail to parse any HTML document that uses any of these features.

An HTML parser, on the other hand, will basically never fail, no matter what the document contains.
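The contrast can be checked with a short sketch. It uses only Python's standard library: xml.etree.ElementTree as the strict XML parser and html.parser.HTMLParser as a lenient HTML parser. The SNIPPETS dictionary and the helper names are made up for this illustration, and html.parser is only a tolerant tokenizer, not a full implementation of the HTML standard's parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragments, one per HTML feature listed above.
SNIPPETS = {
    "void element": "<p>line one<br>line two</p>",
    "implied end tag": "<ul><li>one<li>two</ul>",
    "unescaped '<' in script": "<script>if (a < b) {}</script>",
    "unquoted attribute value": "<meta charset=utf-8>",
    "empty attribute": "<input disabled>",
}


class TagCollector(HTMLParser):
    """Records the start tags seen by the lenient stdlib HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_accepts(text):
    """Return True if the strict XML parser accepts the fragment."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_tags(text):
    """Return the start tags the HTML parser reports for the fragment."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


results = {name: (xml_accepts(s), html_tags(s)) for name, s in SNIPPETS.items()}
for name, (ok, tags) in results.items():
    print(f"{name}: XML accepted={ok}, HTML start tags={tags}")
```

Every fragment is rejected by the XML parser, while the HTML parser still reports the tags it found.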


All that said, work has also been done toward a new type of XML parsing, so-called XML5 parsing, which can handle things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.


The intended use is to create an HTML parser that is part of a web crawler application

If you intend to create a web crawler application, you should absolutely use an HTML parser and, ideally, one that complies with the parsing requirements of the HTML standard. Such parsers exist for many languages, for example:

  • parse5 (node.js / JavaScript)
  • html5lib (Python)
  • html5ever (Rust)
  • validator.nu HTML5 parser (Java)
  • gumbo (C, with bindings for Ruby, Objective-C, C++, PHP, C#, Perl, Lua, D, Julia ...)
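As a rough illustration of the crawler use case, here is a minimal link extractor. To stay dependency-free it uses the standard library's html.parser as a stand-in; that module is tolerant of the features above but is not a standard-conformant HTML parser, so a real crawler would more likely use one of the libraries just listed. The class name, function name, sample markup and URLs are all invented for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> start tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page using an unquoted attribute, implied </p> end tags,
# a void element and an empty attribute.
SAMPLE = """<!doctype html>
<p>See <a href=/docs>the docs</a>
<p>Or <a href="https://example.org/about">about</a><br>
<input disabled>
"""

links = extract_links(SAMPLE, "https://example.com/start")
print(links)
```

The extractor never needs the page to be well-formed; it simply reacts to the start tags the parser reports.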
+6

syntactically they are almost identical

Computers are picky. "Almost identical" is not good enough. HTML allows things that XML does not, so an XML parser will reject (many, though not all) HTML documents.

In addition, there is a different quality culture. With HTML, the culture for a parser is "try to do something with the input if you possibly can." With XML, the culture is "if it is faulty, send it back for repair or replacement."

+5

XML parsers stop as soon as they find that the content is not well-formed XML.
Some XML rules (such as those about illegal characters) do not apply to HTML, so an XML parser will treat much HTML as not well-formed and will not continue.

Consider the following HTML page:

 <!doctype html>
 <html>
 <head><title>Test</title></head>
 <body>
 <input type="checkbox" name="azerty" checked />
 <p>if A=B & B>D, then A>D</p>
 </body>
 </html>

This is well-formed, valid HTML, as you can verify with the W3C validator (validator.w3.org).

Now try checking the following as XML (e.g. at http://www.xmlvalidation.com):

 <?xml version="1.0"?>
 <html>
 <head><title>Test</title></head>
 <body>
 <input type="checkbox" name="azerty" checked />
 <p>if A=B & B>D, then A>D</p>
 </body>
 </html>

You will be told that it is not well-formed XML, because the checked attribute is not followed by an equals sign and a value.
Correct this, and you will then be told that '&' is an illegal character. Replace it with the corresponding entity, &amp;. A stricter tool may then also object to the bare '>' characters; XML itself permits '>' in text except in the sequence ']]>', but escaping it as &gt; is the safe choice.
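The same sequence of fixes can be walked through with Python's standard-library xml.etree.ElementTree, which is backed by the expat parser. This is only a sketch: the online validator mentioned above may word its messages differently, and the first_error helper and step names are invented here. With expat, the document parses once the checked attribute and the '&' are fixed, and the bare '>' characters in the text are accepted.

```python
import xml.etree.ElementTree as ET

# The XML variant of the example page, joined into a single string.
ORIGINAL = (
    '<?xml version="1.0"?>'
    "<html><head><title>Test</title></head><body>"
    '<input type="checkbox" name="azerty" checked />'
    "<p>if A=B & B>D, then A>D</p>"
    "</body></html>"
)


def first_error(document):
    """Return the parser's first error message, or None if it parses."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


step1 = ORIGINAL
# Fix 1: give the boolean attribute an explicit value.
step2 = step1.replace("checked />", 'checked="checked" />')
# Fix 2: escape the ampersand.
step3 = step2.replace(" & ", " &amp; ")

errors = [first_error(doc) for doc in (step1, step2, step3)]
for number, message in enumerate(errors, start=1):
    print(f"step {number}: {message or 'parsed without error'}")

paragraph = ET.fromstring(step3).find("body/p").text
print(paragraph)
```

Each run reports only the first problem it meets, which is why the errors have to be fixed one at a time.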

Whatever tool you use to parse HTML as XML will probably run into an error of this kind. As soon as it finds the first one, it stops processing your document as not well-formed XML.

You will still have a chance if the HTML page you are trying to parse is well-formed XHTML 1.0 Strict or XHTML 1.1 ...
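One way to act on this is to try the strict XML parser first and fall back to an HTML parser when it refuses. The sketch below does that for the page title, using only the standard library; page_title, TitleGrabber and both sample documents are hypothetical. Note that even well-formed XHTML can still fail in this setup if it relies on named entities such as &nbsp;, because ElementTree does not load the XHTML DTD that defines them.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


class TitleGrabber(HTMLParser):
    """Collects the text inside <title> using the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(markup):
    """Return (title, parser_used), trying strict XML parsing first."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        grabber = TitleGrabber()
        grabber.feed(markup)
        grabber.close()
        return grabber.title, "html"
    node = root.find(f".//{XHTML_NS}title")
    if node is None:
        node = root.find(".//title")  # XML without the XHTML namespace
    return (node.text or "" if node is not None else ""), "xml"


XHTML_PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head>"
    "<body><p>A &amp; B<br /></p></body></html>"
)
HTML_PAGE = (
    "<!doctype html><html><head><title>Test</title></head>"
    "<body><input checked><p>A & B</body></html>"
)

print(page_title(XHTML_PAGE))
print(page_title(HTML_PAGE))
```

For a crawler, the fallback path is the one that will be taken for most pages, so the HTML parser remains the essential part.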

+3