XML parsers stop as soon as they find that the content is not well-formed XML.
Some XML rules (such as the one on illegal characters) do not apply to HTML, so an XML parser will treat your HTML as not well-formed and will refuse to continue.
Consider the following HTML page:
<!doctype html> <html> <head><title>Test</title></head> <body> <input type="checkbox" name="azerty" checked /> <p>if A=B & B>D, then A>D</p> </body> </html>
This is well-formed, valid HTML, as you can check with the W3C validator (validator.w3.org).
Now try checking the following as XML (e.g. at http://www.xmlvalidation.com ):
<?xml version="1.0"?> <html> <head><title>Test</title></head> <body> <input type="checkbox" name="azerty" checked /> <p>if A=B & B>D, then A>D</p> </body> </html>
You will be told that it is not well-formed XML, because the checked attribute is not followed by an equals sign and a value.
Correct this, and you will be told that '&' is an illegal character. Replace it with the corresponding entity &amp;. A literal '>' is allowed in XML text, although it is commonly escaped as &gt; as well.
Any tool that tries to parse HTML as XML will probably run into an error of this kind. As soon as it finds the first one, it stops processing the document, which it sees as ill-formed XML.
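To see this stop-at-first-error behaviour for yourself, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name first_xml_error and the BODY template are made up for illustration, and the XML declaration is left out for brevity. With this particular parser, the document is accepted once the attribute and the ampersand are fixed, even with the literal '>' still in place.

```python
import xml.etree.ElementTree as ET

# Hypothetical template reproducing the page from the example above.
BODY = "<html><head><title>Test</title></head><body>{field}<p>{text}</p></body></html>"


def first_xml_error(document):
    """Return the parser's first error message, or None if the input is well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        # The parser gives up here; nothing after the error position is processed.
        return str(exc)
    return None


# 1. The original HTML: bare "checked" attribute and a raw "&".
html_like = BODY.format(
    field='<input type="checkbox" name="azerty" checked />',
    text="if A=B & B>D, then A>D",
)

# 2. Attribute fixed, raw "&" still present.
attr_fixed = BODY.format(
    field='<input type="checkbox" name="azerty" checked="checked" />',
    text="if A=B & B>D, then A>D",
)

# 3. Attribute fixed and "&" escaped.
fully_fixed = BODY.format(
    field='<input type="checkbox" name="azerty" checked="checked" />',
    text="if A=B &amp; B>D, then A>D",
)

for label, doc in [("html", html_like), ("attr fixed", attr_fixed), ("fully fixed", fully_fixed)]:
    print(label, "->", first_xml_error(doc))
```

Each of the first two documents yields exactly one error message, because the parser never gets far enough to report the second problem.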
You still have a chance if the HTML page you are trying to parse is well-formed XHTML 1.0 Strict or XHTML 1.1 ...
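As a rough illustration of that last case, the sketch below feeds an XHTML 1.0 Strict version of the same page to Python's xml.etree.ElementTree. The XHTML string and the "x" namespace prefix are assumptions made for this example. Note that XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so lookups have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Assumed XHTML 1.0 Strict rewrite of the example page (not from a real site).
XHTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body>
    <p><input type="checkbox" name="azerty" checked="checked" /></p>
    <p>if A=B &amp; B&gt;D, then A&gt;D</p>
  </body>
</html>"""

# Prefix chosen for this sketch; any prefix mapped to the XHTML namespace works.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)
title = root.find("x:head/x:title", NS).text
checkbox = root.find(".//x:input", NS)
paragraph_text = root.findall(".//x:p", NS)[1].text

print(title)
print(checkbox.get("checked"))
print(paragraph_text)
```

This only works because the page uses nothing beyond the predefined XML entities. Named HTML entities such as &nbsp; are declared in the XHTML DTD, which this parser does not load.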