How to process invalid HTML documents from the Internet with a library that requires well-formed HTML

I get errors like these while parsing a website: "The declaration for the ContentType must end with '>'", or a message saying that the input element must be closed.

1 answer

Have you considered JTidy?

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool to clean up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document, so it can serve as a DOM parser for real-world HTML.

Obviously, it will still struggle if the HTML is badly enough malformed, but you may find that it works well enough for your pages.
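
For what it's worth, here is a minimal sketch of how that might look. It assumes the JTidy jar is on the classpath; the class name `TidyExample`, the helper `parseLenient`, and the sample markup are made up for illustration and are not taken from the question. The idea is to let JTidy repair the markup (unquoted attributes, unclosed `input` and `p` tags) and hand back a standard `org.w3c.dom.Document` instead of feeding the raw page to a strict XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyExample {

    // Hypothetical helper: parses possibly malformed HTML and returns
    // the DOM tree that JTidy builds after repairing the markup.
    static Document parseLenient(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // do not print a warning per fixed problem
        tidy.setXHTML(true);          // emit well-formed XHTML if output is written

        // The sample input is plain ASCII, so no encoding setup is needed here.
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII));

        // The second argument is an optional OutputStream for the cleaned
        // markup; null means "only build the DOM".
        return tidy.parseDOM(in, null);
    }

    public static void main(String[] args) {
        // Illustrative input: unquoted attributes, unclosed <input> and <p>.
        String broken = "<html><head><title>Demo</title></head><body>"
                + "<form><input type=text name=q></form>"
                + "<p>Unclosed paragraph</body></html>";

        Document doc = parseLenient(broken);
        NodeList inputs = doc.getElementsByTagName("input");
        System.out.println("input elements found: " + inputs.getLength());
    }
}
```

If you need the repaired markup itself rather than a DOM (for example to pass it on to your existing strict parser), you can pass an `OutputStream` as the second argument to `parseDOM` and read the cleaned document from there.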

