Have you considered JTidy?
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and invalid HTML. In addition, JTidy provides a DOM interface to the parsed document, so you can use it as a DOM parser for real-world HTML.
Obviously, at some point it will struggle, depending on how badly formed the HTML is, but you may find that it works for you.
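For what it's worth, here is a minimal sketch of using JTidy as a DOM parser. It assumes the JTidy jar (the `org.w3c.tidy` package) is on the classpath. The class name `JTidyExample`, the helper `parse`, and the sample markup are made up for illustration. The input is plain ASCII, so no encoding option is set here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyExample {

    /** Hypothetical helper: parses possibly malformed HTML into a W3C DOM document. */
    static Document parse(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary output
        tidy.setShowWarnings(false);  // suppress per-problem warnings

        ByteArrayInputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        // JTidy writes the cleaned-up markup here; this sketch just discards it.
        ByteArrayOutputStream cleaned = new ByteArrayOutputStream();

        return tidy.parseDOM(in, cleaned);
    }

    public static void main(String[] args) {
        // Illustrative broken HTML: unclosed <p> and <a>, no <head> or <title>.
        String broken = "<html><body><p>First<p>Second <a href=\"/x\">link</body>";

        Document doc = parse(broken);

        // Walk the repaired tree with the standard DOM API.
        NodeList links = doc.getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            Element a = (Element) links.item(i);
            System.out.println(a.getAttribute("href"));
        }
    }
}
```

One caveat: as far as I know, JTidy's DOM implementation only covers the older DOM levels, so it is probably safest to stick to basic calls such as `getElementsByTagName`, `getAttribute` and `getNodeValue`.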