Jsoup, including the latest version 1.7.2, has a bug when parsing invalid HTML with unclosed tags.
Example:
String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
Jsoup.parse(tmp);
The document that is generated:
<html> <head></head> <body> <a href="www.google.com">Link</a> <p><a>Error link</a></p> </body> </html>
Browsers will generate something like:
<html> <head></head> <body> <a href="www.google.com">Link</a> <p><a href="www.google.com">Error link</a></p> </body> </html>
Jsoup should either behave like browsers or preserve the markup as it appears in the source.
Is there any solution? Looking into the API, I did not find anything.
java html-parsing web-crawler jsoup
Javier salinas