JSoup parsing invalid HTML with unclosed tags

JSoup, up to and including the latest version 1.7.2, parses invalid HTML with unclosed tags incorrectly.

Example:

 String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
 Jsoup.parse(tmp);

The document that is generated:

 <html> <head></head> <body> <a href="www.google.com">Link</a> <p><a>Error link</a></p> </body> </html> 

Browsers will generate something like:

 <html> <head></head> <body> <a href="www.google.com">Link</a> <p><a href="www.google.com">Error link</a></p> </body> </html> 

Jsoup should either behave the way browsers do or keep the markup as it appears in the source.

Is there any solution? Looking into the API, I did not find anything.

+8
java html-parsing web-crawler jsoup
3 answers

The correct behavior is to act like browsers do when parsing this invalid HTML. Thank you for reporting this bug. I have fixed the issue that prevented the adoption agency algorithm (the HTML5 parsing step that handles misnested formatting elements such as <a>) from retaining the original attributes in the new node. The fix will be available in 1.7.3, or you can build from head now.
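To make the fix easier to picture, here is a minimal, self-contained sketch of the idea behind it. It is not jsoup's source code and it does not implement the full adoption agency algorithm: the `Node` class and the `adopt` helper are hypothetical names invented for this illustration. It only shows the one detail the bug was about, namely that the element re-created inside the block has to be cloned together with its attributes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AdoptionSketch {
    /** Minimal stand-in for a DOM element (hypothetical, not a jsoup class). */
    static final class Node {
        final String tag;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>(); // each child is a Node or a String

        Node(String tag) { this.tag = tag; }

        /** Copies the tag and the attributes, but none of the children. */
        Node shallowClone() {
            Node copy = new Node(tag);
            copy.attrs.putAll(attrs); // the step the reported bug effectively skipped
            return copy;
        }

        String html() {
            StringBuilder sb = new StringBuilder("<").append(tag);
            for (Map.Entry<String, String> e : attrs.entrySet()) {
                sb.append(' ').append(e.getKey()).append("=\"").append(e.getValue()).append('"');
            }
            sb.append('>');
            for (Object child : children) {
                sb.append(child instanceof Node ? ((Node) child).html() : child.toString());
            }
            return sb.append("</").append(tag).append('>').toString();
        }
    }

    /**
     * Simplified re-parenting: moves 'block' out of 'formatting', makes it the next
     * sibling of 'formatting' under 'parent', and wraps the block's content in a
     * clone of the formatting element.
     */
    static void adopt(Node parent, Node formatting, Node block) {
        formatting.children.remove(block);
        Node wrapper = formatting.shallowClone();
        wrapper.children.addAll(block.children);
        block.children.clear();
        block.children.add(wrapper);
        parent.children.add(parent.children.indexOf(formatting) + 1, block);
    }

    /** Builds the tree for <a href='www.google.com'>Link<p>Error link</a> and fixes it up. */
    static String demo() {
        Node body = new Node("body");
        Node a = new Node("a");
        a.attrs.put("href", "www.google.com");
        a.children.add("Link");
        Node p = new Node("p");
        p.children.add("Error link");
        a.children.add(p);
        body.children.add(a);
        adopt(body, a, p);
        return body.html();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If `shallowClone` left out the `putAll` call, the second anchor would come out as a bare <a>, which matches the output reported in the question.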

+5

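If your goal is to get the same markup that browsers produce, you can load the page with Selenium and then pass the resulting page source to Jsoup for parsing. Selenium has to open a real browser, but it can launch it automatically. The code looks like this:

 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.openqa.selenium.WebDriver;
 import org.openqa.selenium.firefox.FirefoxDriver;

 public static void main(String[] args) {
     // To use Chrome instead of Firefox:
     //System.setProperty("webdriver.chrome.driver", "./chromedriver.exe");
     //WebDriver driver = new ChromeDriver();
     WebDriver driver = new FirefoxDriver();

     // Let the browser load the file and repair the markup
     driver.get("file:///C:/Users/jgong/Desktop/a.html");
     String html = driver.getPageSource();
     System.out.println(html);
     driver.quit();

     // Parse the browser-corrected markup with Jsoup
     Document doc = Jsoup.parse(html);
     System.out.println(doc.html());
 }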

and a.html:

 <html><head></head><body><a href="www.google.com">Link<p>Error link</a></body></html> 

and the result is what you wanted:

 <html><head></head> <body> <a href="www.google.com">Link</a><p><a href="www.google.com">Error link</a> </p></body></html> 
+2

Your HTML is invalid. A validator reports:

document type does not allow element "P" here; missing one of "APPLET", "OBJECT", "MAP", "IFRAME", "BUTTON" start-tag

 <a href='www.google.com'>Link<p>Error link</a> 

The element named in the message is not allowed to appear in the context in which you placed it; the other elements listed are the only ones that are both allowed there and able to contain it. This may mean that you need a containing element, or that you forgot to close a previous element.

One possible cause of this message is that you tried to place a block-level element (for example, "<p>" or "<table>") inside an inline element (for example, "<a>", "<span>" or "<font>").

There is no standard way to repair faulty HTML, and each parser makes its own best effort. If you want reproducible results for invalid HTML, you must stick to exactly the same version of the same parser.

0