Fix unclosed tags in HTML, or parse with an HTML parser, for an XSLT transformation

I have HTML that is the result of an XSLT transformation (XML -> HTML).

I want to run another XSLT transformation on that HTML result (HTML -> HTML).

My problem is that the first transformation may return unclosed tags such as "<img>", which means that I cannot parse the HTML result with DocumentBuilder, because it uses a SAX parser and my HTML is not always valid XML. (I get an exception saying that element XY must be terminated by a matching end tag.)
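To make the failure concrete, here is a minimal sketch using only the JDK. The class and method names (UnclosedTagDemo, isWellFormed) are made up for illustration and are not part of my actual code. An unclosed "<img>" is rejected by DocumentBuilder, while the self-closed form parses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not taken from the original code.
public class UnclosedTagDemo {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    public static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            // Thrown for markup such as an <img> without an end tag.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML-style void element: not well-formed XML.
        System.out.println(isWellFormed("<p><img src=\"a.png\"></p>"));
        // Self-closed element: well-formed XML.
        System.out.println(isWellFormed("<p><img src=\"a.png\"/></p>"));
    }
}
```

The first call prints false and the second prints true. The default parser may also log the fatal error to stderr.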

I guess there are two solutions.

  • Either fix the HTML result by closing the unclosed tags.

  • Or use some kind of HTML parser to get a valid org.w3c.dom.Document and bypass strict XML parsers such as SAX.

I would really like to use basically the same approach that I used for the first transformation, so I would prefer one of the solutions above. However, I could not find any obvious third-party libraries that can help (although I looked). So what are my options? Are there any solutions to this problem?

Any help would be greatly appreciated.

+4
3 answers

TagSoup - Just Keep On Truckin'

You can use TagSoup to turn any HTML document into well-formed markup.

... a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you use Saxon, you can make TagSoup your parser by adding the following option:

... you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

I have used this to parse and transform HTML documents in a single pass and found that it works great. It reads the document as if it were well-formed XHTML, which is then available for manipulation and transformation with XML tools.
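If you drive the transformation from Java rather than the Saxon command line, the same idea can be sketched with the standard JAXP API: hand the transformer a SAXSource built around whichever XMLReader you choose. The class and method names below (PluggableReaderTransform, transform) are illustrative, not from the question's code. The demo uses the JDK's own SAX parser so that it runs without extra jars. The assumption is that, with TagSoup on the classpath, you would pass new org.ccil.cowan.tagsoup.Parser() instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, shown only as a sketch.
public class PluggableReaderTransform {

    /** Applies the stylesheet to the input, parsing the input with the given SAX reader. */
    public static String transform(XMLReader reader, String input, String xslt)
            throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        // The SAXSource decides which parser reads the input document.
        SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(input)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: with TagSoup available, use instead
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Counts img elements; local-name() is used so the match does not
        // depend on whether the parser puts elements in a namespace.
        String xslt =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:value-of select=\"count(//*[local-name()='img'])\"/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

        String input = "<p><img src=\"a.png\"/><img src=\"b.png\"/></p>";
        System.out.println(transform(reader, input, xslt));
    }
}
```

Because only the XMLReader changes, the rest of the pipeline can stay the same as in the first XML -> HTML step. If you need an org.w3c.dom.Document rather than a string, a javax.xml.transform.dom.DOMResult can be used in place of the StreamResult.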

In addition, Taggle, a C++ port of TagSoup, is now available.

+4

You need jsoup: Java HTML Parser. It can output tidy HTML.

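    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    String html = "<p>The recurrence, in close succession <ul><li>list item 1</li><li>list item 2</li></ul> second part of thisssss";
    // Keep only the tags and attributes allowed by the relaxed whitelist.
    String clean = Jsoup.clean(html, Whitelist.relaxed());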

You can also use one of the other Whitelist presets, such as Whitelist.basic().

+5

You need to tidy up your markup. Try this library:

http://jtidy.sourceforge.net/

0