I made your script runnable in the Groovy console so it is easy to try, using Grape to fetch the required NekoHTML library from the Maven Central repository.
    @Grapes(
        @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
    )
    import groovy.xml.StreamingMarkupBuilder
    import org.apache.xerces.xni.parser.XMLDocumentFilter
    import org.cyberneko.html.parsers.SAXParser
    import org.cyberneko.html.filters.Purifier

    def createAndSetParser() {
        SAXParser parser = new SAXParser()
        parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
        parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
        parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
        parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
        parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
        parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
        parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
        return new XmlSlurper(parser)
    }

    def printResult(def gPathResult) {
        println new StreamingMarkupBuilder().bind { out << gPathResult }
    }

    def parser = createAndSetParser()
    printResult parser.parseText('<html><body>Hello World</body></html>')
    printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')
When I run this script, the two printResult statements produce the output below. It explains the problem you see when parsing the XML string: the content is wrapped in <html><body>...</body></html> tags and loses its root tag <house/>:
    <HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
    <HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>
All of this is caused by the feature http://cyberneko.org/html/features/balance-tags, which you enabled in your script. If I disable this feature (it must be set to false explicitly, because it is true by default), the results look like this:
    <HTML><BODY>Hello World</BODY></HTML>
    <HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>
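As a minimal sketch of that change: the only thing that matters is the value passed for balance-tags. The helper name createNonBalancingParser is hypothetical, and the other properties and the Purifier filter from the script are left out for brevity, so the exact output may differ slightly from the results shown above.

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
import org.cyberneko.html.parsers.SAXParser

// Hypothetical helper: createAndSetParser() reduced to the one feature
// that decides whether the input is wrapped in <html><body>.
def createNonBalancingParser() {
    SAXParser parser = new SAXParser()
    // The feature is on by default, so it has to be switched off explicitly.
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", false)
    // Assumption: on newer Groovy versions XmlSlurper may need
    // "import groovy.xml.XmlSlurper" instead of the default import.
    return new XmlSlurper(parser)
}

def house = createNonBalancingParser()
        .parseText('<house><room>bedroom</room><room>kitchen</room></house>')

// With tag balancing off, the root element should be the house element
// itself rather than an injected HTML wrapper.
println house.name()
println house.children()*.text()
```

If you still need tag balancing for real HTML input, one option is to keep two parsers, one for HTML and one for plain XML fragments, rather than toggling the feature on a shared instance.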