Groovy - NekoHTML Sax parser

Question

I'm having trouble working with the NekoHTML parser. It works fine on URLs, but when I test it on a simple XML string, it doesn't parse it correctly.

Here's how I declare it:

    def createAndSetParser() {
        SAXParser parser = new SAXParser()       // Default NekoHTML SAX parser
        def charset = "Windows-1252"             // The encoding of the page
        def tagFormat = "upper"                  // Writes all tags consistently in upper case. Can be "lower", "upper" or "match"
        def attrFormat = "lower"                 // Same thing for attributes. Can be "upper", "lower" or "match"
        Purifier purifier = new Purifier()       // A purifier, used to clean the incoming HTML
        XMLDocumentFilter[] filter = [purifier]  // A filter chain containing the purifier (NekoHTML feature)
        parser.setProperty("http://cyberneko.org/html/properties/filters", filter)
        parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset)
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat)
        parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat)
        parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)  // Forces the parser to use the charset we provided
        parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)     // Leaves the doctype as it is
        parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)  // Makes sure no namespace is added or overridden
        parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
        return new XmlSlurper(parser)  // A Groovy parser that does not load the whole tree structure, but rather supplies only the information it is asked for
    }

Again, it works very well when I use it on websites. Any idea why it doesn't work on plain XML text?

Any help greatly appreciated :)

html xml groovy saxparser

Alexandre Bourlier Aug 16 '11 at 15:03

1 answer

stefanglase · Answer 1 · 2011-12-20T20:01:04+0000

I made your script executable in the Groovy console so it is easy to try out, using Grape to fetch the required NekoHTML library from the Maven Central repository.

    @Grapes(
        @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
    )
    import groovy.xml.StreamingMarkupBuilder
    import org.apache.xerces.xni.parser.XMLDocumentFilter
    import org.cyberneko.html.parsers.SAXParser
    import org.cyberneko.html.filters.Purifier

    def createAndSetParser() {
        SAXParser parser = new SAXParser()
        parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
        parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
        parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
        parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
        parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
        parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
        parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
        return new XmlSlurper(parser)
    }

    def printResult(def gPathResult) {
        println new StreamingMarkupBuilder().bind { out << gPathResult }
    }

    def parser = createAndSetParser()
    printResult parser.parseText('<html><body>Hello World</body></html>')
    printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')

When I run this script, the two printResult statements produce the output below. It explains your problem with the XML string: the parsed content is wrapped in <html><body>...</body></html> tags and loses its root tag <house/>:

    <HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
    <HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>

This is caused by the feature http://cyberneko.org/html/features/balance-tags, which you enabled in your script. If I disable this feature (it must be set to false explicitly, since it defaults to true), the results look like this:

    <HTML><BODY>Hello World</BODY></HTML>
    <HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>
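As a minimal sketch of that change, the script above could take the balance-tags setting as an argument. The helper name createParser and its balanceTags parameter are made up for illustration; everything else is the configuration from the script above. It assumes an XmlSlurper that is available without an import, as in the original script (on Groovy 4 you would probably need to add import groovy.xml.XmlSlurper).

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier

// Hypothetical helper: same settings as createAndSetParser() above,
// but the balance-tags feature is passed in instead of hard-coded.
def createParser(boolean balanceTags) {
    SAXParser parser = new SAXParser()
    parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
    parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
    parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
    parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
    parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
    parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
    // Must be set explicitly: the feature defaults to true
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", balanceTags)
    return new XmlSlurper(parser)
}

def house = createParser(false).parseText('<house><room>bedroom</room><room>kitchen</room></house>')

// With balancing off, the root element should be the (upper-cased) <house> tag
// rather than a synthesized <HTML> wrapper.
println house.name()
house.ROOM.each { println it.text() }
```

Keeping the flag as a parameter lets the same code serve real web pages (balancing on) and plain XML fragments (balancing off). For input that is already well-formed XML, a plain XmlSlurper without NekoHTML may be the simpler choice.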
