I'm trying to parse HTML returned by an external system, using XOM. The HTML looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <body>
    <div>
      Help I am trapped in a fortune cookie factory
    </div>
  </body>
</html>
(Actually, the real HTML is significantly messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the HTML above exhibits the same problem as the real HTML.)
What I want to do is extract the contents of the <div>, but the namespace declaration seems to confuse XPath. If I strip out the namespace declaration (manually, from the file), the following code finds the <div>, no problem:
Document document = ...
Nodes divs = document.query("//div");
But with the namespace declaration in place, the returned Nodes has size 0.
Well, what about stripping the namespace declaration programmatically?
Element rootElement = document.getRootElement();
rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());
... looks like it should work, but it does nothing. From the Javadoc:
This method only removes additional namespaces added with addNamespaceDeclaration.
OK, I thought, I'll supply the namespace in the query:
XPathContext context = XPathContext.makeNamespaceContext(document.getRootElement());
Nodes divs = document.query("//div", context);
The size is still zero.
How about creating a namespace context manually?
XPathContext context = new XPathContext(
    rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
Nodes divs = document.query("//div", context);
The XPathContext constructor explodes:
nu.xom.NamespaceConflictException: XPath expressions do not use the default namespace
So I'm looking for:
- a way to make this query work, or
- a way to programmatically strip the namespace declaration, or
- an explanation of the right approach, assuming both of these are wrong-headed.
Update: Based on Lev Levitsky's answer and the Jaxen FAQ, I came up with the following hack:
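// Bind an arbitrary prefix ("foo") to the document's default namespace URI
XPathContext context = new XPathContext(
    "foo", document.getRootElement().getNamespaceURI());
// The context must be passed to query(), or the "foo" prefix is unbound
Nodes divs = document.query("//foo:div", context);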
It still seems a little crazy to me, but I guess it's what Jaxen wants you to do.
Update 2: As pointed out below and around the Internet, this is not a Jaxen bug; it's just XPath being XPath.
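For anyone who wants to see this outside XOM and Jaxen, here is a minimal sketch that reproduces the same behavior with the JDK's built-in javax.xml.xpath API. Swapping in the JDK API is my own assumption for illustration, not something from the answers; the class name XPathNamespaceDemo, the count helper, and the "foo" prefix are hypothetical, and the DOCTYPE is left out of the sample so the parser doesn't try to fetch the DTD.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath API rather than XOM (assumption).
class XPathNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the HTML in the question, minus the DOCTYPE.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body>"
        + "</html>";

    // Returns how many nodes the XPath expression selects in the given XML.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespaces are ignored
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the arbitrary prefix "foo" to the XHTML namespace URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "foo".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "foo" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("foo").iterator()
                        : Collections.<String>emptyIterator();
                }
            });

            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // An unprefixed name in XPath 1.0 means "in no namespace", so no match.
        System.out.println(count(SAMPLE, "//div"));     // prints 0
        // A prefixed name bound to the XHTML namespace matches the <div>.
        System.out.println(count(SAMPLE, "//foo:div")); // prints 1
    }
}
```

The prefix used in the expression doesn't have to match anything in the document (which declares no prefix at all); only the namespace URI it is bound to matters.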
So, while this hack works, I'd still like a way to strip the namespace declaration. Preferably short of XSLT.