Using XPath to extract XOM elements from documents with unnecessary namespaces

I am trying to parse the HTML returned by an external system using XOM. The HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <body>
    <div>
      Help I am trapped in a fortune cookie factory
    </div>
  </body>
</html>

(The real HTML is significantly messier, but it has the same DOCTYPE, namespace, and language declarations, and the HTML above reproduces the same problem.)

What I want to do is extract the contents of the <div>, but the namespace declaration seems to confuse XPath. If I strip out the namespace declaration (manually, from the file), the following code finds the <div> with no problem:

 Document document = ...
 Nodes divs = document.query("//div");

But with the namespace declaration in place, the returned Nodes has size 0.

Well, what about removing the namespace programmatically?

 Element rootElement = document.getRootElement();
 rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());

... which looks like it should work, but does nothing. From the Javadoc:

This method only removes additional namespaces added with addNamespaceDeclaration.

OK, I thought, I'll supply the namespace in the query:

 XPathContext context = XPathContext.makeNamespaceContext(document.getRootElement());
 Nodes divs = document.query("//div", context);

The size is still zero.

How about creating a namespace context manually?

 XPathContext context = new XPathContext(
     rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
 Nodes divs = document.query("//div", context);

The XPathContext constructor explodes:

 nu.xom.NamespaceConflictException: XPath expressions do not use the default namespace 

So I'm looking for:

  • a way to make this query work, or
  • a way to strip the namespace declaration programmatically, or
  • an explanation of the correct approach, if both of those are wrong.

Update: Based on the answer of Lev Levitsky and the Jaxen FAQ I came up with the following hack:

 XPathContext context = new XPathContext(
     "foo", document.getRootElement().getNamespaceURI());
 Nodes divs = document.query("//foo:div", context);

It still seems a little crazy to me, but I guess this is what Jaxen wants you to do.


Update No. 2: As indicated below and across the Internet, this is not a Jaxen bug; it's just XPath being XPath.
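
One way to check that this is XPath behaviour and not something specific to XOM or Jaxen is to run the same two queries through a different engine. The sketch below swaps in the JDK's built-in javax.xml.xpath API in place of XOM, so none of it is XOM code. The class name DefaultNamespaceDemo, the helper prefixContext, and the inline document string are all made up for illustration, and the DOCTYPE is left out so the parser does not try to fetch the DTD.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath engine, not XOM/Jaxen.
class DefaultNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the document in the question, minus the DOCTYPE.
    static final String XHTML =
        "<html xmlns=\"" + XHTML_NS + "\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Binds one arbitrarily chosen prefix to a namespace URI.
    static NamespaceContext prefixContext(final String prefix, final String uri) {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String p) {
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String u) {
                return uri.equals(u) ? prefix : null;
            }
            @Override public Iterator<String> getPrefixes(String u) {
                return uri.equals(u)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
            }
        };
    }

    // Counts the nodes selected by the expression; context may be null.
    static int count(String expression, NamespaceContext context) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace awareness is off by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (context != null) {
                xpath.setNamespaceContext(context);
            }
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count("//div", null)); // prints 0
        // Any prefix works as long as it is bound to the XHTML namespace URI.
        System.out.println(count("//foo:div", prefixContext("foo", XHTML_NS))); // prints 1
    }
}
```

The prefix in the query does not have to match anything in the document; only the namespace URI it is bound to matters, which is why "foo" works even though the document itself uses a default namespace.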

So, while this hack works, I would still like to remove the namespace declaration, preferably without resorting to XSLT.

2 answers

You should either specify a namespace directly with something like

 Nodes divs = document.query("//{http://www.w3.org/1999/xhtml}div"); 

or use prefixes that map to the corresponding namespaces (I think NamespaceContext is used for this, but there are no prefixes in your query).

Unfortunately, I don't know how this is implemented in Java, but I can provide a Python example if that helps.


You can write:

 Nodes divs = document.query("//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']"); 
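
Because this expression has no prefixes in it, it needs no namespace context at all. Here is a minimal sketch of it pulling out the text of the <div>, run through the JDK's built-in javax.xml.xpath API in place of XOM so that it is self-contained. The class name LocalNameDemo, the helper firstText, and the inline document string are hypothetical, and the DOCTYPE is omitted so the parser does not try to download the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath engine, not XOM/Jaxen.
class LocalNameDemo {
    // Same shape as the document in the question, minus the DOCTYPE.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Returns the trimmed string value of the first node the expression selects.
    static String firstText(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-uri() to see the namespace
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // No NamespaceContext is set: the expression contains no prefixes.
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expression =
            "//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']";
        System.out.println(firstText(expression));
        // prints: Help I am trapped in a fortune cookie factory
    }
}
```

The trade-off is verbosity: every step of a longer path needs its own local-name() predicate, whereas a bound prefix keeps the expression short.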
