Using XPath to extract XOM elements from documents with unnecessary namespaces

Question

I am trying to parse the HTML returned by an external system using XOM. The HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <body>
    <div>
      Help I am trapped in a fortune cookie factory
    </div>
  </body>
</html>

(The real HTML is considerably messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the sample above shows the same problem as the real HTML.)

What I want to do is extract the contents of the <div>, but the namespace declaration seems to confuse XPath. If I strip out the namespace declaration (manually, from the file), the following code finds the <div>, no problem:

 Document document = ...
 Nodes divs = document.query("//div");

But with the namespace declaration in place, the returned Nodes has size 0.

Well, what about stripping the namespace programmatically?

 Element rootElement = document.getRootElement();
 rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());

... this looks like it should work, but it does nothing. From the Javadoc:

This method only removes additional namespaces added with addNamespaceDeclaration.

OK, I thought, I'll provide the namespace in the query:

 XPathContext context = XPathContext.makeNamespaceContext(document.getRootElement());
 Nodes divs = document.query("//div", context);

The size is still zero.

How about creating a namespace context manually?

 XPathContext context = new XPathContext(
     rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
 Nodes divs = document.query("//div", context);

The XPathContext constructor explodes:

 nu.xom.NamespaceConflictException: XPath expressions do not use the default namespace

So I'm looking for:

a way to make this query work, or
a way to strip the namespace declarations programmatically, or
an explanation of the right approach, assuming both of these are wrong.

Update: Based on Lev Levitsky's answer and the Jaxen FAQ, I came up with the following hack:

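 // "foo" is an arbitrary prefix bound to the document's default namespace
 XPathContext context = new XPathContext(
     "foo", document.getRootElement().getNamespaceURI());
 // the context must be passed to query(), or the prefix is unbound
 Nodes divs = document.query("//foo:div", context);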

It still seems a little crazy to me, but I guess this is what Jaxen wants you to do.

Update No. 2: As pointed out below and elsewhere on the Internet, this is not a Jaxen quirk; it's just XPath being XPath.

So, while this hack works, I would still like to strip the namespace declaration, preferably without resorting to XSLT.
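
The claim in Update No. 2, that this is XPath 1.0 behavior rather than something specific to Jaxen or XOM, can be checked with a different engine. The following is a minimal sketch that uses the JDK's built-in javax.xml.xpath API instead of XOM, so it runs without extra jars. The class name XPathNamespaceDemo, the count helper, and the "foo" prefix are illustrative assumptions, and the DOCTYPE is left out of the sample so the parser does not try to fetch the DTD.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath engine, not XOM/Jaxen.
class XPathNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the sample in the question, minus the DOCTYPE.
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body>"
        + "</html>";

    /** Counts nodes matched by the expression, with prefix "foo" bound to XHTML. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace awareness is off by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "foo".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return null; // not needed for evaluating expressions
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count(SAMPLE, "//div"));     // prints 0
        // A prefixed name test matches once the prefix is bound to the URI.
        System.out.println(count(SAMPLE, "//foo:div")); // prints 1
    }
}
```

The prefix used in the expression does not have to match anything in the document; only the namespace URI it is bound to matters, which is why an arbitrary prefix such as "foo" works.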

xpath xml-namespaces xom

David Moles Mar 12 '12 at 19:33

2 answers

You can write:

 Nodes divs = document.query("//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']");
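
If the same pattern is needed for several element names, the expression can be assembled by a small helper instead of being typed out each time. This is only a sketch: the class NamespacedXPath and its methods are hypothetical names, not part of XOM, and it only builds the query string.

```java
// Hypothetical helper; builds XPath 1.0 strings and does not depend on XOM.
class NamespacedXPath {

    /** Builds an expression matching any element with the given local name and namespace URI. */
    static String anywhere(String localName, String namespaceUri) {
        return "//*[local-name()=" + literal(localName)
            + " and namespace-uri()=" + literal(namespaceUri) + "]";
    }

    /** Quotes a value as an XPath 1.0 string literal (XPath 1.0 has no escape sequences). */
    static String literal(String value) {
        if (!value.contains("'")) {
            return "'" + value + "'";
        }
        if (!value.contains("\"")) {
            return "\"" + value + "\"";
        }
        throw new IllegalArgumentException(
            "value contains both quote characters: " + value);
    }

    public static void main(String[] args) {
        System.out.println(anywhere("div", "http://www.w3.org/1999/xhtml"));
    }
}
```

The resulting string can then be passed to document.query(...) exactly as in the line above, with no XPathContext needed.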

peter.murray.rust Apr 2 '13 at 23:17

Lev Levitsky · Accepted Answer · 2012-03-12T20:16:22+0000

You should either specify a namespace directly with something like

 Nodes divs = document.query("//{http://www.w3.org/1999/xhtml}div");

or use prefixes that map to the corresponding namespaces (I think NamespaceContext is used for this, but there are no prefixes in your query).

Unfortunately, I don't know how this is implemented in Java, but I can provide a Python example if that helps.
