Performing XML retrievals using XSLT without having to read the entire DOM tree into memory?

Question

I have a situation where I want to extract some information from very large, but regular XML files (I just needed to do this with a 500 MB file) and where XSLT would be ideal.

Unfortunately, the XSLT implementations I know of (except the most expensive version of Saxon) cannot read only the necessary part of the DOM; they read in the entire tree. This causes the computer to swap itself to death.

The XPath in question is

    //m/e[contains(.,'foobar')]

so it is essentially just a grep.

Is there an XSLT implementation that can do this? Or an XSLT implementation that, given suitable "hints", can prune parts of the tree from memory once they are no longer needed?

I would prefer a Java implementation, but both Windows and Linux are viable native platforms.

EDIT: The input XML is as follows:

    <log>
    <!-- Fri Jun 26 12:09:27 CEST 2009 -->
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Registering Catalina:type=Manager,path=/axsWHSweb-20090626,host=localhost</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Force random number initialization starting</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Getting message digest component for algorithm MD5</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Completed getting message digest component</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>getDigest() 0</m></e>
    ......
    </log>

Essentially I want to select some m-nodes (and I know the XPath above is wrong for that; it was just a quick hack), but keep the XML layout.

EDIT: It seems that STX may be what I am looking for (I can live with a different transformation language), and that Joost is an implementation of it. Any experiences?

EDIT: I found that Saxon 6.5.4 with -Xmx1500m can load my XML, which lets me use my XPaths right now. That is just a lucky break, so I would still like to solve this problem in the general case. That means something scriptable, which in turn means no hand-written Java filtering.

EDIT: Oh, by the way: this is a log file very similar to what log4j's XMLLayout generates. The reason for using XML is to be able to do exactly this, namely run queries against the log. This is an initial attempt, hence the simple query. Later I would like to ask more complex questions, so I need a query language that can cope with an input file of this size.

java xml xslt streaming stx

Thorbjørn Ravn Andersen Dec 17 '09 at 13:42

10 answers

Balusc · Answer 1 · 2009-12-17T13:56:41+0000

Consider VTD-XML. It is much more memory-efficient. You can find the API here and the benchmarks here.

Note that the last graph there shows that DOM needs at least 5 times as much memory as the size of the XML file. That is really remarkable, isn't it?

As a bonus, it is also faster at parsing and XPath evaluation than DOM and the JDK:

[Benchmark chart not preserved; source: sourceforge.net]

Chris dail · Answer 2 · 2009-12-17T13:46:34+0000

You should be able to implement this without a full scan of the document. The // operator means "find an element at any level of the tree". It is very expensive to evaluate, especially on a document of your size. If you optimize your XPath query or set up match templates instead, the XSLT transformer may not need to load the entire document into memory.

Based on your XML sample, you want to match /log/e/m[... predicate ...]. Some XSLT processors should be able to optimize that so they do not scan the complete document, whereas with // they cannot.
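As a rough illustration of the explicit-path form, the sketch below runs a small stylesheet through the JDK's JAXP API. The class name `MatchTemplateSketch`, the `needle` parameter and the sample input are made up for illustration, and the stylesheet assumes the `log`/`e`/`m` layout from the question. Whether a particular processor actually avoids building the full tree for such a stylesheet is processor-specific and is not demonstrated here.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

class MatchTemplateSketch {
    // Hypothetical stylesheet: explicit paths only, no "//".
    // Keeps the <log> wrapper and copies each <e> whose <m> contains $needle.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:param name='needle'/>"
      + "<xsl:template match='/log'>"
      + "<log><xsl:apply-templates select='e'/></log>"
      + "</xsl:template>"
      + "<xsl:template match='e'>"
      + "<xsl:if test='contains(m, $needle)'><xsl:copy-of select='.'/></xsl:if>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML text and returns the result as a string. */
    static String filter(String xml, String needle) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setParameter("needle", needle);
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><e h='1'><m>Registering Catalina</m></e>"
                   + "<e h='2'><m>getDigest() 0</m></e></log>";
        System.out.println(filter(xml, "getDigest"));
    }
}
```

The test is written with `xsl:if` rather than a predicate in the match pattern because XSLT 1.0 does not allow variable references inside match patterns.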

Since your XML document is fairly simple, it may be easier not to use XSLT at all. StAX is a great streaming API for processing large XML documents. dom4j also has good XPath query support that works with large documents. Information on using dom4j for large documents is here: http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

An example from the source above:

    SAXReader reader = new SAXReader();
    reader.addHandler( "/ROWSET/ROW",
        new ElementHandler() {
            public void onStart(ElementPath path) {
                // do nothing here...
            }
            public void onEnd(ElementPath path) {
                // process a ROW element
                Element row = path.getCurrent();
                Element rowSet = row.getParent();
                Document document = row.getDocument();
                ...
                // prune the tree
                row.detach();
            }
        }
    );

    Document document = reader.read(url);

    // The document will now be complete but all the ROW elements
    // will have been pruned.
    // We may want to do some final processing now
    ...

Robert Christie · Answer 3 · 2009-12-17T14:27:23+0000

The Enterprise Edition of the Saxon XSLT processor supports streaming of large documents for exactly this kind of problem.

Bulat · Answer 4 · 2013-01-14T12:14:27+0000

I had the same problem and didn’t want to write Java code. I was able to solve this problem using STX via Joost.

According to the spec:

the STX process can split a large XML document into smaller fragments, transfer each of these fragments to an external filter (for example, an XSLT processor), and combine the results into a large result XML document.

This is exactly what I needed. The largest XML file I have is 1.5 GB, and I already had an XSLT stylesheet to process it. With the free version of Saxon, processing consumed more than 3 GB. Joost took less than 90 MB.

My XML file contains a large list of products, each of which has a complex XML structure. I therefore did not want to re-implement the XSLT in STX; I just wanted to split the processing up per product, applying the same XSLT to each product.

Here are the details of the code; I hope they will be useful to someone.

Source XSLT file (this was the first XSLT I implemented, so sorry for the poor use of for-each statements):

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="2.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:fn="http://www.w3.org/2005/xpath-functions">
        <xsl:template match="/">
            <xsl:for-each select="Products/Product">
                <!-- Some XSL statements relative to "Product" element -->
            </xsl:for-each>
        </xsl:template>
    </xsl:stylesheet>

I converted it to the following STX:

    <?xml version="1.0" encoding="UTF-8"?>
    <stx:transform version="1.0"
            output-method="text" output-encoding="UTF-8"
            xmlns:stx="http://stx.sourceforge.net/2002/ns">
        <stx:buffer name="xslt-product">
            <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema"
                    xmlns:fn="http://www.w3.org/2005/xpath-functions">
                <xsl:template match="Product">
                    <!-- The same XSL statements relative to "Product" element -->
                </xsl:template>
            </xsl:stylesheet>
        </stx:buffer>
        <stx:template match="/">
            <stx:process-children />
        </stx:template>
        <stx:template match="Product">
            <stx:process-self
                filter-method="http://www.w3.org/1999/XSL/Transform"
                filter-src="buffer(xslt-product)" />
        </stx:template>
    </stx:transform>

When running Joost, I still had to add the Saxon libraries, because my XSLT uses functions and therefore needs XSLT 2.0 support. The resulting command to run the transformation was:

 java -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl -cp joost.jar:commons-discovery-0.5.jar:commons-logging-1.1.1.jar:saxon9he.jar net.sf.joost.Main my-source.xml my-convert.stx

The bottom line is that now I can run the conversion on low memory servers without implementing any Java code or re-implementing the original XSLT rules!

Carl Smotricz · Answer 5 · 2009-12-17T13:50:46+0000

This is a shot in the dark, and perhaps you will laugh at me.

Nothing prevents you from connecting a SAX source to the input of your XSLT; and, at least in theory, it should be easy enough to do your grep on a SAX stream without needing a DOM. So... want to give it a try?
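A minimal sketch of the "grep on a SAX stream" half of that idea, using only the JDK's SAX parser. The class name `SaxGrepSketch` and the method `grep` are invented for this example, and it assumes the flat `log`/`e`/`m` layout from the question. It collects the text of each `m` element and keeps only those that contain the search string, so no tree is ever built.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SaxGrepSketch {
    /** Returns the text of every <m> element that contains needle, without building a DOM. */
    static List<String> grep(InputSource in, String needle) throws Exception {
        List<String> hits = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an <m> element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("m".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text node in several chunks, so append.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("m".equals(qName)) {
                    if (text.indexOf(needle) >= 0) hits.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return hits;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><e h='1'><m>foo</m></e><e h='2'><m>a foobar b</m></e></log>";
        System.out.println(grep(new InputSource(new StringReader(xml)), "foobar"));
    }
}
```

For a real log file, the `InputSource` would wrap a file stream instead of a string. Memory use then depends on the number of hits collected, not on the size of the input.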

bill seacham · Answer 6 · 2009-12-17T14:13:27+0000

Try the CAX parser from xponentsoftware. It is a fast XML parser built on Microsoft's XmlReader. It gives you the full path as it parses each element, so you can check whether the path is "m/e" and then check whether the text node contains "foo".

Robert rossney · Answer 7 · 2009-12-18T07:19:50+0000

I am not a Java guy, and I don't know whether the tools I would use for this in .NET have analogues in the Java world.

To solve this problem in .NET, I would derive a class from XmlReader and have it return only the elements I am interested in. I could then use that XmlReader as the input to any XML object, such as XmlDocument or XslCompiledTransform. The XmlReader subclass essentially preprocesses the input stream, making it look like a much smaller XML document to whatever class is reading it.

It seems that the technique described here is similar. But, as I say, I am not a Java guy.

xcut · Answer 8 · 2009-12-18T13:29:05+0000

STX includes a streaming-friendly subset of XPath, which I think is called STXPath; I ought to remember, because I co-authored the specification :-)

You can certainly pick up Joost and extract the relevant bits, but note that STX has not been widely adopted in the industry, so you should do some due diligence on the current stability and support of the tool.

David Roussel · Answer 9 · 2010-06-15T09:26:35+0000

You can do this with STX / Joost, as already suggested, but note that many XSLT implementations have a SAX streaming mode and do not need to hold everything in memory. You just need to make sure your XSLT does not look along any of the wrong axes.

However, if I were you and really wanted performance, I would do it in StAX. It is simple, standard and fast. It comes out of the box in Java 6, although you can also use Woodstox for a slightly better implementation.

For the XPath you gave, the implementation is trivial. The downside is that you have more code to maintain, and it is not as expressive and high-level as XPath in Joost or XSLT.
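A hedged sketch of what such a StAX filter might look like, using the event API that ships with the JDK. The class name `StaxFilterSketch` and the method `filter` are invented here, and the code assumes the layout from the question: `e` elements are not nested, and each contains an `m` element. It buffers one `e` element at a time and writes it out only if its `m` text contains the search string, so the surrounding XML layout is preserved.

```java
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

class StaxFilterSketch {
    /** Copies the document to out, keeping only <e> elements whose <m> text contains needle. */
    static void filter(Reader in, Writer out, String needle) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        List<XMLEvent> buffer = null;  // events of the current <e>, or null outside <e>
        StringBuilder mText = null;    // text of the current <m>, or null outside <m>
        boolean keep = false;          // whether the current <e> has a matching <m>

        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            if (ev.isStartElement()) {
                String name = ev.asStartElement().getName().getLocalPart();
                if (name.equals("e")) {
                    buffer = new ArrayList<>();
                    keep = false;
                } else if (name.equals("m") && buffer != null) {
                    mText = new StringBuilder();
                }
            } else if (ev.isCharacters() && mText != null) {
                mText.append(ev.asCharacters().getData());
            }

            if (buffer == null) {
                writer.add(ev);        // outside <e>: pass through unchanged
                continue;
            }
            buffer.add(ev);
            if (ev.isEndElement()) {
                String name = ev.asEndElement().getName().getLocalPart();
                if (name.equals("m") && mText != null) {
                    keep |= mText.indexOf(needle) >= 0;
                    mText = null;
                } else if (name.equals("e")) {
                    if (keep) {
                        for (XMLEvent buffered : buffer) writer.add(buffered);
                    }
                    buffer = null;     // release the buffered events either way
                }
            }
        }
        writer.flush();
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><e h='1'><m>Registering</m></e><e h='2'><m>getDigest() 0</m></e></log>";
        StringWriter out = new StringWriter();
        filter(new StringReader(xml), out, "getDigest");
        System.out.println(out);
    }
}
```

Because only the events of a single `e` element are held at any time, memory use should stay roughly constant regardless of file size.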

sahilsahadevan · Answer 10 · 2019-03-08T19:23:56+0000

Write an XSLT that returns output in your preferred XML layout, containing only the values you need from the large XML files.

However, if you want to process the values further in Java, then either:

convert this simple XML to a POJO and read the values (preferred option), or
use a regex to retrieve the values

Example of using StreamSource to run XML through an XSLT:

Imports used:

    import javax.xml.transform.Source;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;
    import java.io.StringReader;
    import java.io.StringWriter;
    // Spring Framework class, needed for ClassPathResource in the code below
    import org.springframework.core.io.ClassPathResource;

The code:

    String xmlStr = "<A><b>value</b><c>value</c></A>";

    File xslt = new ClassPathResource("xslt/Transformer.xslt").getFile();

    Source xsltSource = new StreamSource(xslt);
    Source xmlSource = new StreamSource(new StringReader(xmlStr));

    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer(xsltSource);

    StringWriter stringWriter = new StringWriter();
    transformer.transform(xmlSource, new StreamResult(stringWriter));

    String response = stringWriter.toString();
