I have a situation where I want to extract some information from very large, but regular XML files (I just needed to do this with a 500 MB file) and where XSLT would be ideal.
Unfortunately, the XSLT implementations I am aware of (except the most expensive edition of Saxon) cannot read in only the necessary part of the DOM; they read in the whole tree. This causes the computer to swap to death.
The XPath in question is
//m/e[contains(.,'foobar')]
so it is essentially just a grep.
Is there an XSLT implementation that can do this? Or an XSLT implementation that, given a suitable "hint", can prune away the parts in memory that will not be needed again?
I would prefer a Java implementation, but both Windows and Linux are viable native platforms.
EDIT: The input XML is as follows:
<log>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Registering Catalina:type=Manager,path=/axsWHSweb-20090626,host=localhost</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Force random number initialization starting</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Getting message digest component for algorithm MD5</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>Completed getting message digest component</m></e>
    <e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
        <m>getDigest() 0</m></e>
    ......
</log>
Essentially I want to select some m-nodes (and I know the XPath is wrong for that, it was just a quick hack), but keep the XML layout.
EDIT: It seems that STX may be what I am looking for (I can live with another transformation language), and that Joost is an implementation of it. Any experiences?
EDIT: I found that Saxon 6.5.4 with -Xmx1500m can load my XML, so this allowed me to use my XPaths right now. That is just a lucky break, so I would still like to solve this problem in the general case. This means it has to be scriptable, which in turn means no hand-written Java filtering first.
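For comparison, the kind of hand-written Java filtering I would like to avoid looks roughly like the sketch below. It uses the JDK's StAX event API, buffers one `e` entry at a time, and writes the entry out only if its `m` text contains the search string, so memory use stays bounded by the size of a single entry. The class name `LogGrep` and the method `filter` are made up for illustration, and I am assuming that `e` elements are not nested, as in the sample above.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library: a streaming "grep" over the log.
public class LogGrep {

    private static boolean isStart(XMLEvent ev, String name) {
        return ev.isStartElement()
                && name.equals(ev.asStartElement().getName().getLocalPart());
    }

    private static boolean isEnd(XMLEvent ev, String name) {
        return ev.isEndElement()
                && name.equals(ev.asEndElement().getName().getLocalPart());
    }

    /** Copies the log, keeping only <e> entries whose <m> text contains needle. */
    public static void filter(Reader in, Writer out, String needle)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> entry = null;               // events of the current <e>, null outside one
        StringBuilder message = new StringBuilder(); // text of the current <m>
        boolean inMessage = false;

        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();

            if (isStart(ev, "e")) {                // a new entry begins: start buffering
                entry = new ArrayList<>();
                message.setLength(0);
            }
            if (entry == null) {                   // outside any <e>: pass through unchanged
                writer.add(ev);
                continue;
            }

            entry.add(ev);
            if (isStart(ev, "m")) {
                inMessage = true;
            } else if (isEnd(ev, "m")) {
                inMessage = false;
            } else if (inMessage && ev.isCharacters()) {
                message.append(ev.asCharacters().getData());
            } else if (isEnd(ev, "e")) {           // entry complete: keep it or drop it
                if (message.indexOf(needle) >= 0) {
                    for (XMLEvent buffered : entry) {
                        writer.add(buffered);
                    }
                }
                entry = null;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

Called as `LogGrep.filter(new FileReader("in.xml"), new FileWriter("out.xml"), "foobar")`, this keeps the surrounding `log` element and the matching entries. It works, but every new question about the log would need another piece of Java like this, which is exactly why I am after a query or transformation language instead.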
EDIT: Oh, by the way. This is a log file very similar to what is generated by log4j XMLLayout. The reason for using XML is to be able to do exactly this, namely run queries against the log. This is a first attempt, hence the simple question. Later I would like to ask more complex questions, so I need the query language to be able to handle the input file.