I have had good experience with STX (Streaming Transformations for XML). In essence, it is a streaming version of XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has a Java implementation called Joost.
It should be easy to come up with an STX transform that ignores all elements until an element matches a given XPath, copies that element and all its children (using an identity template within a template group), and then continues to ignore elements until the next match.
UPDATE
I hacked together an STX transformation that does what I understand you want. It relies mainly on STX-specific features such as template groups and configurable default templates.
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
    <stx:template match="element/child">
        <stx:process-self group="copy" />
    </stx:template>
    <stx:group name="copy" pass-through="all">
    </stx:group>
</stx:transform>
pass-through="none" on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes self" in the "copy" group, which means that the matching template from the group with name="copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process the child elements. When the element/child element has ended, control returns to the template that invoked process-self, and the following elements are ignored again until the pattern matches the next time.
The following is an example of an input file:
<root>
    <child attribute="no-parent, so no copy">
    </child>
    <element id="id1">
        <child attribute="value1">
            text1<b>bold</b>
        </child>
    </element>
    <element id="id2">
        <child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
                yet more<b i="i" x:i="xi" ></b>
            </x:childX>
        </child>
    </element>
</root>
This is the corresponding output file:
<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
            text1<b>bold</b>
        </child><child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
                yet more<b i="i" x:i="xi" />
            </x:childX>
        </child>
The unusual formatting is the result of skipping the text nodes that contain the newlines outside of the child elements.
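If Joost is not at hand, the ignore-then-copy control flow of the stylesheet can be approximated with the StAX API that ships with the JDK. This is not STX and not how Joost works internally; it is only a minimal sketch of the same idea. The class name StreamingCopy and the method copyMatches are made up for illustration, the pattern element/child is hard-coded, and namespace prefixes are ignored when matching.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of STX or Joost.
public class StreamingCopy {

    /** Copies every "child" subtree whose parent is "element"; drops everything else. */
    public static String copyMatches(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        Deque<String> path = new ArrayDeque<>(); // local names of the currently open elements
        int copyDepth = 0;                        // > 0 while inside a matched subtree

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                // Rough equivalent of match="element/child": check the direct parent only.
                boolean matches = "child".equals(name) && "element".equals(path.peek());
                path.push(name);
                if (copyDepth > 0 || matches) {
                    copyDepth++; // enter, or go deeper into, "copy" mode
                }
            }
            if (copyDepth > 0) {
                writer.add(event); // behaves like pass-through="all"
            }
            if (event.isEndElement()) {
                path.pop();
                if (copyDepth > 0) {
                    copyDepth--; // back to ignoring once the matched element ends
                }
            }
        }
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String input = "<root><child attribute=\"no-parent\"/>"
                + "<element><child attribute=\"value1\">text1<b>bold</b></child></element></root>";
        System.out.println(copyMatches(input));
    }
}
```

Like the STX version, this only keeps the stack of open element names in memory, so it should cope with large inputs when given a file-based reader and writer instead of strings. Namespace declarations inherited from ancestors outside the copied subtree are not handled here.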
Christian Semrau