How to filter huge XML files in Java?

As the title says, I have a huge XML file (gigabytes):

<root>
  <keep>
    <stuff> ... </stuff>
    <morestuff> ... </morestuff>
  </keep>
  <discard>
    <stuff> ... </stuff>
    <morestuff> ... </morestuff>
  </discard>
</root>

and I would like to convert it to a much smaller one that only retains a few elements.
My parser should do the following:
1. Scan the file until a matching element appears.
2. Copy the entire matching element (with its children) to the output file, then go back to 1.

Step 1 is simple with SAX and impossible for DOM parsers.
Step 2 is annoying with SAX, but easy with a DOM parser or XSLT.

So: is there an easy way to combine a SAX and a DOM parser to accomplish this?

+7
java xml parsing
7 answers

Yes: just write a SAX content handler, and when it encounters the element you want, build a DOM tree from that element. I have done this with very large files and it works very well.

In fact, it is very simple: as soon as you hit the start of the element you need, you set a flag in the content handler, and from then on you feed everything to a DOM builder. When you reach the end of the element, you clear the flag and write out the result.

(For the more complex case of nested elements with the same name, you need a stack or a counter, but that is still pretty easy to do.)
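A minimal sketch of that flag-plus-DOM-builder approach, using only JDK classes. The element name `keep` is taken from the question; the synthetic `<result>` root (needed because a DOM document allows only one root element) and the `extractKeep` helper are assumptions for this example:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class KeepExtractor extends DefaultHandler {
    private final Document doc;      // target DOM document
    private final Element wrapper;   // synthetic root collecting the copies
    private Node current;            // insertion point, or null while skipping
    private int keepDepth;           // counter for nested <keep> elements

    KeepExtractor(Document doc, Element wrapper) {
        this.doc = doc;
        this.wrapper = wrapper;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current == null) {
            if (!qName.equals("keep")) return;   // still outside any <keep>
            current = wrapper;                   // start a new copy
        }
        if (qName.equals("keep")) keepDepth++;
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++)
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (current != null)
            current.appendChild(doc.createTextNode(new String(ch, start, len)));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) return;
        current = current.getParentNode();
        if (qName.equals("keep") && --keepDepth == 0)
            current = null;                      // copy finished, back to skipping
    }

    /** Filters the XML, returning only the <keep> subtrees under a <result> root. */
    public static String extractKeep(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element wrapper = doc.createElement("result");
        doc.appendChild(wrapper);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new KeepExtractor(doc, wrapper));
        StringWriter sw = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractKeep(
            "<root><keep><stuff>a</stuff></keep><discard><stuff>b</stuff></discard></root>"));
        // prints <result><keep><stuff>a</stuff></keep></result>
    }
}
```

In real use you would parse from a file stream and serialize each finished copy immediately instead of accumulating everything, so memory stays bounded by the largest kept element.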

+9

StAX seems like the obvious solution: it is a pull parser, rather than the "push" approach of SAX or the "buffer it all" approach of DOM. I can't say I have used it, but a StAX tutorial may be useful :)
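To illustrate the pull style, here is a small sketch using the StAX cursor API (`XMLStreamReader` from the JDK's `javax.xml.stream`); the element name `keep` and the `countKeep` helper are assumptions for the example. The caller drives the parse and simply ignores events it does not care about:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSkip {
    /** Counts <keep> elements without ever loading the whole document. */
    public static int countKeep(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int n = 0;
        while (r.hasNext()) {
            // next() advances the cursor and returns the event type
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && r.getLocalName().equals("keep"))
                n++;
        }
        r.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countKeep("<root><keep/><discard/><keep/></root>"));
        // prints 2
    }
}
```

The same loop structure extends naturally to copying: on a matching start element you would switch to forwarding events until the matching end element.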

+10

I have had great success with STX (Streaming Transformations for XML). Essentially, it is a streaming version of XSLT, well suited to parsing huge amounts of data with minimal memory. It has a Java implementation called Joost.

It should be possible to come up with an STX transform that ignores all elements until one matches a given XPath, copies that element and all its children (using the identity template in a template group), and then goes back to ignoring elements until the next match.

UPDATE

I hacked up an STX transformation that does what I understand you want. It relies mainly on STX features such as template groups and configurable default templates.

<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
               version="1.0" pass-through="none" output-method="xml">
  <stx:template match="element/child">
    <stx:process-self group="copy" />
  </stx:template>
  <stx:group name="copy" pass-through="all">
  </stx:group>
</stx:transform>

pass-through="none" on stx:transform sets up the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes itself" in the "copy" group, meaning the matching template from the group named "copy" is applied to the current element. That group has pass-through="all", so its default templates copy their input and process children. When the element/child is finished, control passes back to the template that called process-self, and the following elements are ignored again, until the pattern matches once more.

Here is an example input file:

<root>
  <child attribute="no-parent, so no copy">
  </child>
  <element id="id1">
    <child attribute="value1">
      text1<b>bold</b>
    </child>
  </element>
  <element id="id2">
    <child attribute="value2">
      text2
      <x:childX xmlns:x="http://x.example.com/x">
        <!-- comment -->
        yet more<b i="i" x:i="xi"></b>
      </x:childX>
    </child>
  </element>
</root>

And this is the corresponding output file:

<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1"> text1<b>bold</b> </child><child attribute="value2"> text2 <x:childX xmlns:x="http://x.example.com/x"> <!-- comment --> yet more<b i="i" x:i="xi" /> </x:childX> </child>

The unusual formatting is the result of skipping the whitespace-only text nodes (newlines) that appear outside the child elements.

+5

Since you are talking about gigabytes, I would make memory usage the first consideration. SAX needs only a small amount of memory that is essentially independent of document size, while DOM typically needs at least 5 times the document size. So if your XML file is 1 GB, DOM will require at least 5 GB of free memory. That is no joke. So SAX (or any variant of it, such as StAX) is the better option here.

If you want the most memory-efficient approach, check out VTD-XML. It requires only a little more memory than the file itself.

+3

Take a look at StAX; this might be what you need. There is a good introduction on IBM developerWorks.

+2

For such a large XML document, something with a streaming architecture, like OmniMark, would be ideal.

It needn't be anything complicated. An OmniMark script like the one below should give you what you need:

process
   submit #main-input

macro upto (arg string) is
   ((lookahead not string) any)*
macro-end

find ("<keep" upto ("</keep>") "</keep>") => keep
   output keep

find any
+2

You can do this quite easily with an XMLEventReader and an XMLEventWriter from the javax.xml.stream package.
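A minimal sketch of that idea, assuming the `keep` element name from the question and a hypothetical `filter` helper: every event between a `<keep>` start tag and its matching end tag is forwarded verbatim to the writer, and everything else is dropped. (If several `<keep>` elements can match, you would want to also emit a synthetic wrapper root so the output stays well-formed.)

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class EventCopy {
    /** Streams the input, forwarding only events inside <keep> subtrees. */
    public static String filter(String xml) throws Exception {
        XMLEventReader in = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        StringWriter sw = new StringWriter();
        XMLEventWriter out = XMLOutputFactory.newInstance().createXMLEventWriter(sw);
        int depth = 0;  // > 0 while inside a <keep> element (handles nesting)
        while (in.hasNext()) {
            XMLEvent e = in.nextEvent();
            if (e.isStartElement()
                    && e.asStartElement().getName().getLocalPart().equals("keep"))
                depth++;
            if (depth > 0) out.add(e);          // copy the event verbatim
            if (e.isEndElement()
                    && e.asEndElement().getName().getLocalPart().equals("keep"))
                depth--;
        }
        out.close();
        in.close();
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filter(
            "<root><keep><a>x</a></keep><discard>y</discard></root>"));
        // prints <keep><a>x</a></keep>
    }
}
```

Because events are passed straight from reader to writer, memory use stays constant no matter how large the input file is.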

0
