Java: how to split XML stream into small XML documents? XPath for XML streaming parser?

I need to read a large XML document from the web and split it into smaller XML documents. In particular, the stream I'm reading from the network looks something like this:

<a> <b> ... </b> <b> ... </b> <b> ... </b> <b> ... </b> .... </a>

I need to break it into pieces like this:

<a> <b> ... </b> </a>

(Really, I only need the <b> ... </b> pieces, with any namespace declarations made higher up (for example, on <a>) moved onto <b>, if that makes it easier.)

The file is too large for a DOM-style parser, so the processing has to be streaming. Is there an XML library that can do this?

[Edit]

I think what I'm ideally looking for is something like the ability to run XPath queries against an XML stream, where the stream parser parses only as far as necessary to return the next item in the node-set result (along with all its attributes and children). It doesn't have to be XPath, but something along those lines.

Thanks!

+4
5 answers

The JAXP SAX API combined with a SAX filter is fast and efficient. Good examples of filters can be seen here
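
To make the SAX filter idea concrete, here is a minimal sketch using only the JDK. It assumes the pieces you want are the <b> children of the root element; the class name SaxSplitter and the split() helper are made up for illustration. Each <b> is re-serialized through an identity TransformerHandler and wrapped in a synthetic <a>. Namespace declarations (startPrefixMapping events) are not forwarded here, so a namespaced document would need extra handling.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: collects one small <a><b>...</b></a> document per <b> child of the root. */
public class SaxSplitter extends XMLFilterImpl {
    private final SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    private final List<String> pieces = new ArrayList<>();
    private final AttributesImpl noAttrs = new AttributesImpl();
    private TransformerHandler out;   // non-null only while inside a <b>
    private StringWriter buffer;
    private int depth = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (depth == 2 && "b".equals(local)) {
            try {
                out = factory.newTransformerHandler();   // identity serializer
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            // Output properties are set before the result is attached.
            out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            buffer = new StringWriter();
            out.setResult(new StreamResult(buffer));
            out.startDocument();
            out.startElement("", "a", "a", noAttrs);     // synthetic wrapper element
        }
        if (out != null) {
            out.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (out != null) {
            out.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (out != null) {
            out.endElement(uri, local, qName);
            if (depth == 2) {                            // the top-level <b> just closed
                out.endElement("", "a", "a");
                out.endDocument();
                pieces.add(buffer.toString());           // a real splitter would write a file here
                out = null;
            }
        }
        depth--;
    }

    /** Parses the stream once and returns the collected pieces. */
    public static List<String> split(Reader xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);                     // so the local name is populated
        SaxSplitter filter = new SaxSplitter();
        filter.setParent(spf.newSAXParser().getXMLReader());
        filter.parse(new InputSource(xml));
        return filter.pieces;
    }

    public static void main(String[] args) throws Exception {
        for (String piece : split(new StringReader("<a><b>text1</b><b>text2</b></a>"))) {
            System.out.println(piece);
        }
    }
}
```

Only the <b> currently being read is held in memory. For a really large input you would hand each piece to a file or a consumer as soon as it is complete instead of collecting them in a list.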

+2

As an XML splitter, VTD-XML is ideally suited to this task, and it is also more memory-efficient than DOM. The key method that simplifies the code is VTDNav's getElementFragment(). Below is Java code that splits input.xml into out0.xml and out1.xml, that is, it splits

 <a> <b> text1 </b> <b> text2 </b> </a> 

into

 <a> <b> text1</b> </a> 

and

 <a> <b> text2</b> </a> 

using XPath

 /a/b 

The code

    import java.io.*;
    import com.ximpleware.*;

    public class split {
        public static void main(String[] argv) throws Exception {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("c:/split/input.xml", true)) {
                VTDNav vn = vg.getNav();
                AutoPilot ap = new AutoPilot(vn);
                ap.selectXPath("/a/b");
                int k = 0;
                byte[] ba = vn.getXML().getBytes();
                while (ap.evalXPath() != -1) {
                    // try-with-resources closes each output file
                    try (FileOutputStream fos = new FileOutputStream("c:/split/out" + k + ".xml")) {
                        fos.write("<a>".getBytes());
                        // lower 32 bits: offset, upper 32 bits: length
                        long l = vn.getElementFragment();
                        fos.write(ba, (int) l, (int) (l >> 32));
                        fos.write("</a>".getBytes());
                    }
                    k++;
                }
            }
        }
    }

For further reading, please visit http://www.devx.com/xml/Article/36379

+1

Go old school:

    StringBuilder buffer = new StringBuilder(1024 * 50);
    try (BufferedReader reader = new BufferedReader(new FileReader(pstmtout))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line break, so put one back
            buffer.append(line).append('\n');
            // assumes the closing tag sits alone on its own line
            if (line.equalsIgnoreCase(endStatementTag)) {
                service.handle(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

+1

I like the XOM library because its interface is simple, intuitive and efficient. To do what you want with XOM, you can use your own NodeFactory and (for example) override the finishMakingElement() method. If the element it has just finished building is the one you need (in your case <b>), you pass it along to whatever needs to process it.

0

You can do it with XProc:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <p:load href="in/huge-document.xml"/>
      <p:for-each>
        <p:iteration-source select="/a/b"/>
        <p:wrap match="/b" wrapper="a"/>
        <p:store>
          <p:with-option name="href" select="concat('part', p:iteration-position(), '.xml')">
            <p:empty/>
          </p:with-option>
        </p:store>
      </p:for-each>
    </p:declare-step>

You can also try QuiXProc (a streaming XProc implementation: http://code.google.com/p/quixproc/ ) to run it in streaming mode.

0