How to combine >1000 XML files into one using Java

I am trying to merge many XML files into one. I have done this successfully with the DOM, but that solution is limited to a few files. When I run it on more than 1000 files, I get java.lang.OutOfMemoryError.

What I want to achieve: given the following files

file 1:

<root> .... </root> 

file 2:

 <root> ...... </root> 

file n:

 <root> .... </root> 

the merge should produce this output:

 <rootSet> <root> .... </root> <root> .... </root> <root> .... </root> </rootSet> 

This is my current implementation:

    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.newDocument();
    Element rootSetElement = doc.createElement("rootSet");
    Node rootSetNode = doc.appendChild(rootSetElement);
    Element creationElement = doc.createElement("creationDate");
    rootSetNode.appendChild(creationElement);
    creationElement.setTextContent(dateString);

    File dir = new File("/tmp/rootFiles");
    String[] files = dir.list();
    if (files == null) {
        System.out.println("No roots to merge!");
    } else {
        Document rootDocument;
        for (int i = 0; i < files.length; i++) {
            File filename = new File(dir + "/" + files[i]);
            rootDocument = docBuilder.parse(filename);
            Node tempDoc = doc.importNode(rootDocument.getElementsByTagName("root").item(0), true);
            rootSetNode.appendChild(tempDoc);
        }
    }

I experimented a lot with XSLT and SAX, but something always seemed to be missing. Any help would be greatly appreciated.

+8
java performance merge xml out-of-memory
6 answers

You can also use StAX. Here is the code that will do what you want:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.Writer;
    import javax.xml.stream.XMLEventFactory;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.XMLEvent;
    import javax.xml.transform.stream.StreamSource;

    public class XMLConcat {
        public static void main(String[] args) throws Throwable {
            File dir = new File("/tmp/rootFiles");
            File[] rootFiles = dir.listFiles();

            Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
            XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
            XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
            XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();
            xmlEventWriter.add(xmlEventFactory.createStartDocument());
            xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));

            XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
            for (File rootFile : rootFiles) {
                XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
                XMLEvent event = xmlEventReader.nextEvent();
                // Skip ahead in the input to the opening document element
                while (event.getEventType() != XMLEvent.START_ELEMENT) {
                    event = xmlEventReader.nextEvent();
                }
                do {
                    xmlEventWriter.add(event);
                    event = xmlEventReader.nextEvent();
                } while (event.getEventType() != XMLEvent.END_DOCUMENT);
                xmlEventReader.close();
            }

            xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
            xmlEventWriter.add(xmlEventFactory.createEndDocument());
            xmlEventWriter.close();
            outputWriter.close();
        }
    }

One minor caveat is that this API seems to mangle empty tags, turning <foo/> into <foo></foo>.

+8

This task doesn't require any actual XML parsing, so just do it without an XML parser. One caveat: this assumes the input files contain no XML declaration or other prolog, since a raw copy would paste those into the middle of the output.

For efficiency, do something like this:

    File dir = new File("/tmp/rootFiles");
    String[] files = dir.list();
    if (files == null) {
        System.out.println("No roots to merge!");
    } else {
        try (FileChannel output = new FileOutputStream("output").getChannel()) {
            ByteBuffer buff = ByteBuffer.allocate(32);
            buff.put("<rootSet>\n".getBytes()); // specify encoding too
            buff.flip();
            output.write(buff);
            buff.clear();
            for (String file : files) {
                try (FileChannel in = new FileInputStream(new File(dir, file)).getChannel()) {
                    in.transferTo(0, 1 << 24, output); // copies up to 16 MB per file
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            buff.put("</rootSet>\n".getBytes()); // specify encoding too
            buff.flip();
            output.write(buff);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
+3

The DOM has to keep the entire document in memory. If you don't need to perform any special operations on your tags, I would just use an InputStream and read all the files; if you do need to operate on the content, use SAX.
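
A rough, untested sketch of the InputStream variant (class name and paths are illustrative, taken from the question; like any raw copy, it assumes the input files carry no XML declaration):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class StreamMerge {
        public static void main(String[] args) throws IOException {
            File dir = new File("/tmp/rootFiles");
            try (OutputStream out = new FileOutputStream("/tmp/mergedFile.xml")) {
                out.write("<rootSet>".getBytes("UTF-8"));
                byte[] buf = new byte[8192];
                for (File f : dir.listFiles()) {
                    try (InputStream in = new FileInputStream(f)) {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n); // raw byte copy, no XML parsing
                        }
                    }
                }
                out.write("</rootSet>".getBytes("UTF-8"));
            }
        }
    }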

+2

DOM really does consume a lot of memory. IMHO you have the following alternatives.

The best option is SAX (a minimal sketch follows below). With SAX, very little memory is used, since essentially each element moves from input to output as it is encountered, so the footprint stays extremely low. However, SAX is not so simple to use, because compared to DOM it is a little counterintuitive.

Try StAX. I haven't tried it myself, but it's a kind of SAX on steroids that is easier to use: rather than merely accepting SAX events you don't control, you actually "ask the source" to hand you the elements you want. It sits halfway between DOM and SAX, with a memory footprint like SAX but a friendlier paradigm.

SAX, StAX and DOM matter if you want to properly preserve namespace declarations and the other oddities of XML.
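
To make the SAX option concrete, here is a minimal, untested sketch (class name and file locations are illustrative, taken from the question). The XMLFilterImpl subclass suppresses each input file's startDocument/endDocument events, so every <root> element lands inside a single <rootSet> document:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.Writer;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.AttributesImpl;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class SaxMerge {
        public static void main(String[] args) throws Exception {
            Writer out = new FileWriter("/tmp/mergedFile.xml");
            // An identity TransformerHandler serializes the SAX events it receives
            SAXTransformerFactory stf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler serializer = stf.newTransformerHandler();
            serializer.setResult(new StreamResult(out));

            serializer.startDocument();
            serializer.startElement("", "rootSet", "rootSet", new AttributesImpl());

            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // Forward all events except each file's startDocument/endDocument
            XMLFilterImpl filter = new XMLFilterImpl(reader) {
                @Override public void startDocument() { /* suppressed */ }
                @Override public void endDocument()   { /* suppressed */ }
            };
            filter.setContentHandler(serializer);

            for (File f : new File("/tmp/rootFiles").listFiles()) {
                filter.parse(f.toURI().toString());
            }

            serializer.endElement("", "rootSet", "rootSet");
            serializer.endDocument();
            out.close();
        }
    }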

However, if you just need a quick-and-dirty way that will most likely still be namespace-safe, use plain old strings and writers.

Start by writing the declaration and the root element of your "big" document to a FileWriter. Then load each individual file, with DOM if you like. Select the elements you want to insert into the "big" file, serialize them to a string, and send them to the writer. The writer flushes to disk without using huge amounts of memory, and DOM holds only one document at a time. Unless you also have very large files on the input side, or plan to run this on a mobile phone, you shouldn't have many memory problems. If DOM serializes correctly, it will preserve namespace declarations and so on, and the code will be just a handful of lines longer than what you posted.
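
Untested, but a minimal sketch of that approach might look like this (class name and paths are illustrative; a reused JAXP Transformer stands in for "serialize them to a string" and writes each root element straight to the writer):

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;

    public class DomPerFileMerge {
        public static void main(String[] args) throws Exception {
            DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // Fragments are serialized one by one, so suppress per-fragment declarations
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            try (Writer out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("/tmp/mergedFile.xml"), "UTF-8"))) {
                out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rootSet>");
                for (File f : new File("/tmp/rootFiles").listFiles()) {
                    Document doc = docBuilder.parse(f); // only one document in memory at a time
                    serializer.transform(new DOMSource(doc.getDocumentElement()),
                                         new StreamResult(out));
                }
                out.write("</rootSet>");
            }
        }
    }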

+2

For this kind of work, I suggest not using the DOM; reading the file contents and taking a substring is simpler and entirely sufficient.

I am thinking of something like this:

 String rootContent = document.substring(document.indexOf("<root>"), document.lastIndexOf("</root>")+7); 

Then, to keep memory usage down, write to the main file after each extraction, with a BufferedWriter for example. For the best performance you can also use java.nio.
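
Assembled into a sketch (untested; class name and paths are illustrative; it assumes each input file fits comfortably in memory on its own and that "<root>" does not also appear inside comments or CDATA):

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.nio.file.Files;

    public class SubstringMerge {
        public static void main(String[] args) throws Exception {
            try (BufferedWriter writer = new BufferedWriter(new FileWriter("/tmp/mergedFile.xml"))) {
                writer.write("<rootSet>");
                for (File f : new File("/tmp/rootFiles").listFiles()) {
                    String document = new String(Files.readAllBytes(f.toPath()), "UTF-8");
                    // Extract the complete <root>...</root> element ("</root>" is 7 chars long)
                    String rootContent = document.substring(
                            document.indexOf("<root>"), document.lastIndexOf("</root>") + 7);
                    writer.write(rootContent); // buffered, flushed to disk as it fills
                }
                writer.write("</rootSet>");
            }
        }
    }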

+1

I think you are already on the right track. The only way to make this scale to a really huge number of files is a text-based, streaming approach, so that you never hold everything in memory at once. But hey, good news: memory is cheap these days and 64-bit JVMs are all the rage, so maybe you only need to increase the heap size. Try restarting your program with the -Xmx1g JVM option (sets the maximum heap size to 1 GB).

I also use XOM for all my DOM needs. Give it a try; it is much more efficient. I don't know its exact memory requirements for certain, but in my experience the improvement is an order of magnitude.
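
For what it's worth, a minimal merge with XOM might look like this (untested sketch; class name and paths are illustrative, and it assumes XOM is on the classpath):

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import nu.xom.Builder;
    import nu.xom.Document;

    public class XomMerge {
        public static void main(String[] args) throws Exception {
            Builder builder = new Builder();
            try (BufferedWriter out = new BufferedWriter(new FileWriter("/tmp/mergedFile.xml"))) {
                out.write("<rootSet>");
                for (File f : new File("/tmp/rootFiles").listFiles()) {
                    Document doc = builder.build(f);         // one document in memory at a time
                    out.write(doc.getRootElement().toXML()); // serialize just the root element
                }
                out.write("</rootSet>");
            }
        }
    }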

+1
