The DOM object model relies on all data being loaded into memory. Even if you find an implementation that defers loading (a lazy DOM), you will still run out of memory as soon as a consumer of the DOM API traverses the entire tree.
Essentially, you would save memory at the point where you call a hypothetical `XMemorySavingXDocument.Load("big.xml")`, but the first XPath or LINQ query that walks the complete DOM tree would still throw an OutOfMemoryException. If you can guarantee that such a query never happens, you could live with a lazy DOM tree like this.
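To make that concrete, here is a minimal sketch (using the regular `XDocument`, since `XMemorySavingXDocument` is only hypothetical) of the kind of query that forces the whole tree into memory, which is exactly where a lazy DOM would still blow up:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class FullTreeQuery
{
    static void Main()
    {
        // The stock XDocument already materializes the whole tree on Load;
        // a lazy DOM would postpone that, but only until the query below runs.
        XDocument doc = XDocument.Load("big.xml");

        // Descendants() enumerates every element in the document, so any lazy
        // implementation would be forced to page the complete tree in here.
        int elementCount = doc.Descendants().Count();
        Console.WriteLine($"Elements: {elementCount}");
    }
}
```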
I do not know of such an implementation, and I doubt it would help in your case anyway. As you said, many consumers of the DOM API traverse the tree across all nodes, so you would hit an OutOfMemoryException within minutes with this approach.
The XML DOM object model “decompresses” an XML file into an in-memory representation that consumes roughly 7 times the size of the original file on x64; on x86 it is still about 3.5 times.
The reason the XML DOM model is so bloated is that every DOM node knows its children, its parent and its attributes. Each of these is an object reference stored per node, and those references add up quickly.
A managed class instance consumes at least 12/24 bytes (x86/x64) of overhead, and every node reference adds another 4/8 bytes (x86/x64) to the total memory consumption, so a large XML file quickly exhausts memory. See the article for more information on .NET object sizes.
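One rough way to check the blow-up factor for your own files is to compare the managed heap before and after loading; a minimal sketch (the 7x/3.5x figures above come from separate measurements, not from this snippet):

```csharp
using System;
using System.IO;
using System.Xml.Linq;

class DomBlowUpFactor
{
    static void Main(string[] args)
    {
        string path = args.Length > 0 ? args[0] : "big.xml";
        long fileSize = new FileInfo(path).Length;

        long before = GC.GetTotalMemory(forceFullCollection: true);
        XDocument doc = XDocument.Load(path);
        long after = GC.GetTotalMemory(forceFullCollection: true);

        Console.WriteLine($"File size: {fileSize / (1024 * 1024)} MB");
        Console.WriteLine($"DOM size:  {(after - before) / (1024 * 1024)} MB");
        Console.WriteLine($"Factor:    {(double)(after - before) / fileSize:F1}x");

        // Keep the document reachable until after the second measurement,
        // otherwise the GC may collect it and skew the numbers.
        GC.KeepAlive(doc);
    }
}
```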
Since the DOM is not a good idea for large XML files, but your current architecture requires a DOM, I am afraid you will need to ditch the DOM and replace it with an API that extracts (and potentially modifies) only the data you are actually interested in. In a large organization you can take this topic to the architects and present it as a major redesign that needs to be prioritized accordingly.
If you are lucky enough to get buy-in from the architects and managers, then some third-party programmers in countries you have never been to will get your next big thing to work ;-).
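As for what the replacement API could look like: a minimal sketch of a streaming extractor, assuming the data you care about lives in elements named `item` (a placeholder, adjust it to your schema). The `XmlReader` only ever holds the current node, so memory stays flat regardless of file size:

```csharp
using System;
using System.Xml;

class StreamingExtractor
{
    static void Main()
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create("big.xml", settings))
        {
            while (reader.Read())
            {
                // The inner loop is needed because ReadElementContentAsString
                // already positions the reader on the node that follows the
                // element it consumed, which may itself be the next <item>.
                while (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    string value = reader.ReadElementContentAsString();
                    Process(value);
                }
            }
        }
    }

    static void Process(string value)
    {
        // Placeholder for whatever the consuming code really needs to do.
        Console.WriteLine(value);
    }
}
```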
To give you some numbers on how much the data format affects performance, I created a file with 1 million integers in 3 different data formats (a generation sketch follows the list):
- Binary file: 40 MB
- ASCII text file: 80 MB (`ddd\r\nddd\r\n...`)
- XML file: 170 MB (one element per integer, one per line)
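A sketch of how comparable test files could be generated (the element name `n` in the XML variant is a placeholder; the original layout is only hinted at above):

```csharp
using System.IO;
using System.Xml;

class TestFileGenerator
{
    const int Count = 1_000_000; // integer count quoted above

    static void Main()
    {
        // Binary: 4 bytes per integer.
        using (var bw = new BinaryWriter(File.Create("ints.bin")))
        {
            for (int i = 0; i < Count; i++) bw.Write(i);
        }

        // ASCII text: one integer per line (ddd\r\n).
        using (var sw = new StreamWriter("ints.txt"))
        {
            for (int i = 0; i < Count; i++) sw.WriteLine(i);
        }

        // XML: one element per integer, inside a single root element.
        var settings = new XmlWriterSettings { Indent = true };
        using (var xw = XmlWriter.Create("ints.xml", settings))
        {
            xw.WriteStartElement("ints");
            for (int i = 0; i < Count; i++) xw.WriteElementString("n", i.ToString());
            xw.WriteEndElement();
        }
    }
}
```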
Then I read them back in a 64-bit process:
- 0.1 s: binary file via memory-mapped file
- 0.5 s: BinaryReader
- 2.5 s: text file
- 5.3 s: XmlReader (streaming)
- 8.6 s: XDocument.Load
Memory consumption stayed flat at ~200 MB, except for XDocument.Load, which peaked at 1.2 GB. Your mileage may vary, but as a first step I would convert the XML data via a streaming XmlReader into a binary format that can be loaded much faster.
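A minimal sketch of that one-time conversion, again assuming the integers sit in elements named `n`: the `XmlReader`/`BinaryWriter` pair streams the data through, so memory stays flat, and the resulting binary file can afterwards be loaded with `BinaryReader` or a memory-mapped file in a fraction of the time:

```csharp
using System.IO;
using System.Xml;

class XmlToBinaryConverter
{
    static void Main()
    {
        using (var reader = XmlReader.Create("ints.xml"))
        using (var writer = new BinaryWriter(File.Create("ints.bin")))
        {
            while (reader.Read())
            {
                // Same pattern as the extractor above: ReadElementContentAsInt
                // advances the reader past the element it just consumed.
                while (reader.NodeType == XmlNodeType.Element && reader.Name == "n")
                {
                    writer.Write(reader.ReadElementContentAsInt());
                }
            }
        }
    }
}
```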