What is the largest XML file that SSIS can extract data from?

We have an architecture in which we use SSIS to extract data from batch XML files into an intermediate database for validation before exporting it to production.

To some extent, we control the XML format, and I was asked to determine how many records a batch XML file should contain. Based on the XML schema and some sample data, I can estimate the average record size and make some predictions from there.

However, I would also like to approach this from a different angle and get an indication of the technical limitations of SSIS when working with large XML files.

I know that SSIS flattens an XML document and converts it into its own in-memory tabular representation, so RAM becomes the obvious limiting factor, but in what proportion?

Can you say something like: SSIS needs available memory of at least 2.5 times the size of the file you are trying to load? Assuming I have a 32 GB box dedicated to this data load function, how large can my XML files be?

I know that other factors come into play, such as the complexity of the schema, the number of nested elements, etc., but it would be nice to have a starting point.

2 answers

The XML Source does not load the entire document into memory; it streams the data out of the XML file. So if you read XML and write it to, for example, text files without complex transformations, you need relatively little memory. Also, past a certain threshold the amount of memory you need stops growing as the XML file grows, so you can potentially process XML files of unlimited size.

For example, this guy imported all of Wikipedia's content (a 20 GB XML file): http://www.ideaexcursion.com/2009/01/26/import-wikipedia-articles-into-sql-server-with-ssis/

Of course, you will probably do something with this data, for example merge multiple streams coming from the XML source. Depending on what you do, you may need more memory, because some transformations hold the entire data set in memory or perform much better when there is enough memory for the entire data set.
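To illustrate the difference between a transformation that passes rows through and one that has to hold the whole data set, here is a minimal Python sketch. It is only an analogy, not SSIS code: the `source`, `trim_names` and `sort_by_name` functions and the row layout are made up for illustration. In SSIS terms, a sort is a typical example of a transformation that needs every row before it can emit the first one.

```python
from typing import Iterable, Iterator


def source(n: int) -> Iterator[dict]:
    """Hypothetical row source: yields rows one at a time, in descending id order."""
    for i in range(n, 0, -1):
        yield {"id": i, "name": f" item{i:03d} "}


def trim_names(rows: Iterable[dict]) -> Iterator[dict]:
    """Streaming step: each row is emitted as soon as it is read."""
    for row in rows:
        yield {**row, "name": row["name"].strip()}


def sort_by_name(rows: Iterable[dict]) -> Iterator[dict]:
    """Blocking step: sorted() must consume every row before returning any."""
    return iter(sorted(rows, key=lambda r: r["name"]))


# The streaming step produces its first row without reading the rest.
first = next(trim_names(source(3)))

# The blocking step materializes all rows, so memory grows with the input.
ordered = list(sort_by_name(trim_names(source(3))))
```

With only streaming steps, memory use stays roughly constant however many rows flow through; as soon as a blocking step is added, it scales with the data set.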


It's not so easy.

First of all, keep in mind that SSIS "flattens" XML, so the XML source has one output for each path through the XML. A trivial example:

<Parent><Child><Grandchild/></Child></Parent>

will produce three outputs and three error outputs. It gets worse:

<Parent><Child><Grandchild><Notes/></Grandchild><Notes/></Child><Notes/></Parent>

This will produce outputs for Parent, Child, Grandchild, Parent-Child-Grandchild-Notes, Parent-Child-Notes and Parent-Notes, each with both a normal and an error output.
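A rough way to estimate the number of outputs for a sample document is to count its distinct element paths. The sketch below does that in Python; it assumes, as described above, that every distinct path becomes one output (plus one error output). The real XML Source derives its outputs from the schema, so treat the result as an approximation. The `element_paths` helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text: str) -> list[str]:
    """Return each distinct root-to-element path, in document order."""
    root = ET.fromstring(xml_text)
    paths: list[str] = []

    def walk(elem: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        if path not in paths:
            paths.append(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


simple = "<Parent><Child><Grandchild/></Child></Parent>"
nested = (
    "<Parent><Child><Grandchild><Notes/></Grandchild>"
    "<Notes/></Child><Notes/></Parent>"
)

simple_paths = element_paths(simple)  # 3 paths
nested_paths = element_paths(nested)  # 6 paths

for path in nested_paths:
    print(path)
```

Doubling the count gives the number of normal plus error outputs you would have to wire up.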

The project I was working on had about 203 outputs. I was able to flatten the XML schema and bring that down to 19 or so. That is still a lot, considering that each output needs its own processing.

In addition, the XML task cannot process XML files of 1 GB or more. It really does load the entire document into memory. Try calling XmlDocument.Load on such a file and see what happens; that is what happens with SSIS.

I had to create an “XML element source” of my own, which processed the children of the root element one at a time. This allowed me to flatten XML, and also process large documents (a 10 GB test document worked).
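The idea of handing over the children of the root element one at a time can be sketched in Python with `xml.etree.ElementTree.iterparse`. This is not the custom SSIS component described above, only an illustration of the same technique; the `iter_root_children` name and the `Batch`/`Record` sample document are assumptions.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_root_children(source: BinaryIO) -> Iterator[ET.Element]:
    """Yield each direct child of the root as a complete subtree, one at a time."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element started is the document root
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # elem is a fully parsed child of the root
                yield elem
                # drop processed children so memory does not grow with file size
                root.clear()


sample = io.BytesIO(
    b"<Batch>"
    b"<Record id='1'><Name>a</Name></Record>"
    b"<Record id='2'><Name>b</Name></Record>"
    b"</Batch>"
)

ids = [child.get("id") for child in iter_root_children(sample)]
```

Only one child subtree is held in memory at a time, so the limit becomes the size of the largest single child rather than the size of the whole file.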

There is more fun, depending on what you want to do with the resulting data. In my case, we had to send each of the outputs to staging tables. That is not bad in itself, but you have to understand that the data on the outputs is asynchronous. The rows of one child (with its descendants) are spread across the output paths, and you never know when all of its descendants have finished processing. This rules out processing on a transactional, one-element-at-a-time basis.

Instead, SSIS adds a surrogate key (I think that is what it is called) to each child element. A ParentID would be added to the Parent, a ChildID to the Child, and a ParentID would also be added to the Child to refer to the parent of that child. These keys can be used to "re-join the element again", but only after all the data has been written to the staging tables. That is the only time you can be sure that any given element has been completely processed: when they are all there!
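As a sketch of that final re-join step, the following Python snippet uses an in-memory SQLite database in place of the staging database. The table names (`stg_parent`, `stg_child`), the `Name` columns and the sample rows are all hypothetical; only the idea of joining on the generated keys once every output has been fully written comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_parent (ParentID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE stg_child  (ChildID  INTEGER PRIMARY KEY,
                             ParentID INTEGER,
                             Name     TEXT);
    """
)

# Each output writes to its own staging table, in no guaranteed order:
# here the child rows happen to land before their parents.
conn.executemany(
    "INSERT INTO stg_child VALUES (?, ?, ?)",
    [(10, 1, "c1"), (11, 2, "c2"), (12, 1, "c3")],
)
conn.executemany(
    "INSERT INTO stg_parent VALUES (?, ?)",
    [(1, "p1"), (2, "p2")],
)


def rejoin(connection: sqlite3.Connection) -> list[tuple[str, str]]:
    """Reassemble parent/child pairs; only valid once all outputs are loaded."""
    return connection.execute(
        """
        SELECT p.Name, c.Name
        FROM stg_parent AS p
        JOIN stg_child  AS c ON c.ParentID = p.ParentID
        ORDER BY p.ParentID, c.ChildID
        """
    ).fetchall()


rows = rejoin(conn)
```

Running the join before the load has finished would silently miss rows that have not arrived yet, which is why it has to wait until every staging table is complete.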
