Creating very large XML files in Python?

Does anyone know of an efficient way to create large xml files (e.g. 100-500 MiB) in Python?

I currently use lxml, but memory usage is through the roof.

4 answers

Perhaps you could use a template engine instead of generating/building the XML yourself?

Genshi, for example, is XML-based and supports streaming output. A very simple example:

    from genshi.template import MarkupTemplate

    tpl_xml = '''<doc xmlns:py="http://genshi.edgewall.org/">
    <p py:for="i in data">${i}</p>
    </doc>'''

    tpl = MarkupTemplate(tpl_xml)
    stream = tpl.generate(data=xrange(10000000))
    with open('output.xml', 'w') as f:
        stream.render(out=f)

This may take some time, but memory usage remains low.

The same example using a Mako template (not "natively" XML), which is much faster:

    from mako.template import Template
    from mako.runtime import Context

    tpl_xml = '''<doc>
    % for i in data:
    <p>${i}</p>
    % endfor
    </doc>'''

    tpl = Template(tpl_xml)
    with open('output.xml', 'w') as f:
        ctx = Context(f, data=xrange(10000000))
        tpl.render_context(ctx)

The last example ran on my laptop in about 20 seconds, producing an (admittedly very simple) 151 MB XML file with no memory problems whatsoever; according to the Windows Task Manager, memory usage stayed constant at around 10 MB.

Depending on your needs, this may be a friendlier and faster way to generate XML than using SAX etc. Check the docs to see what you can do with these engines (there are others; I just picked these two as examples).


The only reasonable way to create such a large XML file is line by line, which means writing it out with print statements, a small state machine, and lots of testing.
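A minimal sketch of that line-by-line approach (the element names and the record source here are illustrative assumptions, not from the question):

```python
# Write a large XML file line by line, never holding a tree in memory.
from xml.sax.saxutils import escape

def write_big_xml(path, records):
    with open(path, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<records>\n')
        for rec in records:
            # escape() guards against <, > and & appearing in the data
            f.write('  <record>%s</record>\n' % escape(rec))
        f.write('</records>\n')

# records can be any iterable, e.g. rows streamed from a database
write_big_xml('output.xml', (str(i) for i in range(1000)))
```

Because `records` is consumed lazily, memory stays flat no matter how many records you write; the cost is that you must get the start/end tags right yourself.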


Obviously, you need to avoid creating the whole tree (whether it's DOM or etree or something else) in memory. But the best way depends on the source of your data and on how complex and interconnected the structure of your output is.

If it is large because it contains thousands of instances of fairly independent elements, you can create the outer wrapper, then build a tree for each element and serialize each fragment to the output as you go.
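A sketch of that fragment-at-a-time pattern, using the stdlib ElementTree (the same idea works with lxml.etree; the `item`/`collection` names are illustrative):

```python
# Build each independent fragment as a small tree, serialize it, discard it.
import xml.etree.ElementTree as ET

def write_fragments(path, items):
    with open(path, 'wb') as f:
        f.write(b'<collection>')           # outer wrapper, written by hand
        for name, value in items:
            elem = ET.Element('item', {'name': name})
            elem.text = value
            f.write(ET.tostring(elem))     # serialize one fragment at a time
        f.write(b'</collection>')

write_fragments('fragments.xml', [('a', '1'), ('b', '2'), ('c', '3')])
```

Only one fragment's tree exists at any moment, so peak memory is bounded by the largest single element, not by the whole document.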

If the fragments are not so independent, you will need to do extra bookkeeping, for example maintaining a database of generated IDs and IDREFs.

I would break it into two or three parts: a SAX event producer, an output serializer, and, if you prefer to work with some of the pieces as objects or trees, something to build those objects and then turn them into SAX events for the serializer.

Perhaps you could just handle it as direct text output rather than dealing with SAX events; it depends on how complex the structure is.

It could also be a good place to use Python generators as a way to stream the output without needing to build large structures in memory.
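For example, a generator can yield the document in small chunks so that only one chunk exists in memory at a time (a minimal sketch; the `<doc>`/`<p>` structure is just an illustration):

```python
# Stream an XML document chunk by chunk from a generator.
from xml.sax.saxutils import escape

def xml_chunks(values):
    yield '<doc>'
    for v in values:
        yield '<p>%s</p>' % escape(str(v))
    yield '</doc>'

# writelines() consumes the generator lazily, one chunk at a time
with open('gen_output.xml', 'w') as f:
    f.writelines(xml_chunks(range(5)))
```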


If your document is very regular (for example, a collection of database records, all in the same format), you could try my own xe library:

http://home.avvanta.com/~steveha/xe.html

The xe library was designed for creating syndication feeds (Atom, RSS, etc.), and I think it is easy to use. I need to update it for Python 2.6, but have not done so yet.

