There is already a good answer covering the xml.etree.ElementTree.iterparse approach for huge XML files (lxml offers an equivalent). The key to streaming with iterparse is to clear and delete already-processed nodes by hand, because otherwise you will still run out of memory.
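For reference, a minimal sketch of that iterparse approach, assuming the same kind of <root>/<entry> document shown further down and a placeholder file name data.xml; the important part is clearing the already-processed subtrees:

    import xml.etree.ElementTree as ET

    def iter_entries(path):
        # 'start' is only requested so we can grab the root element for cleanup
        context = ET.iterparse(path, events=('start', 'end'))
        _, root = next(context)                # the opening <root> tag
        for event, elem in context:
            if event == 'end' and elem.tag == 'entry':
                yield {
                    'a': elem.findtext('a'),
                    'b': elem.find('b').get('foo'),
                }
                # drop the processed entries so memory stays flat
                root.clear()

    for entry in iter_entries('data.xml'):
        print(entry)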
Another option is to use xml.sax. The official documentation is rather formal and short on examples, so it deserves some clarification along with the question. The default parser module, xml.sax.expatreader, implements the xml.sax.xmlreader.IncrementalParser interface, so the parser returned by xml.sax.make_parser() is suitable for streaming.
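Just to illustrate that point (not something you need in real code), the parser returned by make_parser() can be checked against that interface:

    import xml.sax
    import xml.sax.xmlreader

    parser = xml.sax.make_parser()
    # the default expat-based reader supports incremental feeding,
    # i.e. it exposes feed(data) and close() in addition to parse(source)
    print(isinstance(parser, xml.sax.xmlreader.IncrementalParser))  # True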
For example, given an XML stream like this:
    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <entry><a>value 0</a><b foo='bar' /></entry>
      <entry><a>value 1</a><b foo='baz' /></entry>
      <entry><a>value 2</a><b foo='quz' /></entry>
      ...
    </root>
It can be processed as follows:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import xml.sax


    class StreamHandler(xml.sax.handler.ContentHandler):
        """Collects one <entry> at a time and prints it when its closing tag arrives."""
        lastEntry = None
        lastName = None

        def startElement(self, name, attrs):
            self.lastName = name
            if name == 'entry':
                self.lastEntry = {}
            elif name != 'root':
                self.lastEntry[name] = {'attrs': attrs, 'content': ''}

        def endElement(self, name):
            if name == 'entry':
                print({
                    'a': self.lastEntry['a']['content'],
                    'b': self.lastEntry['b']['attrs'].getValue('foo')
                })
                self.lastEntry = None
            elif name == 'root':
                # signal the feeding loop that the document is finished
                raise StopIteration

        def characters(self, content):
            if self.lastEntry:
                self.lastEntry[self.lastName]['content'] += content


    if __name__ == '__main__':
        # use the default ``xml.sax.expatreader``
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())
        # feed the parser with small chunks to simulate a stream
        with open('data.xml') as f:
            while True:
                buffer = f.read(16)
                if not buffer:
                    break
                try:
                    parser.feed(buffer)
                except StopIteration:
                    break
        # alternatively, if you can provide a file-like object, it is as simple as
        # with open('data.xml') as f:
        #     parser.parse(f)