What is a good XML stream parser for Python?

Are there XML parsers for Python that can parse file streams? My XML files are too large to fit in memory, so I need to parse the stream.

Ideally, I should not need root access to install anything, so lxml is not a good option.

I am using xml.etree.ElementTree, but I am convinced that it is broken.

+9
python xml stream parsing
3 answers

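Use xml.etree.cElementTree. It is much faster than xml.etree.ElementTree. (On Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically, and cElementTree was removed in Python 3.9.) Neither of them is broken. Your files are broken (see my answer to your other question).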

+3

Here's a good answer about using xml.etree.ElementTree.iterparse with huge XML files. lxml also provides an iterparse method. The key to stream parsing with iterparse is to manually clear and delete nodes you have already processed, because otherwise you will still run out of memory.
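As a rough illustration of that clear-as-you-go pattern, here is a minimal sketch using only the standard library. The helper name `iter_entries`, the tag names, and the in-memory sample are assumptions made for this example, not taken from the linked answer.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield (text of <a>, 'foo' attribute of <b>) for each <entry>."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.findtext("a"), elem.find("b").get("foo")
            # Drop the children collected so far so memory stays bounded.
            root.clear()


# Hypothetical sample data; a real use would pass a file name or file object.
sample = io.BytesIO(
    b"<root>"
    b"<entry><a>value 0</a><b foo='bar'/></entry>"
    b"<entry><a>value 1</a><b foo='baz'/></entry>"
    b"</root>"
)
results = list(iter_entries(sample))
print(results)
```

Clearing the root after each entry is one of several possible strategies; calling `elem.clear()` alone frees the entry's contents but leaves empty elements attached to the root.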

Another option is to use xml.sax. The official documentation is too formal for my taste and contains no examples, so it is worth clarifying here alongside the question. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. That is, xml.sax.make_parser() gives you a suitable stream parser.
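A quick way to check this claim is to test the parser's type and feed it a document in arbitrary pieces. This is a minimal sketch; the `TagCounter` handler and the byte chunks are made up for illustration.

```python
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader


class TagCounter(xml.sax.handler.ContentHandler):
    """Hypothetical handler that counts opening tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()
# The default expat-based reader should support feed()/close().
is_incremental = isinstance(parser, xml.sax.xmlreader.IncrementalParser)

handler = TagCounter()
parser.setContentHandler(handler)
# Chunk boundaries may fall anywhere, even in the middle of a tag.
for chunk in (b"<root><it", b"em/><item/>", b"</root>"):
    parser.feed(chunk)
parser.close()

print(is_incremental, handler.count)
```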

For example, given an XML stream such as:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <entry><a>value 0</a><b foo='bar' /></entry>
      <entry><a>value 1</a><b foo='baz' /></entry>
      <entry><a>value 2</a><b foo='quz' /></entry>
      ...
    </root>

It can be processed as follows.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import xml.sax
    import xml.sax.handler


    class StreamHandler(xml.sax.handler.ContentHandler):

        lastEntry = None
        lastName = None

        def startElement(self, name, attrs):
            self.lastName = name
            if name == 'entry':
                self.lastEntry = {}
            elif name != 'root':
                self.lastEntry[name] = {'attrs': attrs, 'content': ''}

        def endElement(self, name):
            if name == 'entry':
                print({
                    'a': self.lastEntry['a']['content'],
                    'b': self.lastEntry['b']['attrs'].getValue('foo'),
                })
                self.lastEntry = None
            elif name == 'root':
                # signal the end of the document to the caller
                raise StopIteration

        def characters(self, content):
            # ignore text that is not inside a child element of <entry>
            if self.lastEntry is not None and self.lastName in self.lastEntry:
                self.lastEntry[self.lastName]['content'] += content


    if __name__ == '__main__':
        # use the default xml.sax.expatreader
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())

        # feed the parser with small chunks to simulate a stream
        with open('data.xml', 'rb') as f:
            while True:
                buffer = f.read(16)
                if not buffer:
                    break  # end of file reached before </root>
                try:
                    parser.feed(buffer)
                except StopIteration:
                    break

        # if you can provide a file-like object, it is as simple as
        with open('data.xml', 'rb') as f:
            try:
                parser.parse(f)
            except StopIteration:
                pass  # the handler raises this at </root>
+14

Are you looking for xml.sax? It is right there in the standard library.

+8
