I usually suggest ElementTree's iterparse or, for maximum speed, its counterpart in lxml. Also consider the multiprocessing module (the former Processing package, shipped with Python 2.6) for parallelization.
The important thing about iterparse is that it hands you each element (sub)structure as soon as it has been parsed.
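    # cElementTree is the C-accelerated module on Python 2; on Python 3 use xml.etree.ElementTree
    import xml.etree.cElementTree as ET
    xml_it = ET.iterparse("some.xml")
    event, elem = next(xml_it)  # next() works on Python 2.6+ and Python 3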
With the default arguments, event is always the string "end", but you can also initialize the parser so that it reports "start" events as new elements are opened. On a "start" event you have no guarantee that the element's children have been parsed yet, but its attributes are available if you need them.
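A minimal sketch of requesting both event types, assuming Python 3 and a small in-memory document (the SAMPLE bytes and the trace_events helper are made up for illustration):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    b'<root>'
    b'<item id="1"><name>a</name></item>'
    b'<item id="2"><name>b</name></item>'
    b'</root>'
)

def trace_events(source):
    """Return (event, tag, attributes) for every start and end event."""
    seen = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # On "start" only the tag and attributes are reliable;
        # text and children may still be incomplete.
        seen.append((event, elem.tag, dict(elem.attrib)))
    return seen

events = trace_events(io.BytesIO(SAMPLE))
for event, tag, attrib in events:
    print(event, tag, attrib)
```

Do the real work in the "end" branch, where the element and everything below it is complete.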
Another point is that you can stop reading from the iterator early, that is, before the entire document has been processed.
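Stopping early could look roughly like this. It is a sketch assuming Python 3; find_first and the sample data are hypothetical names, not part of ElementTree:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    b'<root>'
    b'<item><name>a</name></item>'
    b'<item><name>b</name></item>'
    b'</root>'
)

def find_first(source, tag):
    """Return the text of the first element named `tag`, or None."""
    for event, elem in ET.iterparse(source):
        if elem.tag == tag:
            # Returning here abandons the iterator; the rest of the
            # document is never handed to us.
            return elem.text
    return None

first = find_first(io.BytesIO(SAMPLE), "name")
print(first)
```

Here the caller passes an open file-like object, so closing it remains the caller's responsibility.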
If the files are large (are they?), there is a common idiom that keeps memory usage constant, just as a streaming parser would.
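One widely used form of that idiom, as far as I know it, is to grab the root element from the first "start" event and clear it after each record has been handled. A sketch assuming Python 3, with iter_items and the sample data being hypothetical:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    b'<root>'
    b'<item><name>a</name></item>'
    b'<item><name>b</name></item>'
    b'</root>'
)

def iter_items(source, tag):
    """Yield each completed `tag` element, then release it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Once the consumer is done with elem, drop everything the
            # root has accumulated so the tree does not keep growing.
            root.clear()

names = [item.findtext("name") for item in iter_items(io.BytesIO(SAMPLE), "item")]
print(names)
```

Note that root.clear() also wipes the root's own attributes, so read those first if you need them.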