What is the most efficient way to extract information from a large number of XML files in Python?

I have a whole directory of XML files (on the order of 10³ to 10⁴ of them) from which I need to extract the contents of several fields. I have tested different XML parsers, and since I don't need to validate the contents (which is expensive), I was thinking of simply using xml.parsers.expat (the fastest) to go through the files one by one and extract the data.

  • Is there a more efficient way? (A simple text match does not work.)
  • Do I need to create a new ParserCreate() for each new file (or string), or can I reuse the same one for every file? (See the sketch below for what I have in mind.)
  • Any caveats?
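For reference, here is a rough sketch of the expat approach I have in mind; the field name "title" and the tracking logic are just placeholders for my real fields:

    import xml.parsers.expat

    def extract_fields(path, wanted=("title",)):     # "title" is a placeholder field name
        """Collect the text content of a few elements from a single file."""
        results = dict((name, []) for name in wanted)
        current = [None]                              # element we are currently inside

        def start(name, attrs):
            if name in wanted:
                current[0] = name

        def end(name):
            if name == current[0]:
                current[0] = None

        def chars(data):
            if current[0] is not None:
                results[current[0]].append(data)

        parser = xml.parsers.expat.ParserCreate()     # a fresh parser per file -- is this needed?
        parser.StartElementHandler = start
        parser.EndElementHandler = end
        parser.CharacterDataHandler = chars
        with open(path, "rb") as f:
            parser.ParseFile(f)
        return dict((name, "".join(parts)) for name, parts in results.items())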

Thanks!

4 answers

The fastest way would be to match strings (e.g., with regular expressions) instead of parsing the XML; depending on your XML data, this might actually work.

But most importantly: instead of weighing the options in your head, just implement them and time them on a small sample. It takes about the same amount of time and gives you real numbers to act on.
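For instance, a throwaway harness along these lines (assuming a directory "xmldir" of sample files; both parse functions are minimal stand-ins for whatever approaches you want to compare):

    import glob
    import time
    import xml.etree.cElementTree as ET
    import xml.parsers.expat

    def parse_expat(path):
        parser = xml.parsers.expat.ParserCreate()
        with open(path, "rb") as f:
            parser.ParseFile(f)              # parse only, no extraction

    def parse_iterparse(path):
        for _event, elem in ET.iterparse(path):
            elem.clear()                     # walk the whole tree, discarding elements

    sample = glob.glob("xmldir/*.xml")[:200]   # small representative subset
    for name, func in (("expat", parse_expat), ("iterparse", parse_iterparse)):
        started = time.time()
        for p in sample:
            func(p)
        print(name, time.time() - started)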

EDIT:

  • Are the files on a local drive or on a network drive? Network I/O will kill you here.
  • The problem parallelizes trivially: you can split the work across several machines (or several processes on a multi-core machine); see the sketch below.
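A minimal sketch of the multi-process split, assuming an extract() function that handles one file (here a trivial cElementTree stand-in, with "title" and "xmldir" as placeholders):

    import glob
    import multiprocessing
    import xml.etree.cElementTree as ET

    def extract(path):
        """Parse one file and pull out the field of interest."""
        node = ET.parse(path).find(".//title")      # placeholder field
        return path, node.text if node is not None else None

    if __name__ == "__main__":
        files = glob.glob("xmldir/*.xml")           # placeholder directory
        pool = multiprocessing.Pool()               # one worker per core by default
        results = pool.map(extract, files, chunksize=100)
        pool.close()
        pool.join()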

I would usually suggest ElementTree's iterparse, or, for extra speed, its counterpart from lxml. Also try to use multiprocessing (included in the standard library since Python 2.6) to parallelize.

The important thing about iterparse is that you get the element (sub-)structures as they are parsed.

    import xml.etree.cElementTree as ET

    xml_it = ET.iterparse("some.xml")
    event, elem = xml_it.next()

event will always be the string "end" in this case, but you can also initialize the parser to tell you about new elements as they are started. At that point you have no guarantee that all the children have been parsed yet, but the attributes are already there if that is what you are interested in.

Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.

If the files are large (are they?), there is a common idiom for keeping memory usage constant, just as with a streaming parser.
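The idiom, as far as I can sketch it, is to clear elements once you are done with them so the in-memory tree never grows; "record" and "title" below are placeholder names:

    import xml.etree.cElementTree as ET

    def extract(path, tag="record", field="title"):     # placeholder names
        """Yield one field per record while keeping memory use roughly constant."""
        context = ET.iterparse(path, events=("start", "end"))
        _, root = next(context)                          # the first event gives us the root
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                yield elem.findtext(field)
                root.clear()                             # drop already-processed children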


If you know that the XML files are generated by the same algorithm, it may be more efficient not to parse the XML at all. For example, if you know that the data is on lines 3, 4 and 5, you could read the file line by line and then use regular expressions.

Of course, this approach fails if the files are not machine-generated, come from different generators, or if the generator changes over time. However, I am optimistic that it would be more efficient.
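A rough sketch of that idea, assuming the value you want sits on a known line in a form like <title>...</title>:

    import re

    TITLE_RE = re.compile(r"<title>(.*?)</title>")   # assumed markup, adjust as needed

    def grab_title(path, line_no=3):                 # assumed line number
        with open(path) as f:
            for i, line in enumerate(f, 1):
                if i == line_no:
                    m = TITLE_RE.search(line)
                    return m.group(1) if m else None
        return None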

Whether or not you reuse the parser objects is largely irrelevant. Many other objects will be created along the way, so one parser object more or less does not amount to much.


One thing you haven't mentioned is whether or not you are reading the XML into a DOM. I'm guessing you probably aren't, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of the DOM will give you a significant performance boost.
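For illustration, a bare-bones SAX handler that collects the character data of one element ("title" again stands in for whatever fields you need):

    import xml.sax

    class FieldHandler(xml.sax.ContentHandler):
        """Accumulate the character data of a single element of interest."""
        def __init__(self, field="title"):            # placeholder field name
            xml.sax.ContentHandler.__init__(self)
            self.field = field
            self.inside = False
            self.parts = []

        def startElement(self, name, attrs):
            if name == self.field:
                self.inside = True

        def endElement(self, name):
            if name == self.field:
                self.inside = False

        def characters(self, content):
            if self.inside:
                self.parts.append(content)

    handler = FieldHandler()
    xml.sax.parse("some.xml", handler)
    print("".join(handler.parts))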

