How should I parse large XML files in Perl?

Does reading XML data, as in the following code, create a DOM tree in memory?

use XML::Simple;
my $xml  = XML::Simple->new;
my $data = $xml->XMLin($blast_output, ForceArray => 1);

For large XML files, do I have to use a SAX parser with handlers, etc.?

+4
3 answers

I would say yes to both. The XML::Simple library will create the entire tree in memory, and the tree will take up a large multiple of the file size. For many applications, if your XML is more than 100 MB or so, it is almost impossible to load it fully into memory in Perl. A SAX parser is a way to get "events", or notifications, as the file is read and as tags open or close.
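To make the "events" idea concrete, here is a minimal sketch of a SAX handler. It assumes the XML::SAX distribution is installed and that the elements of interest are named Hit, as in NCBI BLAST XML output; the HitCounter package and the hit_count field are hypothetical names made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical handler: counts <Hit> elements as they stream past,
# without ever building a tree of the document.
package HitCounter;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $element) = @_;
    # $element->{Name} holds the tag name of the element being opened.
    $self->{hit_count}++ if $element->{Name} eq 'Hit';
}

package main;
use XML::SAX::ParserFactory;

my $handler = HitCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# For a real file, parse_file($path) would be used instead of parse_string.
$parser->parse_string('<Hits><Hit/><Hit/></Hits>');

print "Hits seen: $handler->{hit_count}\n";
```

Memory use stays roughly constant because the handler keeps only a counter, not the elements themselves.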

Depending on your usage patterns, either a SAX parser or a DOM parser may be faster. If you only need to process a few nodes, or to visit each node once in a large file, SAX mode is probably best. An example is reading a large RSS feed and handling every element in it in turn.

On the other hand, if you need to cross-reference one part of a file with another, a DOM parser or access through XPath will make more sense. Writing that with the inside-out approach that a SAX parser requires would be awkward and complicated.
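As a rough illustration of the cross-referencing case, here is a sketch using XML::LibXML in DOM mode with XPath. The library/authors/books document and its id and author attributes are invented for the example, and it assumes the whole document fits in memory.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample document: books refer to authors by id.
my $xml = <<'XML';
<library>
  <authors>
    <author id="a1">Ada</author>
    <author id="a2">Bob</author>
  </authors>
  <books>
    <book author="a2">Streams</book>
    <book author="a1">Trees</book>
  </books>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# With the whole tree in memory, each book can look up its author
# elsewhere in the document with a second XPath query.
my %author_of;
for my $book ($doc->findnodes('/library/books/book')) {
    my $id   = $book->getAttribute('author');
    my $name = $doc->findvalue(qq{/library/authors/author[\@id="$id"]});
    $author_of{ $book->textContent } = $name;
}

print "$_ was written by $author_of{$_}\n" for sort keys %author_of;
```

Doing the same with SAX would mean buffering the authors yourself and hoping they arrive before the books, which is the awkwardness described above.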

I recommend using a SAX parser at least once, because the different way of thinking it requires is a good exercise.

I have had good success with XML::SAX::Machines for setting up SAX parsing in Perl, especially if you want to easily configure multiple filters and pipelines. For simpler setups (that is, 99% of cases) you only need one SAX filter (look at XML::Filter::Base) and can tell XML::SAX::Machines to simply parse the file (or read from a filehandle) using your filter. Here is a detailed article.

+4

For large XML files, you can use XML::LibXML in DOM mode if the document fits in memory, or use its pull mode (see XML::LibXML::Reader), or use XML::Twig (which I wrote, so I am biased, but it generally works well for files that are too large to fit in memory).
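A minimal sketch of the XML::Twig approach, assuming BLAST-style Hit elements that each contain a Hit_id child (the sample document is made up for illustration). Each Hit is handed to the handler as a small subtree and then discarded.

```perl
use strict;
use warnings;
use XML::Twig;

my @hit_ids;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <Hit> element has been parsed.
        Hit => sub {
            my ($t, $hit) = @_;
            push @hit_ids, $hit->first_child_text('Hit_id');
            $t->purge;    # release everything parsed so far
        },
    },
);

# For a real file, $twig->parsefile($path) would be used instead.
$twig->parse(
    '<BlastOutput>'
  . '<Hit><Hit_id>gi|1</Hit_id></Hit>'
  . '<Hit><Hit_id>gi|2</Hit_id></Hit>'
  . '</BlastOutput>'
);

print "$_\n" for @hit_ids;
```

The purge call is what keeps memory flat: without it the twig would grow into a full tree of the document.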

I'm not a fan of SAX, which is hard to use and actually quite slow.

+14

I have not used the XML::Simple module before, but according to the documentation it builds a simple hash structure in memory. This is not a complete DOM tree, but it may well be enough for your requirements.

For large XML files, using a SAX parser will be faster and use less memory, but then again it depends on your needs. If you just need to process the data serially, then XML::SAX is likely to suit you. If you need to manipulate the entire tree, you might be better off using something like XML::LibXML.

It is all horses for courses, I'm afraid.

+4
