Which Perl XML parser would you choose for files larger than 15 GB?

I know that there are some very good Perl XML parsers, such as XML::Xerces, XML::Parser::Expat, XML::Simple, XML::RapidXML, XML::LibXML, XML::Liberal, etc.

Which XML parser would you choose for parsing large files, and by what criteria would you rank one above another? If the one you would pick is not listed, please suggest it.

+4
7 answers

If you are parsing files of this size, you need to avoid any parser that tries to load the entire document into memory and build a Document Object Model (DOM).

Instead, look for a SAX-style parser, one that treats the input file as a stream and raises events as elements and attributes are encountered. This approach lets you process the file incrementally, without having to hold the whole thing in memory at once.
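As a rough illustration of that event model, here is a minimal sketch using XML::SAX. The `RecordCounter` class, the `record` element name, and the `huge.xml` file name are assumptions made up for the example, not anything from the question. The handler keeps a single counter, so memory use should not grow with the size of the input.

```perl
use strict;
use warnings;

# Hypothetical handler class: counts <record> elements as they stream past.
package RecordCounter;
use base 'XML::SAX::Base';

sub start_document {
    my ($self, $doc) = @_;
    $self->{count} = 0;          # reset state at the start of each document
}

sub start_element {
    my ($self, $el) = @_;
    # $el->{Name} is the element name as written in the source
    $self->{count}++ if $el->{Name} eq 'record';
}

package main;
use XML::SAX::ParserFactory;

my $handler = RecordCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# For a real file you would call: $parser->parse_uri('huge.xml');
# A small inline document keeps this sketch self-contained.
$parser->parse_string('<data><record id="1"/><record id="2"/></data>');

print "records seen: $handler->{count}\n";
```

Which underlying parser XML::SAX::ParserFactory picks depends on what is installed; the handler code stays the same either way.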

+14

With a 15 GB file, your parser needs to be SAX-based, because at that size simply being able to get through the data is your first concern.

I recommend you read XML::SAX::Intro.

+9

SAX is one option. Other options that do not involve loading the entire document into memory are XML::Twig and XML::Rules.

+5

I have always used XML::Parser to parse files like this. It is simple, available everywhere, and works well.
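For what it is worth, a minimal sketch of the XML::Parser handler style might look like the following. The sample document and the `huge.xml` file name are placeholders invented for the example. Only a small tally hash is kept, so nothing is accumulated per element.

```perl
use strict;
use warnings;
use XML::Parser;

my %count;   # element name => number of times it was opened

my $parser = XML::Parser->new(
    Handlers => {
        # Called once per start tag: ($expat, $element_name, attr => value, ...)
        Start => sub {
            my ($expat, $name, %attr) = @_;
            $count{$name}++;
        },
    },
);

# For a real file you would call: $parser->parsefile('huge.xml');
# A small inline document keeps this sketch self-contained.
$parser->parse('<data><record id="1"/><record id="2"/></data>');

printf "%-10s %d\n", $_, $count{$_} for sort keys %count;
```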

+4

You may also consider using a database with XML extensions (see here). You can bulk load the XML data into the database and then run SQL queries (or XQuery) against it.

+3

As you would expect, I suggest XML::Twig, which lets you process the file one fragment at a time. This, of course, assumes that your file can be processed that way. It will probably be easier to use than SAX, since you can work on the tree for each fragment using DOM-like methods.
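A minimal sketch of that fragment-by-fragment style, assuming a hypothetical document made of repeated `record` elements with an `id` attribute and a `name` child (none of which come from the question):

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;   # small summary per record; the records themselves are discarded

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <record> element has been parsed.
        record => sub {
            my ($t, $record) = @_;
            # DOM-like access, but only on this one fragment
            push @seen, $record->att('id') . ':' . $record->first_child_text('name');
            $t->purge;   # release the parts of the tree already processed
        },
    },
);

# For a real file you would call: $twig->parsefile('huge.xml');
# A small inline document keeps this sketch self-contained.
$twig->parse(
    '<data><record id="1"><name>a</name></record>'
  . '<record id="2"><name>b</name></record></data>'
);

print "$_\n" for @seen;
```

Calling `purge` inside the handler is what keeps memory bounded; without it the twig would keep building the full tree.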

An alternative would be to use a parser in pull mode, which is a bit like what XML::Twig offers.

+3

I am going to offer a variation on tster's answer above. Load the damn thing into a database (directly via XML import if possible; if not, by using a SAX parser to parse the file and produce loadable data sets). Then use the database as your data store. At 15 GB, you are well past the amount of data that ought to be manipulated outside a database.

+2