Parsing extremely large XML files in PHP

I need to parse 40GB XML files, then normalize the data and insert it into a MySQL database. It is not yet clear how much of each file I need to store in the database, and I don't know the XML structure.

Which parser should be used, and how would you do it?

+7
2 answers

In PHP, you can read extremely large XML files with XMLReader:

    $reader = new XMLReader();
    $reader->open($xmlfile);
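A minimal usage sketch (the file path is an assumption): XMLReader::open() returns false on failure, and libxml_use_internal_errors() lets you collect parse errors instead of having them emitted as warnings:

    libxml_use_internal_errors(true);        // collect libxml errors instead of printing warnings

    $xmlfile = '/path/to/large.xml';         // assumed path
    $reader  = new XMLReader();

    if (!$reader->open($xmlfile)) {
        die('Could not open ' . $xmlfile);
    }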

Extremely large XML files should be stored compressed on disk, which makes sense because XML compresses very well. For example, gzipped as large.xml.gz.

PHP supports this with XMLReader via the compression stream wrapper:

    $xmlfile = 'compress.zlib://path/to/large.xml.gz';

    $reader = new XMLReader();
    $reader->open($xmlfile);

XMLReader lets you work with the current element "only". That means it is forward-only: you cannot go back. If you need to keep parser state, you have to build it yourself.
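For illustration, a sketch of such a forward-only loop that maintains its own bit of state (a simple element counter); the record element name and the file path are assumptions:

    $reader = new XMLReader();
    $reader->open('compress.zlib://path/to/large.xml.gz');

    $count = 0;                                // state maintained by the application

    while ($reader->read()) {                  // forward-only: one node at a time
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
            $count++;                          // count <record> elements (assumed name)
        }
    }

    $reader->close();
    echo "$count records\n";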

I often find it useful to wrap the basic movements in a set of iterators that know how to operate on XMLReader, e.g. iterating over elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
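As a rough illustration of the idea (not the iterator classes from the linked article), a generator can yield one element at a time as SimpleXML; the record element name is again an assumption:

    // Yield each matching element as SimpleXML; only the current subtree is in memory.
    function elements(XMLReader $reader, string $name): Generator
    {
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
                yield simplexml_import_dom($reader->expand());
            }
        }
    }

    $reader = new XMLReader();
    $reader->open('compress.zlib://path/to/large.xml.gz');

    foreach (elements($reader, 'record') as $record) {
        // process one small element at a time
    }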


+11

It would be good to know what you are actually going to do with the XML. How you parse it depends a lot on the processing you need to do, as well as on the size.

If this is a one-off job, then in the past I have started by exploring the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job; there are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.

You can write an application that processes the data using a pull or push parser (SAX or StAX). How easy that is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively, you could try the streaming XSLT processing available in Saxon-EE.
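In PHP, the pull-parser route is the XMLReader approach from the first answer. A rough sketch of what the processing loop might look like once the structure is known; the item element, its name and price children, the table and the DSN are all assumptions:

    $pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');    // assumed DSN
    $stmt = $pdo->prepare('INSERT INTO items (name, price) VALUES (?, ?)'); // assumed table

    $reader = new XMLReader();
    $reader->open('compress.zlib://path/to/large.xml.gz');

    $pdo->beginTransaction();                    // batch the inserts for speed

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            // materialize just this one element, then write it out
            $item = simplexml_import_dom($reader->expand());
            $stmt->execute([(string) $item->name, (string) $item->price]);
        }
    }

    $pdo->commit();
    $reader->close();

With a file this size it is usually worth committing in batches (e.g. every few thousand rows) rather than in one giant transaction.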

+2
