Problem reading files larger than 1 GB using XMLReader

Is there a maximum file size that XMLReader can handle?

I am trying to process an XML feed of about 3 GB in size. There is certainly no PHP error; the script runs fine and successfully loads data into the database after it has run.

The script also works fine with smaller test feeds of 1 GB or less. However, when processing larger feeds, the script stops reading the XML file after about 1 GB and continues running the rest of the script.

Has anyone had a similar problem? And if so, how did you work around it?

Thanks in advance.

+4
6 answers

I had the same problem recently and decided to share my experience.

It seems that the problem is how PHP was compiled: whether it was built with support for 64-bit file sizes/offsets or only 32-bit ones.

With 32 bits you can only address up to 4 GB of data. You can find a somewhat confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html
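As a quick sanity check (a generic snippet, not from the linked article): PHP_INT_SIZE and PHP_INT_MAX tell you whether your build uses 32-bit or 64-bit integers. That is only a first hint; an 8-byte integer does not by itself guarantee that large-file support was compiled in.

 <?php
 // PHP_INT_SIZE is 4 on a 32-bit build (PHP_INT_MAX = 2147483647)
 // and 8 on a 64-bit build (PHP_INT_MAX = 9223372036854775807).
 var_dump(PHP_INT_SIZE, PHP_INT_MAX);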

I ended up splitting my files with the Perl utility xml_split, which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split

I used it to split my huge XML file into manageable chunks. The nice thing about this tool is that it splits the XML file on whole elements. Unfortunately, it is not very fast.

I only needed to do this once and it suited my needs, but I would not recommend it for repeated use. After splitting, I used XMLReader on the smaller files of about 1 GB each.
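For illustration, here is a minimal sketch of feeding the split chunks to XMLReader one at a time. The file name pattern feed-*.xml and the element name item are assumptions; use whatever names xml_split actually produced for you.

 <?php
 // Hypothetical names: adjust the glob pattern and element name to your feed.
 foreach (glob('feed-*.xml') as $chunk) {
     $r = new XMLReader();
     $r->open($chunk);
     while ($r->read()) {
         if ($r->nodeType === XMLReader::ELEMENT && $r->name === 'item') {
             // ... read the element and load it into the database ...
         }
     }
     $r->close();
 }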

+2

Splitting the file will definitely help. Other things to try...

Depending on your OS, there may also be a 2 GB limit on the block of RAM you can allocate. That is quite possible if you are running a 32-bit OS.

+1

It should be noted that PHP in general has a maximum file size it can handle. PHP does not have unsigned integers or long integers, which means integers are limited to 2^31 (or 2^63 on 64-bit systems). This matters because PHP uses an integer for the file pointer (your position in the file while reading), so it cannot process a file larger than 2^31 bytes.

However, that limit is well above 1 gigabyte. I have run into problems at two gigabytes (as expected, since 2^31 is roughly 2 billion).
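To illustrate the consequence (a generic sketch, with a placeholder file name): functions that return file offsets or sizes, such as filesize() and ftell(), return plain PHP integers, so on a 32-bit build they can come out wrong once a file passes 2^31 bytes.

 <?php
 // 'huge.xml' is a placeholder path to a file larger than 2 GB.
 // On a 32-bit PHP build this value may wrap around and be reported wrong
 // or negative; on a 64-bit build it is reported correctly.
 var_dump(filesize('huge.xml'));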

+1

I had a similar problem parsing large documents. What I ended up doing was breaking the feed into smaller chunks using filesystem functions and then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, read the file as a stream with string functions, and once you have a complete record in the buffer, parse it with the XML functions... It sucks, but it works quite well (and is very memory efficient, since you only ever have one record in memory at a time)...
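A minimal sketch of that approach, assuming a flat feed of <record> elements in a file called feed.xml (both names are hypothetical):

 <?php
 // Read the feed as a byte stream and parse one complete <record> at a time.
 $fp = fopen('feed.xml', 'rb');
 $buffer = '';
 while (!feof($fp)) {
     $buffer .= fread($fp, 8192);
     // Extract every complete <record>...</record> currently in the buffer.
     while (($end = strpos($buffer, '</record>')) !== false) {
         $start  = strpos($buffer, '<record');
         $record = substr($buffer, $start, $end + strlen('</record>') - $start);
         $buffer = substr($buffer, $end + strlen('</record>'));
         // Only this one record is held in memory; parse it with the XML functions.
         $xml = simplexml_load_string($record);
         // ... process $xml ...
     }
 }
 fclose($fp);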

0

Do you get any errors with

 libxml_use_internal_errors(true);
 libxml_clear_errors();

 // your parser stuff here....
 $r = new XMLReader(...);
 // ....

 foreach( libxml_get_errors() as $err ) {
     printf(". %d %s\n", $err->code, $err->message);
 }

when the parser stops prematurely?

0

Using Windows XP with NTFS as the file system and PHP 5.3.2, I had no problems with this test script:

 <?php
 define('SOURCEPATH', 'd:/test.xml');

 if ( 0 ) {
     build();   // generate the test file
 } else {
     echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
     timing('read');
 }

 // Runs $fn and prints start time, end time and elapsed minutes:seconds.
 function timing($fn) {
     $start = new DateTime();
     echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
     $fn();
     $end = new DateTime();
     echo 'end: ', $start->format('Y-m-d H:i:s'), "\n"; // note: prints $start again, hence the bogus end time below
     echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
 }

 // Walks the whole document with XMLReader and counts element nodes.
 function read() {
     $cnt = 0;
     $r = new XMLReader;
     $r->open(SOURCEPATH);
     while( $r->read() ) {
         if ( XMLReader::ELEMENT === $r->nodeType ) {
             if ( 0 === ++$cnt % 500000 ) {
                 echo '.';
             }
         }
     }
     echo "\n#elements: ", $cnt, "\n";
 }

 // Writes 60,000,000 <item> elements (~1.38 GB) to SOURCEPATH.
 function build() {
     $fp = fopen(SOURCEPATH, 'wb');
     $s = '<catalogue>';
     //for($i = 0; $i < 500000; $i++) {
     for($i = 0; $i < 60000000; $i++) {
         $s .= sprintf('<item>%010d</item>', $i);
         if ( 0 === $i % 100000 ) {
             fwrite($fp, $s);
             $s = '';
             echo $i/100000, ' ';
         }
     }
     $s .= '</catalogue>';
     fwrite($fp, $s);
     fflush($fp);
     fclose($fp);
 }

Output:

 filesize: 1,380,000,023
 start: 2010-08-07 09:43:31
 ........................................................................................................................
 #elements: 60000001
 end: 2010-08-07 09:43:31
 diff: 07:31

(as you can see, I messed up the output of the end time, but I do not want to run this script for another 7+ minutes ;-))

Does this also work on your system?


As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes. In this case, my slow hard drive may well be the (or at least one) limiting factor.

 filesize: 1.380.000.023
 start: 2010-08-07 09:55:24
 ........................................................................................................................
 #elements: 60000001
 end: 2010-08-07 09:56:05
 diff: 00:41

and source:

 using System;
 using System.IO;
 using System.Xml;

 namespace ConsoleApplication1
 {
     class SOTest
     {
         delegate void Foo();
         const string sourcepath = @"d:\test.xml";

         static void timing(Foo bar)
         {
             DateTime dtStart = DateTime.Now;
             System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
             bar();
             DateTime dtEnd = DateTime.Now;
             System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
             TimeSpan s = dtEnd.Subtract(dtStart);
             System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
         }

         static void readTest()
         {
             XmlTextReader reader = new XmlTextReader(sourcepath);
             int cnt = 0;
             while (reader.Read())
             {
                 if (XmlNodeType.Element == reader.NodeType)
                 {
                     if (0 == ++cnt % 500000)
                     {
                         System.Console.Write('.');
                     }
                 }
             }
             System.Console.WriteLine("\n#elements: " + cnt + "\n");
         }

         static void Main()
         {
             FileInfo f = new FileInfo(sourcepath);
             System.Console.WriteLine("filesize: {0:N0}", f.Length);
             timing(readTest);
             return;
         }
     }
 }
0
