"Out of memory" when parsing a large (100 MB) XML file using Perl

I get an "Out of memory" error while parsing a large (100 MB) XML file with the following code:

    use strict;
    use warnings;
    use XML::Twig;

    my $twig=XML::Twig->new();

    my $data = XML::Twig->new
        ->parsefile("divisionhouserooms-v3.xml")
        ->simplify( keyattr => []);

    my @good_division_numbers = qw( 30 31 32 35 38 );

    foreach my $property ( @{ $data->{DivisionHouseRoom}}) {
        my $house_code = $property->{HouseCode};
        print $house_code, "\n";

        my $amount_of_bedrooms = 0;
        foreach my $division ( @{ $property->{Divisions}->{Division} } ) {
            next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
            $amount_of_bedrooms += $division->{DivisionQuantity};
        }

        open my $fh, ">>", "Result.csv" or die $!;
        print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
        close $fh;
    }

What can I do to fix this error?

2 answers

Processing large XML files that do not fit into memory is exactly what XML::Twig advertises:

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW, storing an XML document in memory as a tree is quite memory-expensive; the expansion factor is often around 10).

To do this, you can define handlers that will be called once a specific element has been completely parsed. In these handlers, you can access the element and process it as you see fit (...)
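
As a minimal sketch of the handler mechanism described in that quote: the element names (`catalog`, `item`, `name`, `qty`) and the inline sample document are made up for illustration, and a real run would call `parsefile` on the large file instead of `parse` on a string.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data standing in for a large file.
my $xml = <<'XML';
<catalog>
  <item><name>alpha</name><qty>2</qty></item>
  <item><name>beta</name><qty>5</qty></item>
  <item><name>gamma</name><qty>1</qty></item>
</catalog>
XML

my %qty_for;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time an <item> element has been completely parsed.
        item => sub {
            my ( $t, $item ) = @_;
            $qty_for{ $item->first_child_text('name') }
                = $item->first_child_text('qty');
            $t->purge;    # discard what has been parsed so far
        },
    },
);
$twig->parse($xml);

printf "%s: %d\n", $_, $qty_for{$_} for sort keys %qty_for;
```

Because the handler purges after every `item`, only one `item` subtree needs to be held in memory at a time, regardless of how many the document contains.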


The code posted in the question does not take advantage of XML::Twig's strengths: calling the simplify method on the whole document makes it little better than XML::Simple, because the entire tree is still built in memory first.

What is missing from the code is twig_handlers or twig_roots, which make the parser focus on the relevant parts of the XML document in a memory-efficient way.
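
To illustrate the effect of twig_roots, here is a small sketch. The `export`, `header` and `room` elements are hypothetical; the point is that only the elements named in twig_roots are built into the tree, so the unneeded `header` part of the document never takes up memory.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document: only the <room> elements are of interest.
my $xml = <<'XML';
<export>
  <header><generated>2012-01-01</generated><note>not needed</note></header>
  <room><code>A1</code></room>
  <room><code>B2</code></room>
</export>
XML

my @codes;

my $twig = XML::Twig->new(
    twig_roots => {
        # The tree is built only for <room> elements; the handler runs
        # once each of them has been fully parsed.
        room => sub {
            my ( $t, $room ) = @_;
            push @codes, $room->first_child_text('code');
        },
    },
);
$twig->parse($xml);

# No purge was done here, so the resulting tree can be inspected:
# it holds the <room> elements but no <header>.
my @headers = $twig->root->children('header');
my @rooms   = $twig->root->children('room');

print "codes: @codes\n";
print "rooms in tree: ", scalar(@rooms), ", headers in tree: ", scalar(@headers), "\n";
```

With twig_handlers alone, the whole document would still be built unless the handler purges; twig_roots skips the uninteresting parts up front.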

Without seeing the XML, it is hard to say whether processing the document chunk by chunk or processing only selected parts is the better approach, but either one should solve this problem.

Thus, the code should look something like this (chunk-by-chunk demo):

    use strict;
    use warnings;
    use XML::Twig;
    use List::Util 'sum';   # To make life easier
    use Data::Dump 'dump';  # To see what's going on

    my %bedrooms;           # Data structure to store the wanted info

    my $xml = XML::Twig->new (
        twig_roots => {
            DivisionHouseRoom => \&count_bedrooms,
        }
    );

    $xml->parsefile( 'divisionhouserooms-v3.xml');

    sub count_bedrooms {
        my ( $twig, $element ) = @_;

        my @divParents = $element->children( 'Divisions' );
        my $id = $element->first_child_text( 'HouseCode' );

        for my $divParent ( @divParents ) {
            my @divisions = $divParent->children( 'Division' );
            my $total = sum map { $_->text } @divisions;
            $bedrooms{$id} = $total;
        }

        $element->purge;    # Free up memory
    }

    dump \%bedrooms;
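
The same approach can be combined with the filtering logic from the question. The following is only a sketch: the real file was not shown, so it assumes that HouseCode, DivisionNumber and DivisionQuantity are child elements rather than attributes, and it uses an inline sample document and an in-memory filehandle in place of divisionhouserooms-v3.xml and Result.csv.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed layout; the actual structure of the file is unknown.
my $xml = <<'XML';
<DivisionHouseRooms>
  <DivisionHouseRoom>
    <HouseCode>XX-1234-01</HouseCode>
    <Divisions>
      <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
      <Division><DivisionNumber>99</DivisionNumber><DivisionQuantity>4</DivisionQuantity></Division>
      <Division><DivisionNumber>35</DivisionNumber><DivisionQuantity>1</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
  <DivisionHouseRoom>
    <HouseCode>XX-5678-02</HouseCode>
    <Divisions>
      <Division><DivisionNumber>12</DivisionNumber><DivisionQuantity>3</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
</DivisionHouseRooms>
XML

# A hash lookup replaces the grep over @good_division_numbers.
my %is_bedroom = map { $_ => 1 } qw( 30 31 32 35 38 );

# Open the output once, not once per record. An in-memory filehandle
# keeps the sketch self-contained; a real run would open "Result.csv".
open my $out, '>', \my $tsv or die "Cannot open output: $!";

my $twig = XML::Twig->new(
    twig_roots => {
        DivisionHouseRoom => sub {
            my ( $t, $room ) = @_;
            my $bedrooms = 0;
            for my $division ( $room->descendants('Division') ) {
                next unless $is_bedroom{ $division->first_child_text('DivisionNumber') };
                $bedrooms += $division->first_child_text('DivisionQuantity');
            }
            print {$out} join( "\t", $room->first_child_text('HouseCode'), $bedrooms ), "\n";
            $t->purge;    # free the record that was just written
        },
    },
);
$twig->parse($xml);    # for the real file: $twig->parsefile('divisionhouserooms-v3.xml')
close $out;

print $tsv;
```

Writing each row as soon as its record has been parsed means nothing has to accumulate in memory, and opening the output file once avoids an open/close pair per house.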

See the section "Processing an XML document chunk by chunk" in the XML::Twig documentation. It specifically discusses how to process a document piece by piece, which is what makes it possible to handle large XML files.

