Parsing large XML files?

Question

I have two XML files: one is 115 MB and the other is 34 MB.

While reading file A, I use its desc field to link each entry to file B: I extract the id field from the entry in file B whose name equals the desc in file A.

File A is already very large, and for each of its entries I have to search inside file B, so it takes a very long time to complete.

How could I speed up this process, or what would be a better way to do it?

Current code I am using:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple qw(:strict XMLin);

    my $npcs    = XMLin('Client/client_npcs.xml',    KeyAttr => { }, ForceArray => [ 'npc_client' ]);
    my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);

    my ($nameid,$rank);

    open (my $fh, '>>', 'Output/npc_templates.xml');

    print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";

    foreach my $npc ( @{ $npcs->{npc_client} } ) {
        if (defined $npc->{desc}) {
            foreach my $string (@{$strings->{string}}) {
                if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                    $nameid = $string->{id};
                    last;
                }
            }
        } else {
            $nameid = "";
        }

        if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
            $rank = 'LEGENDARY';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
            $rank = 'HERO';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
            $rank = 'ELITE';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
            $rank = 'NORMAL';
        } else {
            $rank = $gauge;
        }

        print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
    }

    print $fh "</<npc_templates>";
    close($fh);

Example A.xml file:

    <?xml version="1.0" encoding="utf-16"?>
    <npc_clients>
      <npc_client>
        <id>200000</id>
        <name>SkillZone</name>
        <desc>STR_NPC_NO_NAME</desc>
        <dir>Monster/Worm</dir>
        <mesh>Worm</mesh>
        <material>mat_mob_reptile</material>
        <show_dmg_decal>0</show_dmg_decal>
        <ui_type>general</ui_type>
        <cursor_type>none</cursor_type>
        <hide_path>0</hide_path>
        <erect>1</erect>
        <bound_radius>
          <front>1.200000</front>
          <side>3.456000</side>
          <upper>3.000000</upper>
        </bound_radius>
        <scale>10</scale>
        <weapon_scale>100</weapon_scale>
        <altitude>0.000000</altitude>
        <stare_angle>75.000000</stare_angle>
        <stare_distance>20.000000</stare_distance>
        <move_speed_normal_walk>0.000000</move_speed_normal_walk>
        <art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
        <move_speed_normal_run>0.000000</move_speed_normal_run>
        <move_speed_combat_run>0.000000</move_speed_combat_run>
        <art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
        <in_time>0.100000</in_time>
        <out_time>0.500000</out_time>
        <neck_angle>90.000000</neck_angle>
        <spine_angle>10.000000</spine_angle>
        <ammo_bone>Bip01 Head</ammo_bone>
        <ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
        <ammo_speed>50</ammo_speed>
        <pushed_range>0.000000</pushed_range>
        <hpgauge_level>3</hpgauge_level>
        <magical_skill_boost>0</magical_skill_boost>
        <attack_delay>2000</attack_delay>
        <ai_name>SummonSkillArea</ai_name>
        <tribe>General</tribe>
        <pet_ai_name>Pet</pet_ai_name>
        <sensory_range>15.000000</sensory_range>
      </npc_client>
    </npc_clients>

Example B.xml file:

    <?xml version="1.0" encoding="utf-16"?>
    <strings>
      <string>
        <id>350000</id>
        <name>STR_NPC_NO_NAME</name>
        <body> </body>
      </string>
    </strings>

+4

performance perl xml-parsing

Guapo Nov 14 '10 at 15:14

4 answers

Grab all the interesting "desc" fields from file A and put them in a hash. You only need to parse it once, but if that still takes too long, have a look at XML::Twig.
Then go through file B and extract what you need, using the hash.

It looks like you only need parts of the XML files. XML::Twig can parse only the elements that interest you and discard the rest, using the twig_roots parameter. XML::Simple is easier to get started with, though.
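
A minimal sketch of the twig_roots idea, assuming the element layout from the question's sample file A. The sub name collect_descs and the inline sample document are made up for illustration; for a real file, parsefile would take the place of parse.

```perl
use strict;
use warnings;
use XML::Twig;

# Collect every <desc> found directly under <npc_client> into a hash.
# With twig_roots, only the matching elements are built; the rest of the
# document is skipped.
sub collect_descs {
    my ($xml) = @_;
    my %wanted;
    my $twig = XML::Twig->new(
        twig_roots => {
            'npc_client/desc' => sub {
                my ($t, $elt) = @_;
                $wanted{ lc $elt->text } = 1;
                $t->purge;    # release what has been parsed so far
            },
        },
    );
    $twig->parse($xml);
    return \%wanted;
}

# Hypothetical, cut-down stand-in for file A.
my $sample = <<'XML';
<npc_clients>
  <npc_client>
    <id>200000</id>
    <name>SkillZone</name>
    <desc>STR_NPC_NO_NAME</desc>
  </npc_client>
</npc_clients>
XML

my $descs = collect_descs($sample);
print join(', ', sort keys %$descs), "\n";
```

The keys are lowercased so that later lookups against file B can be case-insensitive, as the regex match in the question is.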

+1

Øyvind Skaar Nov 15 '10 at 10:41

Although I cannot help you with the specifics of your Perl code, there are some general guidelines for working with large amounts of XML data. There are, broadly, two kinds of XML APIs: DOM-based and stream-based. A DOM-based API (such as XML DOM) parses the entire XML document into memory before the user-level API becomes accessible, whereas with a stream-based API (such as SAX) the implementation does not need to parse the whole XML document up front. One advantage of stream-based parsers is that they usually use much less memory, since they do not need to hold the entire XML document in memory at once, which is obviously good when it comes to large XML documents. Looking at the XML::Simple docs, it seems that SAX support may be available. Have you tried it?
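
To make the DOM versus stream distinction concrete, here is a rough sketch of event-driven parsing. It uses XML::Parser (the expat binding that XML::Twig is built on) as a stand-in for a SAX-style API, not SAX itself, and it assumes the layout of the question's sample file B. The helper name index_strings and the inline sample are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Stream through <string> records, keeping only lowercased name => id pairs.
# No document tree is ever built; state lives in a few small variables.
sub index_strings {
    my ($xml) = @_;
    my (%id_for, %record, $field);
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $tag) = @_;
                if ($tag eq 'string') {
                    %record = ();          # a new record begins
                }
                elsif ($tag eq 'id' || $tag eq 'name') {
                    $field = $tag;         # start capturing text for this field
                    $record{$tag} = '';
                }
            },
            Char => sub {
                my ($expat, $text) = @_;
                # Character data can arrive in several chunks, so append.
                $record{$field} .= $text if defined $field;
            },
            End => sub {
                my ($expat, $tag) = @_;
                undef $field if defined $field && $tag eq $field;
                if ($tag eq 'string' && defined $record{name} && defined $record{id}) {
                    $id_for{ lc $record{name} } = $record{id};
                }
            },
        },
    );
    $parser->parse($xml);
    return \%id_for;
}

# Hypothetical, cut-down stand-in for file B.
my $sample = <<'XML';
<strings>
  <string>
    <id>350000</id>
    <name>STR_NPC_NO_NAME</name>
    <body> </body>
  </string>
</strings>
XML

my $ids = index_strings($sample);
print "$_ => $ids->{$_}\n" for sort keys %$ids;
```

Memory use here grows with the size of the resulting hash rather than with the size of the document, which is the property the paragraph above describes.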

0

Robin Nov 14 '10 at 15:50

I'm not a Perl person, so take this with a grain of salt, but I see two problems:

Iterating over all the values in file B until you find the matching one, for every element in file A, is inefficient. Instead, you should use some kind of map / dictionary for the values in file B.
It looks like you are parsing both files into memory before you start processing. File A is best treated as a stream rather than loading the entire document into memory.
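
The first point could look roughly like this in Perl, where the map / dictionary is a hash. This is only a sketch: the records are hypothetical in-memory stand-ins for the parsed contents of file B (the second entry is invented), and the helper names build_index and name_id_for are made up. Note that it does an exact, case-insensitive match, whereas the question's regex also matches substrings.

```perl
use strict;
use warnings;

# Hypothetical records standing in for the parsed <string> entries of file B.
my @strings = (
    { id => 350000, name => 'STR_NPC_NO_NAME' },
    { id => 350001, name => 'STR_NPC_WORM' },      # invented example entry
);

# Build the index once: lowercased name => id.
sub build_index {
    my ($strings) = @_;
    my %id_for;
    for my $s (@$strings) {
        next unless defined $s->{name};
        # Keep the first id seen for a name, like the `last` in the original loop.
        $id_for{ lc $s->{name} } //= $s->{id};
    }
    return \%id_for;
}

# Look up one desc value; returns '' when desc is missing or has no match.
sub name_id_for {
    my ($index, $desc) = @_;
    return '' unless defined $desc;
    return $index->{ lc $desc } // '';
}

my $index = build_index(\@strings);
print name_id_for($index, 'str_npc_no_name'), "\n";    # prints 350000
```

Building the hash takes one pass over file B; after that, each entry of file A costs a single hash lookup instead of a scan over every string.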

0

Jeff knecht Nov 14 '10 at 15:51

bvr · Accepted Answer · 2010-11-15T11:55:31+0000

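Here is an example using XML::Twig. Its main advantage is that it does not have to hold the whole file in memory (each npc_client element is purged once it has been handled), and the strings from file B go into a hash, so each lookup no longer scans file B. Together this makes processing much faster. The code below tries to emulate the script from the question.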

    use strict;
    use XML::Twig;

    # Pass 1: index file B, mapping each lowercased <name> to its <id>.
    my %strings = ();
    XML::Twig->new(
        twig_handlers => {
            'strings/string' => sub {
                $strings{ lc $_->first_child('name')->text } = $_->first_child('id')->text
            },
        }
    )->parsefile('B.xml');

    print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";

    # Pass 2: stream file A, printing one <npc_template> per <npc_client>.
    XML::Twig->new(
        twig_handlers => {
            'npc_client' => sub {
                # eval guards against a missing child element (the value is then undef)
                my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };

                # calculate rank as needed
                my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
                my $rank = $hpgauge_level >= 28 ? 'ERROR'
                         : $hpgauge_level >  25 ? 'LEGENDARY'
                         : $hpgauge_level >  21 ? 'HERO'
                         : $hpgauge_level >  10 ? 'ELITE'
                         : $hpgauge_level >   0 ? 'NORMAL'
                         :                        $hpgauge_level;

                my $npc_id    = eval { $_->first_child('id')->text };
                my $name      = eval { $_->first_child('name')->text };
                my $tribe     = eval { $_->first_child('tribe')->text };
                my $scale     = eval { $_->first_child('scale')->text };
                my $race_type = eval { $_->first_child('race_type')->text };

                print qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;

                $_->purge;    # free the elements parsed so far
            }
        }
    )->parsefile('A.xml');

    print "</npc_templates>\n";
