C-like arrays in Perl

I want to create and manipulate large arrays of (4-byte) integers in memory. By large, I mean on the order of hundreds of millions of elements. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory and have fast (O(1)) access to elements. The thing I'm counting is not a sparse feature, so I can't use a sparse array.

I can't do this with a regular Perl list, because Perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are too big. I've tried storing the data on disk via SQLite and hash tying, and although they work, they are very slow, especially on ordinary drives. (It works reasonably well when I run on a 4-drive RAID 0.)
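To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The 250 million positions figure is an assumed round number for illustration, not a value from the question; the 64 and 4 bytes per element are the figures quoted above.

```perl
use strict;
use warnings;

# Footprint in GiB for $n elements of $bytes bytes each.
sub footprint_gib {
    my ($n, $bytes) = @_;
    return $n * $bytes / 2**30;
}

# Hypothetical genome size: 250 million positions (assumed round figure).
my $positions = 250_000_000;

# 64 bytes is the per-element cost quoted for a Perl list on the asker's
# machine; 4 bytes is the size of one packed 32-bit integer.
printf "perl list:    %.2f GiB\n", footprint_gib($positions, 64);
printf "packed int32: %.2f GiB\n", footprint_gib($positions, 4);
```

Under these assumptions the Perl list needs roughly 15 GiB, while a packed array of 4-byte integers stays under 1 GiB.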

I thought I could use PDL arrays, because PDL stores its arrays the same way C does, using only 4 bytes per element. However, I found the update speed to be painfully slow compared to Perl lists:

    use PDL;
    use Benchmark qw/cmpthese/;

    my $N    = 1_000_000;
    my @perl = (0 .. $N - 1);
    my $pdl  = zeroes $N;

    cmpthese(-1, {
        perl => sub {
            $perl[int(rand($N))]++;
        },
        pdl => sub {
            # note that I'm not even incrementing here, just setting to 1
            $pdl->set(int(rand($N)), 1);
        },
    });

Returns:

              Rate  pdl perl
    pdl   481208/s   -- -87%
    perl 3640889/s 657%   --

Does anyone know how to increase the performance of PDL's set(), or know of another module that can accomplish this?

+8
arrays perl pdl
7 answers

I can't say what kind of performance you will get, but I recommend using the vec function (documented in perlfunc) to split a string into bit fields. I experimented and found that my Perl will tolerate a string of up to 500_000_000 characters, which corresponds to 125,000,000 32-bit values.

    my $data = "\0" x 500_000_000;
    vec($data, 0, 32)++;            # Increment data[0]
    vec($data, 100_000_000, 32)++;  # Increment data[100_000_000]

If this is not enough, there may be something in the build of Perl that controls the limit. Alternatively, if you think you can get by with smaller fields (say, 16-bit counts), vec will accept a field width of any power of 2 from 1 to 32.
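If you do drop to 16-bit fields, each counter can hold at most 65535, so it is worth guarding the increment. The following is a minimal sketch of one way to do that; the `bump` helper and the counter count of 1,000 are hypothetical names and values chosen for illustration.

```perl
use strict;
use warnings;

my $BITS = 16;             # field width; vec requires a power of 2
my $MAX  = 2**$BITS - 1;   # largest value a 16-bit field can hold
my $N    = 1_000;          # hypothetical number of counters

# Pre-size the buffer: $N fields of $BITS/8 bytes each, all zero.
my $counts = "\0" x ($N * $BITS / 8);

# Increment counter $i, stopping at $MAX instead of overflowing the field.
sub bump {
    my ($i) = @_;
    my $v = vec($counts, $i, $BITS);
    vec($counts, $i, $BITS) = $v + 1 if $v < $MAX;
    return;
}

bump(42) for 1 .. 70_000;
print vec($counts, 42, $BITS), "\n";   # saturated at 65535
print length($counts), "\n";           # still 2000 bytes
```

Saturating is only one possible policy; depending on the data you might prefer to die or to log when a counter reaches its maximum.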

Edit: I believe the string size limit is related to the 2 GB maximum private working set of 32-bit Windows processes. If you are running Linux or have a 64-bit perl, you may be luckier than I was.


I added a vec case to your benchmark program, like this:

    my $vec = "\0" x ($N * 4);

    cmpthese(-3, {
        perl => sub {
            $perl[int(rand($N))]++;
        },
        pdl => sub {
            # note that I'm not even incrementing here, just setting to 1
            $pdl->set(int(rand($N)), 1);
        },
        vec => sub {
            vec($vec, int(rand($N)), 32)++;
        },
    });

giving these results:

              Rate  pdl  vec perl
    pdl   472429/s   -- -76% -85%
    vec  1993101/s 322%   -- -37%
    perl 3157570/s 568%  58%   --

So using vec runs at about two-thirds the speed of a native Perl array. Presumably that is acceptable.
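Applied to the original use case, the vec approach might look like the sketch below: one packed string of 32-bit counters per chromosome. The chromosome names, their lengths, and the `add_hit` helper are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical chromosome lengths; real ones would come from your assembly.
my %length_of = (chrA => 1_000, chrB => 500);

# One packed string of 32-bit counters per chromosome, all zero.
my %counts = map { $_ => "\0" x (4 * $length_of{$_}) } keys %length_of;

# Add one to every position in the 0-based, inclusive range.
sub add_hit {
    my ($chr, $start, $end) = @_;
    die "unknown chromosome $chr" unless exists $counts{$chr};
    # vec silently extends the string when assigned out of range,
    # so check the bounds explicitly.
    die "range out of bounds" if $start < 0 || $end >= $length_of{$chr};
    vec($counts{$chr}, $_, 32)++ for $start .. $end;
    return;
}

add_hit('chrA', 10, 19);
add_hit('chrA', 15, 24);
print vec($counts{chrA}, 17, 32), "\n";   # position covered by both ranges: 2
```

The explicit bounds check matters here because an out-of-range assignment through vec would otherwise grow the buffer instead of failing.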

+8

The PDL command you want is indadd. (Thanks to Chris Marshall, PDL Pumpking, for pointing this out elsewhere.)

PDL is designed for what I call "vectorized" operations. Compared to C operations, Perl operations are quite slow, so you want to keep the number of PDL method calls to a minimum and have each call do a lot of work. For example, this benchmark lets you specify the number of updates to perform in a single round (as a command-line parameter). The Perl side has to loop, but the PDL side only performs five or so function calls:

    use PDL;
    use Benchmark qw/cmpthese/;

    my $updates_per_round = shift || 1;

    my $N    = 1_000_000;
    my @perl = (0 .. $N - 1);
    my $pdl  = zeroes $N;

    cmpthese(-1, {
        perl => sub {
            $perl[int(rand($N))]++ for (1 .. $updates_per_round);
        },
        pdl => sub {
            my $to_update = long(random($updates_per_round) * $N);
            indadd(1, $to_update, $pdl);
        },
    });

When I run this with an argument of 1, I get even worse performance than with set, which is what I expected:

    $ perl script.pl 1
              Rate   pdl  perl
    pdl    21354/s    --  -98%
    perl 1061925/s 4873%    --

That's a lot of ground to make up! But hang in there. If we perform 100 updates per round, we get an improvement:

    $ perl script.pl 100
            Rate  pdl perl
    pdl  16906/s   -- -18%
    perl 20577/s  22%   --

And with 10,000 updates per round, PDL is about four times faster than Perl:

    $ perl script.pl 10000
          Rate perl  pdl
    perl 221/s   -- -75%
    pdl  881/s 298%   --

PDL continues to run about 4 times faster than regular Perl for even larger values.

Note that PDL's performance can degrade for more complex operations. This is because PDL will allocate and tear down large but temporary workspaces for intermediate operations. In that case, you may want to consider using Inline::Pdlpp. However, that is not a beginner's tool, so don't jump there until you have determined that it really is the best option for you.

Another alternative to all of this is to use Inline::C as follows:

    use PDL;
    use Benchmark qw/cmpthese/;

    my $updates_per_round = shift || 1;

    my $N    = 1_000_000;
    my @perl = (0 .. $N - 1);
    my $pdl  = zeroes $N;
    my $inline = pack "d*", @perl;   # packed doubles, updated in place from C
    my $max_PDL_per_round = 5_000;

    use Inline 'C';

    cmpthese(-1, {
        perl => sub {
            $perl[int(rand($N))]++ for (1 .. $updates_per_round);
        },
        pdl => sub {
            my $to_update = long(random($updates_per_round) * $N);
            indadd(1, $to_update, $pdl);
        },
        inline => sub {
            do_inline($inline, $updates_per_round, $N);
        },
    });

    __END__

    __C__

    void do_inline(char * packed_data, int N_updates, int N_data) {
        double * actual_data = (double *) packed_data;
        int i;
        for (i = 0; i < N_updates; i++) {
            int index = rand() % N_data;
            actual_data[index]++;
        }
    }

For me, the Inline function consistently beats both Perl and PDL. For largish values of $updates_per_round, say 1,000, the Inline::C version is about 5 times faster than pure Perl and between 1.2x and 2x faster than PDL. Even when $updates_per_round is just 1, where Perl beats PDL handily, the Inline code is 2.5 times faster than the Perl code.

If this is all you need to accomplish, I recommend using Inline::C.

But if you need to do a lot of manipulation of your data, you are best off sticking with PDL for its power, flexibility, and performance. See below for how you can use vec() with PDL data.

+7

PDL::set() and PDL::get() are intended more as a learning aid than anything else. They are the pessimal way to access PDL variables. You would be far better off using some of the built-in bulk access routines. The PDL constructor itself accepts Perl lists:

    $pdl = pdl(@list)

and is reasonably fast. You can also load your data directly from an ASCII file using PDL::rcols, or from a binary file using one of the many I/O routines. If you have the data as a packed string in machine order, you can even access the PDL memory directly:

    $pdl = PDL->new_from_specification(long, $elements);
    $dr  = $pdl->get_dataref;
    $$dr = get_my_packed_string();
    $pdl->upd_data;

Also note that you can "have your cake and eat it too": use PDL objects to hold your arrays of integers, use PDL computations (such as indadd) for large-scale manipulation of the data, and also use vec() directly on the PDL data as a string, which you can get via the get_dataref method:

    vec($$dr, int(rand($N)), 32);

You will need to bswap4 the data if you are on a little-endian system:

    $pdl->bswap4;
    $dr = $pdl->get_dataref;
    vec($$dr, int(rand($N)), 32)++;
    $pdl->upd_data;
    $pdl->bswap4;

Et voila!
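The reason for the byte swap is that vec always treats 32-bit fields as big-endian, regardless of the host. The core-Perl sketch below demonstrates this and shows a native-order alternative built on substr with pack/unpack 'l'. The helpers `get_native` and `inc_native` are hypothetical names; applying them to the string behind a PDL dataref is an assumption that has not been tested here and would require the piddle to be of type long.

```perl
use strict;
use warnings;

# vec stores 32-bit fields big-endian, whatever the host byte order is.
my $buf = "\0" x 8;
vec($buf, 1, 32) = 1;
print unpack('H*', $buf), "\n";   # 0000000000000001

# Native-order alternative: read and write one element with pack/unpack 'l'
# (signed 32-bit, host byte order), so no byte swap is needed.
sub get_native {
    my ($ref, $i) = @_;
    return unpack 'l', substr($$ref, 4 * $i, 4);
}

sub inc_native {
    my ($ref, $i) = @_;
    substr($$ref, 4 * $i, 4) = pack 'l', get_native($ref, $i) + 1;
    return;
}

my $native = pack 'l*', (0) x 4;               # four zeroed native-order integers
inc_native(\$native, 2) for 1 .. 3;
print join(',', unpack 'l*', $native), "\n";   # 0,0,3,0
```

The substr-based access does more work per update than vec, so whether it beats two bswap4 passes over the whole array depends on how many updates are made between PDL operations.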

+4

PDL wins when operations can be applied across whole arrays at once; it is apparently not optimized for random access and assignment. Perhaps someone with more PDL knowledge can help.

+2

Packed::Array on CPAN might help.

Packed::Array provides a packed signed integer array class. Arrays built using Packed::Array can only hold signed integers that match your platform-native integers, but they take only as much memory as is actually needed to hold those integers. So on 32-bit systems, rather than taking about 20 bytes per array entry, they take only 4.

+2

Since you are using integers, which should be fine for chromosome positions, try this:

    use PDL;
    use Benchmark qw/cmpthese/;

    my $N = 1_000_000;
    my @perl;
    @perl = (0 .. $N - 1);
    my $pdl;
    $pdl = (zeroes($N));

    cmpthese(-1, {
        perl => sub {
            $perl[int(rand($N))]++;
        },
        pdl2 => sub {
            # note that I'm not even incrementing here, just setting to 1
            $pdl->set(int(rand($N)), 1);
            $pdl2 = pack "w*", $pdl;
        },
    });

From this I got:

              Rate  pdl2  perl
    pdl2   46993/s    --  -97%
    perl 1641607/s 3393%    --

This is a big difference in performance from when I first ran the code without adding my 2 cents, where I got:

              Rate  pdl perl
    pdl   201972/s   -- -86%
    perl 1472123/s 629%   --

+2

My answer above may not be useful, but this may help you:

    use PDL;
    $x = sequence(45000, 45000);

Note that this will not work unless you have about 16 GB of RAM (45000 x 45000 elements at 8 bytes each) and set the following before creating the array:

    $PDL::BIGPDL = 1;

0