How can I set the size of the buffer Perl uses when reading files, to optimize it for large files?

I understand that both Java and Perl try fairly hard to pick a one-size-fits-all default buffer size when reading in files, but I find their choices increasingly antiquated, and I have a problem changing the default choice when it comes to Perl.

In the case of Perl, which I believe uses 8K buffers by default, similar to Java's choice, I can't find a reference using the perldoc site search engine (really Google) on how to increase the default file input buffer size to, say, 64K.

From the above link, to demonstrate why 8K buffers don't scale:

If lines typically are about 60 characters long, then a 10,000-line file has about 610,000 characters in it. Reading the file line by line with buffering only requires 75 system calls and 75 waits for the disk, instead of 10,001.

So for a file with 50,000,000 lines of 60 characters each (including the newline at the end), an 8K buffer means 366,211 system calls to read the 2.8 GiB file. As an aside, you can confirm this behaviour by looking at the read I/O delta in the task manager process list (on Windows at least; top on *nix shows the same thing somehow too, I'm sure) as your Perl program takes 10 minutes to read in a text file :)
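To make the arithmetic explicit, here is a back-of-the-envelope sketch in plain Perl, using the numbers from the paragraph above:

#!/usr/bin/perl
use strict;
use warnings;

my $lines     = 50_000_000;
my $line_len  = 60;                  # includes the trailing newline
my $file_size = $lines * $line_len;  # 3_000_000_000 bytes, about 2.8 GiB
my $bufsize   = 8 * 1024;            # the presumed 8K default buffer

# One read() per full buffer, plus one for the final partial buffer.
my $syscalls = int($file_size / $bufsize) + ($file_size % $bufsize ? 1 : 0);
print "$file_size bytes / $bufsize-byte buffer = $syscalls system calls\n";
# prints: 3000000000 bytes / 8192-byte buffer = 366211 system calls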

Someone asked a question about increasing the Perl input buffer size at perlmonks, and someone answered there that you can increase the size of $/, and thus increase the size of the buffer; however, from the perldoc:

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.

Therefore, I assume that this does not actually increase the size of the buffer Perl uses to read ahead from disk when using the typical:

while (<>) {
    # do something with $_ here
    ...
}

line-by-line idiom.

Now it could be that a different "read a record at a time, then parse it into lines" version of the above code would be faster in general and bypass the underlying problem with the standard idiom of not being able to change the default buffer size (if that is indeed impossible): you could set the "record size" to anything you wanted and then parse each record into individual lines, hoping that Perl ends up doing one system call per record (a sketch of that workaround follows below). But that adds complexity, and all I really want is an easy performance gain from increasing the buffer used in the above example to a reasonably large size, say 64K, or even tuning that buffer size to the optimal value for long reads using a test script on my system, without extra hassle.
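For what it's worth, a minimal sketch of that record-at-a-time workaround might look like this (assuming a handle $fh opened elsewhere; it relies on the $/ record-reading behaviour quoted above and carries partial lines across record boundaries):

local $/ = \65536;    # read up to 64K "records" instead of lines

my $tail = '';        # partial line carried over from the previous record
while ( my $record = <$fh> ) {
    $record = $tail . $record;
    my @lines = split /\n/, $record, -1;  # -1 keeps a trailing empty field
    $tail = pop @lines;                   # incomplete final line (or '')
    for my $line (@lines) {
        # do something with $line here
    }
}
# a final line with no trailing newline ends up in $tail
if (length $tail) {
    # do something with $tail here
}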

Things are much better in Java, since increasing the buffer size is directly supported.

In Java, I believe the current default buffer size that java.io.BufferedReader uses is also 8192 bytes, although up-to-date references in the JDK docs are equivocal; for example, the 1.5 docs say only:

The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.

Luckily with Java you do not have to trust the JDK developers to have made the right decision for your application, and can set your own buffer size (64K in this example):

import java.io.BufferedReader;
[...]
reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"), 65536);
[...]
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break;
    }
    /* do something with the line here */
    foo(line);
}

There is only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware, and I'm sure there are ways to get every ounce of performance out of reading a file by reading big multi-line records and breaking each into tokens, then doing things with those tokens once per record, but that adds complexity and edge cases (although if there is an elegant solution in pure Java, using only the features present in JDK 1.5, that would be cool to know about). Increasing the buffer size in Perl would solve 80% of the performance problem for Perl, at least, while keeping things straightforward.

My question is:

Is there a way to set that buffer size in Perl for the typical "line at a time" idiom shown above, similar to how the buffer size was increased in the Java example?

+5
java performance file-io perl
3 answers

You can affect the buffering, assuming you are running an O/S that supports setvbuf; see the documentation for IO::Handle. You don't have to explicitly create an IO::Handle object as in the documentation if you are running perl 5.10; all handles are implicitly IO::Handles since that release.

use 5.010;
use strict;
use warnings;
use autodie;
use IO::Handle '_IOLBF';

open my $handle, '<:utf8', 'foo';

my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);

while ( my $line = <$handle> ) {
    ...
}
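One caveat: per the IO::Handle documentation, setvbuf is only available when the underlying C library supports it, so a defensive version might guard the call (a sketch, not part of the original answer):

# setvbuf may be missing on this build; fall back to default buffering.
eval { $handle->setvbuf($buffer, _IOLBF, 0x10000); 1 }
    or warn "setvbuf unavailable, using default buffering: $@";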
+6

No, there isn't (short of recompiling a modified perl), but you can read the whole file into memory, then work line by line from that:

use File::Slurp;

my $buffer = read_file("filename");
open my $in_handle, "<", \$buffer;

while ( my $line = readline($in_handle) ) {
}
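The same idea works without the File::Slurp dependency, using core Perl only (a sketch; "filename" is a placeholder):

# Slurp the whole file by locally undefining the record separator.
my $buffer = do {
    open my $fh, '<', 'filename' or die "could not open filename: $!";
    local $/;    # slurp mode: <$fh> returns the entire file
    <$fh>;
};

# Then read the in-memory copy line by line as before.
open my $in_handle, '<', \$buffer or die "could not open in-memory handle: $!";
while ( my $line = readline($in_handle) ) {
    # process $line
}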

Note that perl before 5.10 used stdio buffers by default in most places (but often cheated and accessed the buffers directly rather than going through the stdio library), while 5.10 and later default to perl's own perlio layer system. The latter seems to use a 4k buffer by default, but writing a layer that lets you configure this should be trivial (once you figure out how to write a layer: see perldoc perliol).
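As a rough illustration of the layer idea, here is an untested sketch using the pure-Perl PerlIO::via hook rather than a C-level layer from perliol (the layer name BigBuf is made up). A read layer's FILL method decides how much it pulls from the layer below, so stacking it on the unbuffered :unix layer should turn each fill into a single large read(2):

package PerlIO::via::BigBuf;
use strict;
use warnings;

my $BUFSIZE = 64 * 1024;    # request 64K from the layer below per fill

sub PUSHED {
    my ($class) = @_;
    return bless {}, $class;
}

sub FILL {
    my ($self, $fh) = @_;
    my $n = read($fh, my $buf, $BUFSIZE);  # read from the layer below
    return $n ? $buf : undef;              # undef signals EOF
}

package main;

# Stack the layer over :unix so each 64K request becomes one read(2).
open my $fh, '<:unix:via(BigBuf)', 'foo' or die "open: $!";
while ( my $line = <$fh> ) {
    # process $line as usual
}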

+2

A warning: the following code has only been lightly tested. The code below is a first cut of a function that lets you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments:

  • an open filehandle (defaults to STDIN)
  • a buffer size (defaults to 4k)
  • a reference to a variable to store the line in (defaults to $_)
  • an anonymous subroutine to call for each line (the default prints the line)

The arguments are positional, with the exception that an anonymous subroutine can always be the last argument. The lines are automatically chomped.

Probable bugs:

  • may not work on systems where newline is not the end-of-line character
  • will most likely not work with lexical $_ (introduced in Perl 5.10)

I can see from strace that it reads the file with the specified buffer size. If I like how the testing goes, you may see this on CPAN soon.

#!/usr/bin/perl

use strict;
use warnings;

use Scalar::Util qw/reftype/;
use Carp;

sub line_by_line {
    local $_;

    # Positional defaults; @args holds a reference to each variable so
    # the argument-processing loop below can overwrite them in order.
    my @args = \(
        my $fh      = \*STDIN,
        my $bufsize = 4 * 1024,
        my $ref     = \$_,
        my $coderef = sub { print "$_\n" },
    );
    croak "bad number of arguments" if @_ > @args;

    for my $arg_val (@_) {
        # An anonymous sub may always be passed as the last argument.
        # (The defined check avoids a warning when reftype returns undef
        # for non-reference arguments such as the buffer size.)
        if (defined reftype $arg_val and reftype $arg_val eq "CODE") {
            ${ $args[-1] } = $arg_val;
            last;
        }
        my $arg = shift @args;
        $$arg = $arg_val;
    }

    my $buf;
    my $overflow = '';
    OUTER: while (sysread $fh, $buf, $bufsize) {
        # Split on newlines but keep the separators, so we can tell
        # whether the buffer ended on a complete line.
        my @lines = split /(\n)/, $buf;
        while (@lines) {
            my $line = $overflow . shift @lines;
            unless (defined $lines[0]) {
                # Buffer ended mid-line; carry the fragment forward.
                $overflow = $line;
                next OUTER;
            }
            $overflow = shift @lines;
            if ($overflow eq "\n") {
                $overflow = "";
            }
            else {
                next OUTER;
            }
            $$ref = $line;
            $coderef->();
        }
    }

    # Handle a final line with no trailing newline.
    if (length $overflow) {
        $$ref = $overflow;
        $coderef->();
    }
}

# Demo: count the lines of this script that contain "lines".
my $bufsize = shift || 4 * 1024;    # default if no size given on the command line

open my $fh, "<", $0 or die "could not open $0: $!";

my $count = 0;
line_by_line $fh, $bufsize, sub { $count++ if /lines/ };

print "$count\n";
+1
