I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices increasingly antiquated, and I am having a problem changing the default when it comes to Perl.
In the case of Perl, which I believe uses 8K buffers by default, similar to Java's choice, I can't find a reference using the perldoc website search engine (really Google) on how to increase the default input buffer size to, say, 64K.
From the above link, to show how 8K buffers don't scale:
If lines typically have about 60 characters each, then a 10,000-line file has about 610,000 characters in it. Reading the file line by line with buffering requires only 75 system calls and 75 waits for the disk, instead of 10,001.
So, for a file with 50,000,000 lines of 60 characters each (including the newline at the end), an 8K buffer means 366,211 system calls to read the 2.8GiB file. As an aside, you can confirm this behavior by watching the I/O read delta in the Task Manager process list (on Windows at least; I'm sure top on *nix shows the same thing) as your Perl program takes 10 minutes to read in a text file :)
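The numbers above are easy to check with a few lines of Java. This is only a back-of-the-envelope sketch: it assumes every read fills the buffer completely, and the class and method names (ReadCallEstimate, readCalls) are made up for illustration.

```java
// Rough estimate of how many read calls a buffered reader issues, assuming
// every read fills the buffer completely (real reads may return less).
// The class and method names are illustrative only.
class ReadCallEstimate {

    // Number of buffer refills needed to cover totalBytes (ceiling division).
    static long readCalls(long totalBytes, long bufferBytes) {
        return (totalBytes + bufferBytes - 1) / bufferBytes;
    }

    public static void main(String[] args) {
        long small = 10000L * 61;     // 10,000 lines of ~60 chars plus a newline
        long large = 50000000L * 60;  // 50,000,000 lines, 60 bytes each incl. newline

        System.out.println(readCalls(small, 8192));   // 75
        System.out.println(readCalls(large, 8192));   // 366211
        System.out.println(readCalls(large, 65536));  // 45777
    }
}
```

Under the same assumption, a 64K buffer would cut the call count for the large file by a factor of eight, which is the kind of improvement I am after.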
Someone asked about increasing the Perl input buffer size on PerlMonks, and someone replied that you can increase the size of "$/" and thereby increase the buffer size; however, from perldoc:
Setting $/ to a reference to an integer, a scalar containing an integer, or a scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.
So I assume this does not actually increase the size of the buffer Perl uses to read ahead from the disk when using the typical:
while (<>) { ... }
line-by-line idiom.
Now, it could be that a different "read a record at a time, then parse it into lines" version of the above code would be faster in general. It would bypass the underlying problem with the standard idiom and with not being able to change the default buffer size (if that is indeed impossible), because you could set the "record size" to anything you wanted, parse each record into individual lines, and hope that Perl does the right thing and ends up making one system call per record. But that adds complexity. All I really want is an easy performance gain from increasing the buffer used in the above example to a reasonably large size, say 64K, or even tuning the buffer size to the optimal value for long reads using a test script on my system, without any extra hassle.
Things are much better in Java, which directly supports increasing the buffer size.
In Java, I believe the current default buffer size that java.io.BufferedReader uses is also 8192 bytes, although the up-to-date JDK docs are equivocal; for example, the 1.5 docs say only:
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
Luckily, with Java you do not have to trust that the JDK developers made the right decision for your application; you can set your own buffer size (64K in this example):
    import java.io.BufferedReader;
    [...]
    reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"), 65536);
    [...]
    while (true) {
        String line = reader.readLine();
        if (line == null) {
            break;
        }
        foo(line);
    }
There is only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware. I'm sure there are ways to get every ounce of performance out of reading a file by reading big multi-line records, breaking each into tokens, and then doing things with those tokens once per record, but they add complexity and edge cases (although if there is an elegant solution in pure Java, using only the features present in JDK 1.5, that would be great to know about). Increasing the buffer size in Perl would solve 80% of the performance problem for Perl, at least, while keeping things straightforward.
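For concreteness, here is roughly what I mean by reading big multi-line records and splitting them by hand. It is only a sketch: every name in it (ChunkedLineReader, LineHandler, forEachLine) is made up, it treats only '\n' as a line terminator, and I have tried to stick to JDK 1.5-era language features but have not measured whether it is any faster than a BufferedReader with a large buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of "read a big chunk, then split it into lines by hand".
// All names here are hypothetical, chosen for illustration.
class ChunkedLineReader {

    // Callback invoked once per line, without the trailing '\n'.
    interface LineHandler {
        void handle(String line);
    }

    // Asks the reader for up to chunkChars characters per read() call and
    // splits on '\n' only ("\r\n" handling is deliberately left out).
    // A partial line at the end of a chunk is carried into the next chunk.
    // Returns the number of lines passed to the handler.
    static long forEachLine(Reader in, int chunkChars, LineHandler handler)
            throws IOException {
        char[] chunk = new char[chunkChars];
        StringBuilder carry = new StringBuilder();
        long count = 0;
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            int start = 0;
            for (int i = 0; i < n; i++) {
                if (chunk[i] == '\n') {
                    carry.append(chunk, start, i - start);
                    handler.handle(carry.toString());
                    carry.setLength(0);
                    count++;
                    start = i + 1;
                }
            }
            // Keep whatever follows the last newline for the next chunk.
            carry.append(chunk, start, n - start);
        }
        if (carry.length() > 0) {
            // Final line that has no trailing newline.
            handler.handle(carry.toString());
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        final List<String> lines = new ArrayList<String>();
        // A tiny chunk size forces lines to straddle chunk boundaries.
        long n = forEachLine(new StringReader("alpha\nbeta\ngamma"), 4,
                new LineHandler() {
                    public void handle(String line) {
                        lines.add(line);
                    }
                });
        System.out.println(n + " " + lines);  // 3 [alpha, beta, gamma]
    }
}
```

With a real file, the Reader would be an InputStreamReader over a FileInputStream and the chunk size something like 65536. As far as I know, InputStreamReader does its own internal buffering, so the chunk size would not necessarily map one-to-one to system calls, which is part of why I would rather just set a buffer size.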
My question is:
Is there a way to adjust this buffer size in Perl for the typical line-by-line idiom shown above, similar to how the buffer size was increased in the Java example?