Here's a crude solution:
It reads the input only once, maintaining a sorted array of the top 10 lines as it goes.
Re-sorting the entire array on every insertion is inefficient, of course, but I expect that for gigabyte-sized input it will still be significantly faster than sort huge-file | head.
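The script itself isn't shown here, but a minimal sketch of the idea in Python (not the original; the function name and the choice of 10 as the default are mine) might look like this:

```python
import sys

def top_n(lines, n=10):
    """Stream through lines once, keeping a sorted list of the smallest n."""
    best = []
    for line in lines:
        best.append(line)
        best.sort()      # crude: re-sort the small array on every insertion
        del best[n:]     # keep only the first n entries
    return best

if __name__ == "__main__":
    for line in top_n(sys.stdin):
        sys.stdout.write(line)
```

The memory footprint stays at n lines regardless of input size, which is the whole point; only the small array is ever sorted.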
Adding an option to change the number of printed lines would be quite simple. Adding options to control the sorting would be a little trickier, although I wouldn't be surprised if there's something on CPAN that helps with that.
More abstractly, one approach to getting only the first N sorted items from a large array is a partial Quicksort, in which you simply don't sort any partition that cannot contribute to the first N items. That requires holding the entire array in memory, though, which is probably impractical in your case.
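A sketch of that partial Quicksort in Python (my own illustration, not code from the answer): the only change from ordinary Quicksort is that recursion skips any partition that starts at or beyond index n.

```python
import random

def partial_qsort(a, n, lo=0, hi=None):
    """Sort a in place only as far as needed for a[:n] to be fully sorted."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1 or lo >= n:
        return  # partition is trivial, or lies entirely beyond the first n items
    pivot = a[random.randrange(lo, hi)]
    i, j = lo, hi - 1
    while i <= j:                      # standard bidirectional partition
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    partial_qsort(a, n, lo, j + 1)     # left side may hold items below index n
    partial_qsort(a, n, i, hi)         # skipped by the lo >= n check if i >= n
```

On average this does O(len(a) + n log n) work instead of sorting everything, but as noted, the whole array still has to fit in memory.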
You can break the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate the results, and then apply the same algorithm to those. Depending on the chunk sizes, plain sort ... | head may well be clever enough. It's easy to throw together such a script using split -l ...
(Plus some more hand-waving as necessary.)
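The chunked approach is easy to mimic in Python, with heapq.nsmallest playing the role of sort ... | head for each chunk (the function name and the chunk_size default are my own choices, not from the answer):

```python
import heapq
import itertools

def top_n_chunked(lines, n=10, chunk_size=100_000):
    """Take the top n of each chunk, then the top n of the survivors."""
    it = iter(lines)
    survivors = []
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        # keep only this chunk's top n candidates
        survivors.extend(heapq.nsmallest(n, chunk))
    # second pass over the combined per-chunk winners
    return heapq.nsmallest(n, survivors)
```

At most chunk_size + n * ceil(total / chunk_size) items are ever in memory at once, which is the trade-off the paragraph above describes.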
Disclaimer: I just tried this on a much smaller file than the one you're working with (about 1.7 million lines), and my method was slower than sort ... | head.