What standard commands can I use to efficiently print only the first few lines of sorted output on the command line?

I basically want the equivalent of

... | sort -arg1 -arg2 -... | head -n $k 

but I understand that sorting will take O(n log n) over the entire input. In my case I'm dealing with a lot of data, so runtime matters to me, and I have a habit of overflowing my /tmp folder with sort's temporary files.

I would prefer something that runs in O(n log k), for example using a heap, which would presumably be faster and would also reduce the working-set memory to k.
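(To make the desired behavior concrete: Python's heapq module already does this kind of thing. A minimal sketch, assuming plain lexicographic ordering of whole lines and an illustrative k of 10, would be something like the following; this is the effect I'd like from a standard tool.)

 import sys
 import heapq

 k = 10
 # keep only the k smallest lines seen so far; memory stays O(k)
 for line in heapq.nsmallest(k, sys.stdin):
     sys.stdout.write(line)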

Is there any combination of standard command-line tools that can do this efficiently, without my having to write code myself? Ideally it would support the full expressive power of sort's key options. sort (at least on Ubuntu) doesn't seem to have a documented option for this...

3 answers

Based on the discussion above, and after poking around a little more, I would say the official answer to my question is "there is no solution." You can use specialized tools, you can use the tools you have with their current performance, or you can write your own tool.

I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anyone doing something similar to what I'm doing, here is what I wrote for myself. Not the best Python, and a very sketchy benchmark: I offer it for anyone who cares to provide a more rigorous one:

  • 256 files, about 1.6 gigabytes total, all sitting on SSD, lines separated by \n, lines of the format [^\t]*\t[0-9]+
  • Ubuntu 10.4, 6 cores, 8 GB of RAM, /tmp on SSD.
  • $ time sort -t^v<tab> -k2,2n foo* | tail -10000 (^v<tab> is a literal tab entered with Ctrl-V Tab)
    • real 7m26.444s
    • user 7m19.790s
    • sys 0m17.530s
  • $ time python test.py 10000 foo*
    • real 1m29.935s
    • user 1m28.640s
    • sys 0m1.220s
  • Checking with diff, the two methods differ on tie-breaks, but otherwise the sort order is the same.

test.py:

 #!/usr/bin/env python
 # test.py
 from sys import argv
 import heapq
 from itertools import chain

 # parse N - the size of the heap, and confirm we can open all input files
 N = int(argv[1])
 streams = [open(f, "r") for f in argv[2:]]

 def line_iterator_to_tuple_iterator(line_i):
     # turn each "key\tcount" line into an (int(count), key) tuple
     for line in line_i:
         s, c = line.split("\t")
         c = int(c)
         yield (c, s)

 # use heap to process inputs: keep only the N largest tuples while streaming
 rez = heapq.nlargest(N,
                      line_iterator_to_tuple_iterator(chain(*streams)),
                      key=lambda x: x[0])

 for r in rez:
     print "%s\t%s" % (r[1], r[0])

 for s in streams:
     s.close()
+2

UNIX/Linux provides a generalist toolset. For large datasets it does a lot of I/O. It will do everything you could want, but slowly. If we had a better idea of the input data it would help a lot.

IMO you have several choices, none of which you will really like:

  • Do a radix-style pre-sort: for example, use awk to write all the lines whose keys start with "A" into one file, "B" into another, and so on. Or, if you only care about "P", "D", and "Q", use awk to pull out just what you want. Then do a full sort on that small subset. This creates up to 26 files named A, B, ..., Z:

    awk '{ print $0 > substr($0,1,1) }' bigfile; sort [options here] P D Q > result

  • Spend $$: buy CoSort from iri.com, for example, or any other commercial sorting software. These sorts use all kinds of optimizations, but they are not free the way bash is. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude (5000 IOPS up to 75000 IOPS). Set the TMPDIR environment variable so your temporary files land on the SSD, and read and write only on the SSD, while still using the existing UNIX toolset.

  • Use software like R or Stata, or preferably a database; all of these are designed for large datasets.

  • Do what you are doing now, and watch YouTube while the UNIX sort runs.

IMO, when you need fast results on big datasets, you are using the wrong tools.

+1

Here's a crude partial solution:

 #!/usr/bin/perl

 use strict;
 use warnings;

 my @lines = ();

 while (<>) {
     push @lines, $_;
     @lines = sort @lines;
     if (scalar @lines > 10) {
         pop @lines;
     }
 }

 print @lines;

It reads the input only once, continuously maintaining a sorted array of the top 10 lines.

Sorting the entire array on every iteration is inefficient, of course, but I'd guess that for gigabyte-scale input it will still be substantially faster than sort huge-file | head.

Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit harder, though I wouldn't be surprised if there's something on CPAN that would help with that.

More abstractly, one way to get just the first N sorted elements of a large array is to use a partial Quicksort, where you only recurse into the right-hand partition when you actually need it. That requires holding the entire array in memory, though, which is probably impractical in your case.
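A minimal sketch of that idea in Python (a hypothetical partial_sort helper of my own, assuming in-memory data and the default < comparison):

 import random

 def partial_sort(a, n, lo=0, hi=None):
     """Rearrange a[lo:hi] in place so that a[lo:lo+n] holds the
     n smallest elements of that slice, in sorted order."""
     if hi is None:
         hi = len(a)
     if hi - lo <= 1:
         return
     # partition around a randomly chosen pivot (plain Lomuto scheme)
     p = random.randrange(lo, hi)
     a[p], a[hi - 1] = a[hi - 1], a[p]
     pivot = a[hi - 1]
     i = lo
     for j in range(lo, hi - 1):
         if a[j] < pivot:
             a[i], a[j] = a[j], a[i]
             i += 1
     a[i], a[hi - 1] = a[hi - 1], a[i]
     # always handle the left partition; only descend into the right
     # one when the first n slots extend past the pivot's position
     partial_sort(a, n, lo, i)
     if lo + n > i + 1:
         partial_sort(a, lo + n - (i + 1), i + 1, hi)

 data = [9, 2, 7, 1, 8, 3, 6, 5, 4, 0]
 partial_sort(data, 3)
 print(data[:3])   # -> [0, 1, 2]

The saving relative to a full sort is that the right-hand partition is skipped entirely whenever the wanted prefix already fits to the left of the pivot.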

You could break the input into moderately sized pieces, apply some clever algorithm to get the top N lines of each piece, concatenate those results, and then apply the same algorithm to the combined output. Depending on the piece size, sort ... | head might be clever enough. It shouldn't be hard to throw together a shell script using split -l ... to do this.

(Insert as much hand-waving here as needed.)
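For what it's worth, a rough sketch of that chunked scheme in Python (my own hypothetical illustration; the chunk size and the plain lexicographic line order are assumptions):

 import sys

 def top_n(lines, n):
     """Top n lines of an iterable, in sorted (lexicographic) order."""
     return sorted(lines)[:n]

 def top_n_chunked(stream, n, chunk_size=100000):
     """Apply top_n to each chunk_size-line piece of the stream,
     then apply it again to the combined per-chunk results."""
     candidates = []
     chunk = []
     for line in stream:
         chunk.append(line)
         if len(chunk) >= chunk_size:
             candidates.extend(top_n(chunk, n))
             chunk = []
     if chunk:
         candidates.extend(top_n(chunk, n))
     return top_n(candidates, n)

 if __name__ == "__main__":
     n = int(sys.argv[1])    # number of lines to keep
     sys.stdout.writelines(top_n_chunked(sys.stdin, n))

Run as something like cat foo* | python topn.py 10000 (topn.py being a made-up filename); it only ever holds one chunk plus the accumulated per-chunk winners in memory.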

Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my approach was slower than sort ... | head.

0
