A naive approach can be simple:
awk '{ print NF " " $0 }' infile | sort -k1,1nr | awk '{ $1=""; print $0 }' >outfile
This will use up to 3 CPUs. sort is not limited by the amount of physical memory available: use the -S and -T switches to configure how much memory to use (-S) before resorting to temporary files in a temp directory (-T) on a sufficiently large (and ideally fast) partition.
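For example (a sketch: the 1G memory cap and the temp directory are placeholders to adjust for your machine; the sample infile is only for illustration):

```shell
# Cap sort's in-memory buffer at 1G; spill files go to -T's directory
# (substitute a sufficiently large, fast partition for real workloads).
printf 'a b c\na\na b\n' > infile                     # tiny sample input
awk '{ print NF " " $0 }' infile \
  | sort -S 1G -T "${TMPDIR:-/tmp}" -k1,1nr \
  | awk '{ $1=""; print $0 }' > outfile
```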
If you can create several input files by dividing the work preceding the sorting phase, you can:
for FILE in infile.* ; do
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp&
done
wait
sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.tmp
This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is very efficient.
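If the data starts out as one big file, a simple way to produce such infile.* parts is to split it by line count (a sketch assuming GNU split for the -d numeric suffixes; the chunk size and sample input are arbitrary):

```shell
# Sample input only for illustration; replace with the real file.
seq 1 5 > infile
# Split into numbered chunks infile.00, infile.01, ... (GNU coreutils -d).
split -l 2 -d infile infile.
```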
Further, to improve parallelism to N*2+1, use FIFOs instead of intermediate files (again assuming several input files are possible):
for FILE in infile.* ; do
  mkfifo $FILE.fifo
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
If several input files are not possible, you can simulate them (adding I/O overhead, which we hope will be amortized by the number of processes available):
PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM-1))` ; do    # NR % PARALLELISM ranges over 0..PARALLELISM-1
  mkfifo infile.$N.fifo
  awk 'NR % '$PARALLELISM' == '$N' { print NF " " $0 }' infile | sort -k1,1nr >infile.$N.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
Since we select lines by line number modulo $PARALLELISM, we have good locality, and the file-system cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes close to zero.
Even better, read the input file only once and fan the input lines out to multiple sort instances:
PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM-1))` ; do    # NR % PARALLELISM ranges over 0..PARALLELISM-1
  mkfifo infile.$N.fifo1
  mkfifo infile.$N.fifo2
  sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2&
done
awk '{ print NF " " $0 >("infile." NR % '$PARALLELISM' ".fifo1") }' infile&
sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo[12]
You must measure the performance for the various values of $PARALLELISM, and then select the optimal one.
EDIT
As shown in other posts, you can of course use cut instead of the final awk (which removes the first column) for potentially better performance. :)
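For instance, with a single-space delimiter, cut drops the count field and, unlike the awk variant, leaves no leading blank:

```shell
# Drop the first space-separated field; fields 2 onward are kept verbatim.
printf '3 a b c\n2 a b\n' | cut -d' ' -f2-
```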
EDIT2
All scripts have been updated to the file naming convention you provided, and bugs have been fixed in the latest version.
Also, with the new file naming convention, if I/O is not the bottleneck, a very minor variation on dave's/niry's solutions should probably be even more efficient:
for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE&
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*
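The %05d zero-padding is what makes the reverse lexicographic order of ls -1r agree with descending field count; without it, a name containing field count 2 would sort after one containing 10. A quick check (tmpfile.*.part names are just illustrative):

```shell
# Zero-padded field counts sort correctly even as plain strings:
printf 'tmpfile.%05d.part\n' 2 10 | sort -r
# prints tmpfile.00010.part before tmpfile.00002.part
```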