Sorting the lines of a massive file by the number of words on each line (ideally in parallel)

I am working on a community detection algorithm for analyzing social network data from Facebook. The first step, which detects all cliques in the graph, can be done efficiently in parallel and leaves me with output like the following:

    17118 17136 17392
    17064 17093 17376
    17118 17136 17356 17318 12345
    17118 17136 17356 17283
    17007 17059 17116

Each of these lines represents a unique clique (a set of node ids), and I want to sort these lines in descending order by the number of ids per line. For the example above, the output should look like this:

    17118 17136 17356 17318 12345
    17118 17136 17356 17283
    17118 17136 17392
    17064 17093 17376
    17007 17059 17116

(Ties, i.e. lines with the same number of ids, can be sorted arbitrarily.)

What is the most efficient way to sort these lines?

Keep the following points in mind:

  • The file I want to sort may be larger than the physical memory of the machine.
  • Most of the machines I run this on have several processors, so a parallel solution would be ideal.
  • The ideal solution would be just a shell script (probably using sort ), but I am open to simple solutions in Python or Perl (or any other language, as long as it keeps the task simple).
  • This task is in a sense very simple; I'm not looking for just any old solution, but for a simple and, above all, efficient one.

UPDATE 2: best solution

Based on benchmarking the proposed solutions (see below), here is the best one (adapted from Vlad, who in turn adapted it from the other solutions offered here). It is quite clever and does not even use sort.

    for FILE in infile.* ; do
      awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
        FILE=`basename $FILE` $FILE &
    done
    wait
    ls -1r tmpfile.* | xargs cat >outfile
    rm -f tmpfile.*

UPDATE 1: Benchmark results for the proposed solutions

For benchmarking, I took the cliques found in the Oklahoma Facebook network. The unsorted files containing these cliques look just like the first example above; they contain 46,362,546 lines in total, for a file size of 6.4 GB. The cliques are spread almost evenly across 8 files. The system I am testing on has 4 physical processors, each with 6 cores and a 12 MB L2 cache, for a total of 24 cores. It also has 128 GB of physical memory. Since the lines to be sorted were split into 8 files, most of these solutions used 8 (or 16) parallel processes.

Ignoring the naive first approach, I compared the last 5 suggestions from Vlad Romascanu (whose solution I chose).

The first solution was not very efficient:

    real    6m35.973s
    user    26m49.810s
    sys     2m14.080s

I also tried solutions 2, 3, and 4, which use FIFO files, but each of them used only one sort process and thus took a very long time (I killed them before they could finish).

The last solution was the fastest:

    real    1m3.272s
    user    1m21.540s
    sys     1m22.550s

Note that the user time for this solution is 1m21s, much better than the 26 minutes of the first solution.

+5
sorting unix shell
7 answers

A naive approach could be simply:

    awk '{ print NF " " $0 }' infile | sort -k1,1nr | awk '{ $1=""; print $0 }' >outfile

This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available; use the -S and -T switches to configure how much memory to use ( -S ) before resorting to temporary files in a temp directory ( -T ) on a large enough (and ideally fast) partition.
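
For example (just a sketch; the memory limit and temp directory below are placeholders you would adapt to your machine):

    # Same naive pipeline, but letting sort use up to 32 GiB of RAM before it
    # spills to temporary files on a large, fast scratch partition.
    awk '{ print NF " " $0 }' infile \
      | sort -k1,1nr -S 32G -T /scratch/sort-tmp \
      | awk '{ $1=""; print $0 }' >outfile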

If you can produce several input files by splitting up the work preceding the sort phase, you can then do:

    for FILE in infile.* ; do
      awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp &
    done
    wait
    sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.tmp

This will use up to N*2 CPUs; moreover, the last sort (the merge sort) is highly efficient.

To further improve parallelism to N*2+1, use FIFOs instead of intermediate files (again assuming multiple input files are possible):

    for FILE in infile.* ; do
      mkfifo $FILE.fifo
      awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo &
    done
    sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo

If multiple input files are not possible, you can simulate them (adding I/O overhead, which will hopefully be amortized by the number of processes available):

    PARALLELISM=5 # I want 5 parallel instances
    for N in `seq 0 $((PARALLELISM-1))` ; do    # 0..PARALLELISM-1 to match NR % PARALLELISM below
      mkfifo infile.$N.fifo
      awk 'NR % '$PARALLELISM' == '$N' { print NF " " $0 }' infile | sort -k1,1nr >infile.$N.fifo &
    done
    sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo

Since we split on the line number modulo $PARALLELISM, we have good locality, and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes close to zero.

Even better, read the input file only once and split the input lines across several sort pipes:

    PARALLELISM=5 # I want 5 parallel instances
    for N in `seq 0 $((PARALLELISM-1))` ; do    # 0..PARALLELISM-1 to match NR % PARALLELISM below
      mkfifo infile.$N.fifo1
      mkfifo infile.$N.fifo2
      sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2 &
    done
    awk '{ print NF " " $0 >("infile." NR % '$PARALLELISM' ".fifo1") }' infile &
    sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo[12]

You should measure the performance for various values of $PARALLELISM and then pick the optimal one.

EDIT

As shown in other posts, you can of course use cut instead of the final awk (i.e. the one that strips the first column) for potentially better performance. :)

EDIT2

Updated all scripts for the file naming convention you provided, and fixed a bug in the last version.

Also, using the new file naming convention, if I/O is not the bottleneck, then a very slight variation on dave's/niry's solutions should probably be even more efficient:

    for FILE in infile.* ; do
      awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
        FILE=`basename $FILE` $FILE &
    done
    wait
    ls -1r tmpfile.* | xargs cat >outfile
    rm -f tmpfile.*
+11

I wonder how fast it would be:

    #!/bin/sh
    rm -rf /tmp/fb
    mkdir /tmp/fb
    cd /tmp/fb
    awk '{ print $0 > NF }'
    ls | sort -nr | xargs cat

Doesn't use a lot of cores, however.
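
For instance, assuming the script above is saved as buckets.sh (a name made up here for illustration), it reads the unsorted lines on stdin and writes the result to stdout:

    sh buckets.sh <infile >outfile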

+5

Since you don't really need to sort, just copy into buckets, you can split the file by the number of tokens; this will be the fastest:

    # Write each input line (from stdin or a file argument) to a bucket file
    # named after its zero-padded token count.
    perl -ne '@_=split/\s+/;$t=$#_+1;open $f[$t], sprintf(">%09d",$t) if $f[$t] eq "";$f=$f[$t];print $f $_;'
    # Concatenate the buckets in descending order of token count.
    cat `ls -1r 0*`

BTW, the disk will be the bottleneck; the number of cores doesn't matter much.

+1

For the record, since version 8.6 (2010) GNU coreutils sort supports multi-threaded sorting. By default, I believe it uses the number of cores as the number of threads, but you can specify a different number with

sort <file> --parallel=<N>
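
For example, an untested sketch that combines this with the word-count prefix approach from the accepted answer (the thread count below is a placeholder matching the 24-core machine mentioned in the question):

    # Decorate each line with its word count, sort numerically (descending) using
    # GNU sort's built-in multi-threading (coreutils >= 8.6), then strip the count.
    awk '{ print NF " " $0 }' infile \
      | sort -k1,1nr --parallel=24 \
      | cut -d' ' -f2- >outfile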

+1
 awk '{print length,$0}' test.txt | sort -nr | cut -d" " -f2- 

Not sure how well this will perform, though sort can work within memory limitations, AFAIK.

0

To create something efficient, I would do something like the following: a two-pass parse of the file.

In the first pass, read line by line, recording three things: line number, file offset, and word count. This can be parallelized without too much difficulty (for the jobs starting at "random" lines within the file, just add the corresponding start numbers afterwards).

Now sort the list of recorded triples by the number of words per line. Then iterate over the list, seeking to the corresponding start offset and writing out that line.

Performance-wise, all the seeking might be slow, but it should be relatively light on memory consumption, requiring only three numbers per line.
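
A minimal shell sketch of this idea (untested; it assumes plain ASCII input and uses made-up file names such as index.txt). Spawning tail/head once per line would be far too slow for 46 million lines, so a real implementation would do the seeking in a single Perl or Python process, but it illustrates the structure of the two passes:

    # Pass 1: for each line record "wordcount byte-offset byte-length" (newline included).
    awk 'BEGIN { off = 0 } { len = length($0) + 1; print NF, off, len; off += len }' infile >index.txt

    # Sort the index by word count, descending (ties in arbitrary order).
    sort -k1,1nr index.txt >index.sorted

    # Pass 2: seek to each recorded offset and copy that line to the output.
    while read -r nf off len ; do
      tail -c +$((off + 1)) infile | head -c "$len"
    done <index.sorted >outfile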

0

Not sure if I understood the question correctly, but I think an approach similar to quicksort might help:

    10 split the file into N subfiles, one for each core/cpu
    20 sort each partial file using the solutions suggested in some of the answers here
    30 once every file is split and sorted, get the first line from each file and put it into a temporary file
    40 get the second line from each file, put them in a second temporary file
    50 repeat until you have number of temp files == number of cores
    60 GOTO 20

Depending on the number of passes, you should approach a perfectly sorted file.

Note that this is not an ideal solution. However, even after just a few passes it should give you a fairly well-sorted list of the longest lines in the first temporary file (I am assuming a Gaussian distribution of line lengths in the original long file).

ps: if the partial files are still larger than the available memory, split them again until they fit (depending on the sorting algorithm you use for each file, though). But in that case you need to double the number of passes to get a reasonable approximation.

ps2: I also assume that you are not interested in a perfectly sorted file, but rather in the statistical significance of the data (i.e., how long the long lines are on average, etc.).

0
