Explicit parallel sorting using xargs - Incomplete results from xargs --max-procs

Context

I need to optimize deduplication using 'sort -u', but my Linux machine has an old implementation of the sort command (coreutils 5.97) that does not have the --parallel option. Although sort implements parallelizable algorithms (for example, merge sort), I have to make that parallelization explicit, so I do it manually with xargs. This approach is roughly 2.5x faster than a plain 'sort -u' ... when it works.

Here is the intuition of what I am doing.

I am running a bash script that splits the input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to xargs, which deduplicates them in parallel via the sortu.sh script (shown at the end). sortu.sh wraps the call to 'sort -u' and prints the name of the resulting file (for example, "sortu.sh file.txt.part1" prints "file.txt.part1.sorted"). The sorted parts are then passed to 'sort --merge -u', which combines and deduplicates them, relying on each part already being sorted.
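As a sequential sketch of that pipeline (the sample data and file names here are mine, for illustration; 'split -l' is available even on old coreutils):

```shell
# Sketch of the split / per-part sort / merge idea, run sequentially for clarity.
printf '%s\n' b a c a d b > file.txt

split -l 3 file.txt file.txt.part          # 3-line parts: file.txt.partaa, file.txt.partab
for part in file.txt.partaa file.txt.partab; do
  sort -u "$part" > "$part.sorted"         # deduplicate each part independently
done
sort --merge -u file.txt.part*.sorted > file.txt.sorted   # merge the pre-sorted parts
```

In the real script the per-part sorts run in parallel under xargs; only the final 'sort --merge -u' step is serial.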

The problem I am having is parallelization with xargs. Here is a simplified version of my code:

  AVAILABLE_CORES=4
  PARTS="file.txt.part1 file.txt.part2 file.txt.part3 file.txt.part4"

  SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
                                       --max-procs=$AVAILABLE_CORES \
                                       bash sortu.sh \
                )

  # ... more code for merging the resulting parts $SORTED_PARTS ...

The expected result is a list of the sorted parts in the SORTED_PARTS variable:

  echo "$SORTED_PARTS"
  file.txt.part1.sorted
  file.txt.part2.sorted
  file.txt.part3.sorted
  file.txt.part4.sorted

Symptom

However (sometimes) there is a missing sorted part. For example, file.txt.part2.sorted:

  echo "$SORTED_PARTS"
  file.txt.part1.sorted
  file.txt.part3.sorted
  file.txt.part4.sorted

The failure is not deterministic: a run over the same .txt file may succeed one time and fail the next, and it is not always the same sorted part that goes missing.

Problem

I have a race condition: all the sortu.sh instances terminate, but xargs delivers EOF before their output to the shared stdout has been fully flushed.

Question

Is there a way to ensure that stdout is fully flushed before xargs sends the EOF?
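One way to sidestep the flush question entirely (my suggestion, not from the original post): since each sorted part's name is derived deterministically from its input name, the parent can compute the list itself instead of reading it back from the children's shared stdout. A self-contained sketch, with the two parts and a simplified sortu.sh recreated inline:

```shell
# Recreate a minimal setup: two parts and a simplified sortu.sh (no echo needed).
printf 'b\na\n' > file.txt.part1
printf 'c\na\n' > file.txt.part2
printf '%s\n' '#!/bin/bash' 'sort -u "$1" > "$1.sorted"' > sortu.sh

PARTS="file.txt.part1 file.txt.part2"
AVAILABLE_CORES=2

# Run the sorts for their side effects only; ignore the racy shared stdout.
echo "$PARTS" | xargs --max-args=1 --max-procs="$AVAILABLE_CORES" bash sortu.sh

# Derive the sorted-part names deterministically in the parent.
SORTED_PARTS=$(for p in $PARTS; do echo "$p.sorted"; done)
```

This keeps the parallel sorting but removes the dependency on what the children manage to write before xargs finishes.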

Limitations

I cannot use the GNU parallel command or sort's --parallel option.

sortu.sh code

  #!/bin/bash
  SORTED="$1.sorted"
  sort -u "$1" > "$SORTED"
  echo "$SORTED"
1 answer

The following approach never writes the intermediate contents to disk at all, and it parallelizes the splitting, the sorting, and the merging, performing all of them at once.

This version has been backported to bash 3.2; a version targeting newer releases of bash would not need eval.

  #!/bin/bash

  nprocs=5  # maybe call the nproc command instead?
  fd_min=10 # on bash 4.1, can use automatic FD allocation instead

  # create a temporary directory; delete it on exit
  tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
  trap 'rm -rf "$tempdir"' 0

  # Close extra FDs and clear traps, before optionally executing another tool.
  #
  # Doing this in subshells ensures that only the main process holds write handles on the
  # individual sorts, so that they exit when those handles are closed.
  cloexec() {
    local fifo_fd
    for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
      : "Closing fd $fifo_fd"
      # in modern bash, just: exec {fifo_fd}>&-
      eval "exec ${fifo_fd}>&-"
    done
    if (( $# )); then
      trap - 0
      exec "$@"
    fi
  }

  # For each parallel process:
  # - Run a sort -u invocation reading from an FD and writing to a FIFO
  # - Add the FIFO's name to the merge sort command line
  merge_cmd=( sort --merge -u )
  for ((i=0; i<nprocs; i++)); do
    mkfifo "$tempdir/fifo.$i"           # create the FIFO
    merge_cmd+=( "$tempdir/fifo.$i" )   # add it to the sort command line
    fifo_fd=$((fd_min+i))
    : "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
    # in modern bash: exec {fifo_fd}> >(cloexec; exec sort -u >"$tempdir/fifo.$i")
    printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
    eval "$exec_str"
  done

  # Run the big merge sort recombining output from all the FIFOs
  cloexec "${merge_cmd[@]}" & merge_pid=$!

  # Split the input stream out to all the individual sort processes...
  awk -v "nprocs=$nprocs" \
      -v "fd_min=$fd_min" \
      '{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'

  # ...when done, close our handles on the FIFOs, so their sort invocations exit
  cloexec

  # ...and wait for the merge sort to exit
  wait "$merge_pid"
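To see the core mechanism in isolation, here is a miniature, self-contained version of the same FIFO-plus-merge idea (two fixed background writers instead of the FD plumbing above; the data and file names are mine):

```shell
# Two background 'sort -u' writers feed FIFOs; one 'sort --merge -u' reads both.
tmp=$(mktemp -d)
mkfifo "$tmp/f0" "$tmp/f1"

sort -u > "$tmp/f0" <<'EOF' &
b
a
EOF
sort -u > "$tmp/f1" <<'EOF' &
c
a
EOF

# Opening the FIFOs for reading unblocks the writers; only the result hits disk.
sort --merge -u "$tmp/f0" "$tmp/f1" > merged.txt
wait
rm -rf "$tmp"
```

Each writer blocks on its FIFO until the merge opens it for reading, so the pre-sorted streams flow straight into the merge with no intermediate files.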
