Context
I need to optimize deduplication with 'sort -u', and my Linux machine has an old implementation of the sort command (i.e. 5.97) that lacks the --parallel option. Although "sort" implements parallelizable algorithms (for example, merge sort), I need to make the parallelization explicit. So I do it manually with the "xargs" command, which is about 2.5x faster than plain "sort -u" ... when it works.
Here is the intuition of what I am doing.
I am running a bash script that splits the input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the "xargs" command, which deduplicates them in parallel through the sortu.sh script (details at the end). sortu.sh wraps the call to 'sort -u' and prints the name of the resulting file (for example, "sortu.sh file.txt.part1" prints "file.txt.part1.sorted"). The sorted parts are then passed to "sort --merge -u", which merges and deduplicates them, assuming that each part is already sorted.
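The split / parallel sort / merge flow described above can be sketched roughly as follows. This is only an illustration, not the actual script: the function name parallel_sort_u, the temporary directory, and the chunk naming (split's default "aa", "ab", ... suffixes rather than .part1, .part2) are assumptions. It also collects the sorted chunks with a glob instead of capturing the names from stdout, which differs from the code shown below, and it assumes the temporary path contains no whitespace.

```shell
#!/bin/bash
# Hypothetical sketch of the split / parallel sort / merge flow.
# Usage: parallel_sort_u INPUT OUTPUT [CORES]
parallel_sort_u() {
    local input=$1 output=$2 cores=${3:-4}
    local workdir lines per_part
    workdir=$(mktemp -d) || return 1

    # Split by line count so that each core gets roughly one chunk.
    lines=$(wc -l < "$input")
    per_part=$(( (lines + cores - 1) / cores ))
    [ "$per_part" -gt 0 ] || per_part=1
    split -l "$per_part" "$input" "$workdir/part."

    # Sort and deduplicate the chunks in parallel; each chunk writes its
    # own .sorted file. -n 1 and -P are the short forms of --max-args=1
    # and --max-procs.
    ls "$workdir"/part.* | xargs -n 1 -P "$cores" \
        sh -c 'sort -u "$1" > "$1.sorted"' sh

    # Merge the already-sorted chunks, deduplicating across them.
    sort --merge -u "$workdir"/part.*.sorted > "$output"
    rm -rf "$workdir"
}
```

For example, "parallel_sort_u file.txt file.txt.dedup 4" would write the deduplicated lines of file.txt to file.txt.dedup (the output name is made up for the example).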
The problem I am having is parallelization with xargs. Here is a simplified version of my code:
    AVAILABLE_CORES=4
    PARTS="file.txt.part1 file.txt.part2 file.txt.part3 file.txt.part4"
    SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
                                         --max-procs=$AVAILABLE_CORES \
                                         bash sortu.sh \
                  )
    ...
The expected result is a list of the sorted parts in the SORTED_PARTS variable:
    echo "$SORTED_PARTS"
    file.txt.part1.sorted
    file.txt.part2.sorted
    file.txt.part3.sorted
    file.txt.part4.sorted
Symptom
However, sometimes a sorted part is missing from the list, for example file.txt.part2.sorted:
    echo "$SORTED_PARTS"
    file.txt.part1.sorted
    file.txt.part3.sorted
    file.txt.part4.sorted
The symptom is non-deterministic both in its occurrence (a run on the same .txt file succeeds one time and fails another) and in which part goes missing (it is not always the same sorted part).
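To narrow the symptom down, a small check along these lines can tell whether a missing name is only absent from the captured output or whether the .sorted file was never written. The helper name report_missing_parts and its message format are made up for illustration, and it assumes part names without whitespace.

```shell
#!/bin/bash
# Hypothetical diagnostic helper.
# Usage: report_missing_parts "$PARTS" "$SORTED_PARTS"
report_missing_parts() {
    local parts=$1 reported=$2 p state
    # Collapse newlines to single spaces and pad with spaces so that
    # names can be matched as whole words.
    reported=" $(echo $reported) "
    for p in $parts; do
        case "$reported" in
            *" $p.sorted "*) continue ;;   # name was reported, nothing to do
        esac
        if [ -e "$p.sorted" ]; then
            state="exists on disk"
        else
            state="not on disk"
        fi
        echo "$p.sorted: missing from output, $state"
    done
}
```

If a part is reported as "exists on disk", that would suggest the sort itself ran and only the name printed by sortu.sh was lost.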
Problem
I suspect a race condition in which all instances of sortu.sh finish and "xargs" sends EOF before stdout is flushed.
Question
Is there a way to ensure that stdout is flushed before xargs sends the EOF?
Limitations
I cannot use the parallel command or the "--parallel" option of sort.
sortu.sh code
    #!/bin/bash
    SORTED="$1.sorted"
    sort -u "$1" > "$SORTED"
    echo "$SORTED"