
Parallel while loop with arrays read from file in bash

My while loop in Bash is being processed as follows:

    while IFS=$'\t' read -r -a line; do
        myprogram "${line[0]}" "${line[1]}" "${line[0]}_vs_${line[1]}.result"
    done < fileinput

It reads from a file with this structure, for reference:

    foo<TAB>bar
    baz<TAB>foobar

etc. (with tab delimiters).

I would like to parallelize this loop using GNU parallel, since there are many records and processing is slow. However, from the examples it is not clear how to assign each row to an array, as I do here.

What would be a possible solution (or an alternative to GNU parallel)?

+9
3 answers

From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Use-a-table-as-input :

The contents of table_file.tsv:

    foo<TAB>bar
    baz <TAB> quux

To run:

    cmd -o bar -i foo
    cmd -o quux -i baz

you can run:

    parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}

So in your case it will be:

 cat fileinput | parallel --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result 
+9

I like @chepner's approach. And it is not so difficult to get similar behavior with a limited number of parallel executions:

    while IFS=$'\t' read -r f1 f2; do
        myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
        # At most as many jobs as CPU cores
        [ "$(jobs | wc -l)" -ge "$(nproc)" ] && wait
    done < fileinput
    wait

This limits execution to the number of CPU cores present in the system. You can easily change that by replacing $(nproc) with the desired count.

Keep in mind that this is not a fair distribution: it does not start a new job as soon as one finishes. Instead, once the maximum number of jobs is running, it waits for all of them to complete. As a result, total throughput may be somewhat lower than with parallel, especially if your program's running time varies over a wide range. If each invocation takes roughly the same time, the total time should be about the same.
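On bash 4.3 and newer, the `wait -n` builtin gives a fairer version of this throttle: it blocks until any one job exits, so a replacement job starts immediately instead of after the whole batch drains. A sketch, with a stand-in `myprogram` and a generated sample `fileinput` so it runs as-is:

```shell
#!/usr/bin/env bash
# Fairer throttling with wait -n (bash >= 4.3): as soon as ANY running
# job exits, a new one is launched.

# Stand-ins for the question's myprogram and fileinput.
myprogram() { sleep 0.1; echo "${1}_vs_${2}" >> results.txt; }
printf 'foo\tbar\nbaz\tquux\n' > fileinput
: > results.txt

max_jobs=$(nproc)
while IFS=$'\t' read -r f1 f2; do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
    # If the pool is full, block until one job finishes.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
done < fileinput
wait   # collect the remaining jobs
```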

+5

parallel is not strictly necessary here; you can just run all the processes in the background and then wait for them to finish. An array is not needed either, since read can take more than one variable:

    while IFS=$'\t' read -r f1 f2; do
        myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
    done < fileinput
    wait

This starts one job for each line in your file, whereas parallel can limit how many jobs run at a time. You can do the same in bash, but it's more complicated.

+3
