Parallel while loop with arrays read from file in bash

Question


My while loop in Bash currently looks like this:

while IFS=$'\t' read -r -a line; do myprogram ${line[0]} ${line[1]} ${line[0]}_vs_${line[1]}.result; done < fileinput

It reads from a file with this structure, for reference:

 foo	bar
 baz	foobar

etc. (with tab delimiters).

I would like to parallelize this loop with GNU parallel, since there are many records and processing can be slow. However, the examples do not make clear how to assign each row to an array, as I do here.

What would be a possible solution? Alternatives to GNU parallel are also welcome.

+9

bash parallel-processing gnu-parallel

Einar May 16, '13 at 15:15


3 answers

I like @chepner's hack. It is not hard to get similar behavior with a limited number of parallel executions:

 while IFS=$'\t' read -r f1 f2; do
     myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
     # Run at most as many jobs as there are CPU cores
     [ "$( jobs | wc -l )" -ge "$( nproc )" ] && wait
 done < fileinput
 wait

It limits the number of concurrent jobs to the number of CPU cores in the system. You can easily change this by replacing $( nproc ) with the desired number.

Keep in mind, however, that this is not fair scheduling: a new job is not started as soon as one finishes. Instead, once the maximum number of jobs is running, the loop waits for all of them to complete before starting the next batch. Total throughput may therefore be somewhat lower than with parallel, especially if the running time of your program varies widely. If each call takes about the same time, the total time should be roughly equivalent.
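If the batch-wise waiting is a problem, newer versions of Bash (4.3 and later) have wait -n, which returns as soon as any single background job exits. A minimal sketch under that assumption follows; the myprogram function here is only a hypothetical stand-in that writes its two arguments to the result file, and the run_all wrapper, the MAX_JOBS variable and the fallback of 4 jobs are names and values made up for the illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real myprogram: writes both fields to the result file.
myprogram() { printf '%s %s\n' "$1" "$2" > "$3"; }

# Assumption: default to one job per CPU core, or 4 if nproc is unavailable.
max_jobs=${MAX_JOBS:-$(nproc 2>/dev/null || echo 4)}

run_all() {
    local input=$1 f1 f2
    while IFS=$'\t' read -r f1 f2; do
        # While all slots are busy, block until any one job exits (needs bash >= 4.3).
        while (( $(jobs -rp | wc -l) >= max_jobs )); do
            wait -n
        done
        myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
    done < "$input"
    # Wait for the jobs that are still running after the last row was read.
    wait
}
```

Calling run_all fileinput then behaves like the loop above, except that a free slot is refilled as soon as one job finishes instead of after the whole batch.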

+5

Hubbitus Oct 10 '15 at 20:23


parallel is not strictly necessary here; just run all the processes in the background and then wait for them to complete. An array is also not needed, since you can give read more than one variable:

 while IFS=$'\t' read -r f1 f2; do
     myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
 done < fileinput
 wait

This starts one job for each item in your list, whereas parallel can limit the number of jobs running at a time. You can do the same in bash, but it's complicated.

+3

chepner May 16 '13 at 18:18


Ole tange · Accepted Answer · 2013-05-16T16:26:55+0000

From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Use-a-table-as-input :

""
The contents of table_file.tsv:

 foo<TAB>bar
 baz <TAB> quux

To run:

 cmd -o bar -i foo
 cmd -o quux -i baz

you can run:

 parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}

""

So in your case it will be:

 cat fileinput | parallel --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result
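Before launching the real jobs, it can help to check which commands GNU parallel will build from the table. A small sketch, assuming GNU parallel is installed: --dry-run prints each generated command instead of executing it, -k keeps the output in input order, and -j caps the number of simultaneous jobs. The two-row sample file and the limit of 4 jobs are made up for the illustration.

```bash
#!/usr/bin/env bash
# Throwaway sample with the same tab-separated layout as fileinput (assumed data).
sample=$(mktemp)
printf 'foo\tbar\nbaz\tfoobar\n' > "$sample"

# Print one myprogram command line per input row without running anything.
# Dropping --dry-run would execute the commands, at most 4 at a time.
parallel --dry-run -k -j 4 --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result < "$sample"
```

Without -j, GNU parallel defaults to one job per CPU core.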
