Easy parallelization

I often find that I write simple loops to perform operations on many files, for example:

for i in `find . | grep ".xml$"`; do bzip2 $i; done 

It seems a little depressing that my 4-core machine only uses one core... is there an easy way to add parallelism to my shell script?

EDIT: To give a little more context for my problem; I'm sorry I wasn't clearer to start with!

I often want to run simple(ish) scripts that, for example, plot, compress or decompress, or run some program over a reasonable number of data sets (usually between 100 and 10,000). The scripts I use for such problems look like the one above, but may have a different command, or even a sequence of commands, to execute.

For example, just now I ran:

 for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done 

So my problem is by no means bzip2-specific! (Although parallel bzip2 does look cool, and I intend to use it in the future.)

+6
Tags: bash, parallel-processing
8 answers

If you had to solve the problem today, you would probably use a tool like GNU Parallel (unless there is a specialized tool for your task, e.g. pbzip2):

 find . | grep ".xml$" | parallel bzip2 
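GNU Parallel defaults to running one job per CPU core; if you want a different cap, the -j option takes a job count. A small variation on the line above (my addition, not part of the original answer):

 find . -name '*.xml' | parallel -j 4 bzip2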

To learn more:

+1

Solution: use xargs for parallel operation (don't forget the -n option!)

 find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2 
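A small variation, assuming GNU coreutils' nproc is available, sizes the job count to the machine instead of hard-coding 3:

 find . -name '*.xml' -print0 | xargs -0 -n 1 -P "$(nproc)" bzip2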
+14

This Perl program fits your needs fairly well; you would just do:

 runN -n 4 bzip2 `find . | grep ".xml$"` 
+6

GNU make has a nice parallelism feature (e.g. -j 5) that will work in your case. Create a Makefile:

# Note: the bzip2 recipe line must be indented with a real tab.
%.xml.bz2 : %.xml
	bzip2 $<

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))

then do

 nice make -j 5 

replace "5" with a number, probably 1 more than the number of processors. You might want to make it "enjoyable" in case someone wants to use the car while you are on it.

+4

The answer to the general question is complex, because it depends on the details of what you are parallelizing. On the other hand, for this specific purpose you should use pbzip2 instead of plain bzip2 (most likely pbzip2 is already installed, or at least in the repositories of your distribution). See here for more details: http://compression.ca/pbzip2/
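For basic use, pbzip2 is a drop-in replacement, so the original loop barely changes (a sketch, assuming pbzip2 is installed):

 for i in `find . | grep ".xml$"`; do pbzip2 $i; done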

+2

I find this kind of operation counterproductive. The reason is that the more processes access the disk at the same time, the longer the read/write times get, so the end result is a longer overall run. The bottleneck here will not be the CPU, no matter how many cores you have.

Have you ever made two large file copies on the same hard drive at the same time? I usually find it faster to copy one and then the other.

I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "hard" path, which we all tend to choose more often than necessary.
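One quick way to check is to time a single compression and compare user time against wall-clock time (my example; it assumes a representative sample.xml exists). If user time is close to real time, the job is CPU-bound and parallelism should help; if real time is much larger, the disk is the bottleneck:

 time bzip2 -k sample.xml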

+2

I did something like this in bash. The parallel trick is probably a lot faster for one-off use, but here is the main section of code to implement something similar in bash; you will need to modify it for your own purposes:

#!/bin/bash
# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.

set -m

nodes=`grep processor /proc/cpuinfo | wc -l`
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

isin()
{
  local v=$1
  shift 1
  while (( $# > 0 ))
  do
    if [ $v = $1 ]; then return 0; fi
    shift 1
  done
  return 1
}

dowait()
{
  while true
  do
    nj=( $(jobs -p) )
    if (( ${#nj[@]} < nodes ))
    then
      for (( o=0; o<nodes; o++ ))
      do
        if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
      done
      return
    fi
    sleep 1
  done
}

let x=0
while (( x < NNN ))
do
  for (( o=0; o<nodes; o++ ))
  do
    if (( job[o] == 0 )); then break; fi
  done

  if (( o == nodes )); then dowait; continue; fi

  CMD &
  let job[o]=$!
  let x++
done
wait
+2

I think you could do the following:

 for i in `find . | grep ".xml$"`; do bzip2 $i & done

But this forks a background process for every file at once rather than just four at a time, so it is not optimal.
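If you want to stay in plain bash but cap the number of simultaneous jobs, one rough sketch (assuming bash 4.3 or newer for wait -n) is:

max_jobs=4
for i in `find . | grep ".xml$"`; do
  bzip2 "$i" &
  # Once max_jobs are running, wait for one of them to finish.
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n
  done
done
wait   # wait for the remaining jobs to finish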

+1
