Process start from bash script failed

I have a central server where I periodically run a script (from cron) that checks remote servers. The checks run sequentially: first one server, then the next, and so on.

This script (on the central server) runs another script (let's call it update.sh) on the remote machine, and that remote script does something like this:

    processID=`pgrep "processName"`
    kill $processID
    startProcess.sh

The process is killed, and then startProcess.sh starts it again as follows:

    pidof "processName"
    if [ ! $? -eq 0 ]; then
        nohup "processName" "processArgs" >> "processLog" &
        pidof "processName"
        if [ ! $? -eq 0 ]; then
            echo "Error: failed to start process"
            ...

update.sh, startProcess.sh, and the actual binary of the process being launched are all located on an NFS share mounted from the central server.

What happens sometimes is that the process I try to start in startProcess.sh does not start, and I get the error. The strange part is that it is random: on the same machine the process starts one time and fails to start another time. I check about 300 servers, and the errors always hit random machines.

One more thing: the remote servers are in 3 different geographical locations (2 in America and 1 in Europe), and the central server is in Europe. From what I have seen so far, the servers in America have many more errors than the ones in Europe.

At first I thought the error had something to do with the kill, so I added a sleep between the kill and startProcess.sh, but that didn't make any difference.

It also seems that the process from startProcess.sh either does not start at all or something happens to it immediately after it starts, because there is no output in the log file, and there should be.

So here I ask for help

Has anyone had a problem like this, or does anyone know what might be wrong?

Thanks for any help

linux bash
20 Oct '14 at 22:09
1 answer

(Sorry, but my original answer was pretty wrong ... Here is the fix)

Using $? to get the exit status of the background process in startProcess.sh gives an incorrect result. From man bash:

    Special Parameters
        ?    Expands to the status of the most recently executed foreground pipeline.

As you mentioned in your comment, the correct way to get the exit status of a background process is to use the wait built-in. But for this, bash must handle the SIGCHLD signal.
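To make the difference concrete, here is a minimal sketch of the simplest, blocking use of wait. The subshell that exits with status 3 is only a stand-in for a process that fails right after launch; it is an assumption for illustration and does not come from the original scripts. Right after `&`, $? only says that the job was launched, while `wait $!` returns the child's real exit status.

```shell
#!/bin/bash
# Stand-in for a process that fails right after launch (hypothetical).
( exit 3 ) &
launch_status=$?      # status of launching the job, not of the job itself
child_pid=$!          # PID of the most recent background job

wait "$child_pid"     # blocks until that child has exited
child_status=$?       # the child's real exit status

echo "status after '&': $launch_status"
echo "status from wait: $child_status"
```

With this stand-in the first line should report 0 and the second 3, which is why checking $? directly after a backgrounded command cannot detect a failed start.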

I made a small test environment to show how it can work:

Here is the script loop.sh to run in the background:

    #!/bin/bash
    [ "$1" == -x ] && exit 1;
    cnt=${1:-500}
    while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done

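If the argument is -x, the script exits immediately with an exit status of 1 to simulate an error. If the argument is a number num, it prints SLEEPING [<PID>]: <counter>/<max_counter> to stdout num times, sleeping 5 seconds after each line (so it runs for about num * 5 seconds). Without an argument, num defaults to 500.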

The second script is the launcher. It runs 3 loop.sh scripts in the background and prints their exit status:

    #!/bin/bash

    handle_chld() {
        local tmp=()
        for i in ${!pids[@]}; do
            if [ ! -d /proc/${pids[i]} ]; then
                wait ${pids[i]}
                echo "Stopped ${pids[i]}; exit code: $?"
                unset pids[i]
            fi
        done
    }

    set -o monitor
    trap "handle_chld" CHLD

    # Start background processes
    ./loop.sh 3 &
    pids+=($!)
    ./loop.sh 2 &
    pids+=($!)
    ./loop.sh -x &
    pids+=($!)

    # Wait until all background processes are stopped
    while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
    echo STOPPED

The handle_chld function will process SIGCHLD signals. The monitor setting allows a non-interactive script to receive SIGCHLD. Then a trap is set for the SIGCHLD signal.

Then the background processes are started, and all their PIDs are stored in the pids array. When SIGCHLD is received, the handler looks at the /proc/<PID> directories to find out which child process has exited (its directory is missing); this could also be checked with the bash built-in kill -0 <PID>. After wait returns, the exit status of that background process is available in the well-known special parameter $?.
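The kill -0 alternative mentioned above can be sketched as follows. The helper name is_running and the `sleep 1` child are assumptions made for this illustration only. Signal 0 delivers nothing; it only checks whether the PID exists and may be signalled. Like the /proc check, this is approximate, because a PID can in principle be reused by an unrelated process.

```shell
#!/bin/bash
# is_running is a hypothetical helper: it returns 0 if a process with the
# given PID exists and we are allowed to signal it, non-zero otherwise.
is_running() {
    kill -0 "$1" 2>/dev/null
}

sleep 1 &             # short-lived stand-in for a real background process
pid=$!

if is_running "$pid"; then before="running"; else before="gone"; fi
wait "$pid"           # reap the child so its PID is released
if is_running "$pid"; then after="running"; else after="gone"; fi

echo "before wait: $before, after wait: $after"
```

Unlike the /proc test, this check does not depend on a mounted /proc filesystem.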

The main script waits for all PIDs to stop (otherwise it would not be able to get the exit status of its children), and then it exits.

Output Example:

    WAITING FOR: 13102 13103 13104
    SLEEPING [13103]: 1/2
    SLEEPING [13102]: 1/3
    Stopped 13104; exit code: 1
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13103]: 2/2
    SLEEPING [13102]: 2/3
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13102]: 3/3
    Stopped 13103; exit code: 0
    WAITING FOR: 13102
    WAITING FOR: 13102
    WAITING FOR: 13102
    Stopped 13102; exit code: 0
    STOPPED

You can see that the exit codes are reported correctly.
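Applied to the start script from the question, the same idea might look like the sketch below. Everything in it is an assumption rather than the original code: start_and_check is a hypothetical helper, the one-second grace period is arbitrary, and the log path and command are whatever the caller passes in. It only reports whether the process died right after launch and with which status; it does not explain why that happens.

```shell
#!/bin/bash
# start_and_check is a hypothetical helper, not part of the original scripts.
# Usage: start_and_check <logfile> <command> [args...]
# It starts the command in the background, gives it a short grace period,
# and reports the exit status if the command has already died.
start_and_check() {
    local log=$1
    shift
    nohup "$@" >> "$log" 2>&1 &
    local pid=$!
    local status

    sleep 1                               # grace period (arbitrary choice)
    if kill -0 "$pid" 2>/dev/null; then
        echo "started, PID $pid"
        return 0
    fi

    wait "$pid"                           # child is gone: collect its status
    status=$?
    echo "Error: failed to start process (exit status $status)"
    return 1
}
```

With the placeholders from the question, the call would be `start_and_check "processLog" "processName" "processArgs"`. Logging the exit status next to the error message gives a first hint about whether the binary failed on its own or was killed by a signal (statuses above 128).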

Hope this helps a bit!

Oct 21 '14 at 8:06