I am trying to test behavior that is difficult to reproduce in a controlled environment.
Use case: a Linux system, usually Red Hat Enterprise Linux (RHEL) 5 or 6 (we are just starting with RHEL 7 and systemd, so that is currently out of scope).
There are situations when I need to restart the service. The script we use to stop the service usually works pretty well: it sends SIGTERM to a process that is designed to handle it; if the process has not exited within a timeout (usually a couple of minutes), the script sends SIGKILL and then waits another couple of minutes.
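To make the stop sequence concrete, here is a minimal Python sketch of the logic described above. It is not our actual script: the function names (`has_exited`, `wait_for_exit`, `stop_process`), the polling interval, and the return values are assumptions made for illustration, and the default timeouts only mirror the "couple of minutes" mentioned above.

```python
import os
import signal
import time


def has_exited(pid):
    """Return True if no live process with this PID remains."""
    try:
        # If the target happens to be our own child, reap it; otherwise a
        # zombie would still answer to signal 0 and look alive.
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return True
    except ChildProcessError:
        pass  # not our child, fall through to the signal-0 probe
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def wait_for_exit(pid, timeout, poll_interval=0.1):
    """Poll until the process is gone or the timeout (seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_exited(pid):
            return True
        time.sleep(poll_interval)
    return has_exited(pid)


def stop_process(pid, term_timeout=120.0, kill_timeout=120.0):
    """Send SIGTERM, then SIGKILL; return 'terminated', 'killed' or 'stuck'."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "terminated"  # already gone before we sent anything
    if wait_for_exit(pid, term_timeout):
        return "terminated"
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"  # exited between the last poll and the SIGKILL
    if wait_for_exit(pid, kill_timeout):
        return "killed"
    return "stuck"  # survived SIGKILL: the case this question is about
```

The important part is the third outcome: a caller that treats "stuck" the same as "killed" will happily start a second instance.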
Problem: in some (rare) situations, the process does not exit after SIGKILL; this usually happens when it is badly stuck in a system call, possibly due to a kernel-level problem (a damaged file system, a broken NFS mount, or something equally bad that requires manual intervention).
The bug occurred when the script did not notice that the "old" process had not actually exited and started a new process while the old one was still running; we are fixing this with a stronger locking scheme (so that at least the new process does not start while the old one is still running), but I find it difficult to test all of this because I haven't found a way to simulate a hard-stuck process.
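For illustration only, one possible shape for such a lock is an advisory `flock` on a lock file that the service holds for its whole lifetime. This is a sketch under assumptions, not necessarily the scheme we ended up with: the function name `acquire_service_lock` and the lock file path are hypothetical, and the lock file is assumed to live on a local file system (`flock` behavior over NFS varies).

```python
import fcntl
import os


def acquire_service_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success (keep it open for the
    lifetime of the service), or None if another holder has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else still holds the lock, e.g. an old instance that
        # has not really exited yet.
        os.close(fd)
        return None
    return fd
```

The idea is that the kernel drops the lock only when the holder's descriptor is closed, which normally happens when the process really goes away, so an old instance that has not torn down its descriptors should keep blocking the new one. Whether that holds for every kind of stuck process is exactly what I would like to be able to test.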
So the question is:
How can I manually simulate a process that does not exit when SIGKILL is sent to it, even by a privileged user?