Why does "killpg" return "Operation not permitted" when ownership is correct?

Question

I have some code which calls fork(), calls setsid() in the child, and starts some processing. If any of the children exits (waitpid(-1, 0)), I kill all of the child process groups:

 child_pids = []
 for child_func in child_functions:
     pid = fork()
     if pid == 0:
         setsid()
         child_func()
         exit()
     else:
         child_pids.append(pid)

 waitpid(-1, 0)
 for child_pid in child_pids:
     try:
         killpg(child_pid, SIGTERM)
     except OSError as e:
         if e.errno != 3: # 3 == no such process
             print "Error killing %s: %s" %(child_pid, e)

However, sometimes the killpg call fails with "Operation not permitted":

  Error killing 22841: [Errno 1] Operation not permitted

Why can this happen?

Full working example:

 from signal import SIGTERM
 from sys import exit
 from time import sleep
 from os import *

 def slow():
     # Forks once more, so the process group holds two processes.
     fork()
     sleep(10)

 def fast():
     sleep(1)

 child_pids = []
 for child_func in [fast, slow, slow, fast]:
     pid = fork()
     if pid == 0:
         # Child: become the leader of a new session and process group.
         setsid()
         child_func()
         exit(0)
     else:
         child_pids.append(pid)

 # Block until any one child exits, then signal every child's group.
 waitpid(-1, 0)
 for child_pid in child_pids:
     try:
         killpg(child_pid, SIGTERM)
     except OSError as e:
         print "Error killing %s: %s" % (child_pid, e)

Which gives:

  $ python killpg.py
 Error killing 23293: [Errno 3] No such process
 Error killing 23296: [Errno 1] Operation not permitted

+7

python unix bsd macos

David Wolever 20 Sep '12 at 22:16

3 answers

You apparently cannot kill a process group that consists of zombies. When a process exits, it becomes a zombie until someone calls waitpid on it. Typically, init takes ownership of children whose parents have died, to avoid orphaned zombie children.

So the process is still around in a sense, but it gets no CPU time and ignores any kill commands sent directly to it. However, if a process group consists entirely of zombies, the behavior appears to be that killing the process group fails with EPERM instead of failing silently. Note that killing a process group that contains non-zombies still succeeds.

An example program demonstrating this:

 import os
 import time

 res = os.fork()
 if res:
     time.sleep(0.2)
     pgid = os.getpgid(res)
     print pgid
     while 1:
         try:
             print os.kill(-pgid, 9)
         except Exception, e:
             print e
             break
     print 'wait', os.waitpid(res, 0)
     try:
         print os.kill(-pgid, 9)
     except Exception, e:
         print e
 else:
     os.setpgid(0, 0)
     while 1:
         pass

The output looks like

 56621
 None
 [Errno 1] Operation not permitted
 wait (56621, 9)
 [Errno 3] No such process

The parent kills the child with SIGKILL, then tries again. The second time it gets EPERM, so it waits for the child (reaping it and destroying its process group). The third kill then produces ESRCH, as expected.

+5

nneonneo 21 sept '12 at 0:13

From adding more logging, it seems that killpg sometimes returns EPERM instead of ESRCH:

 #!/usr/bin/python

 from signal import SIGTERM
 from sys import exit
 from time import sleep
 from os import *

 def slow():
     fork()
     sleep(10)

 def fast():
     sleep(1)

 child_pids = []
 for child_func in [fast, slow, slow, fast]:
     pid = fork()
     if pid == 0:
         setsid()
         print child_func, getpid(), getuid(), geteuid()
         child_func()
         exit(0)
     else:
         child_pids.append(pid)

 print waitpid(-1, 0)
 for child_pid in child_pids:
     try:
         print child_pid, getpgid(child_pid)
     except OSError as e:
         print "Error getpgid %s: %s" %(child_pid, e)
     try:
         killpg(child_pid, SIGTERM)
     except OSError as e:
         print "Error killing %s: %s" %(child_pid, e)

Whenever killpg fails with EPERM, getpgid has previously failed with ESRCH. For example:

 <function fast at 0x109950d70> 26561 503 503
 <function slow at 0x109950a28> 26562 503 503
 <function slow at 0x109950a28> 26563 503 503
 <function fast at 0x109950d70> 26564 503 503
 (26564, 0)
 26561 Error getpgid 26561: [Errno 3] No such process
 Error killing 26561: [Errno 1] Operation not permitted
 26562 26562
 26563 26563
 26564 Error getpgid 26564: [Errno 3] No such process
 Error killing 26564: [Errno 3] No such process

I have no idea why this happens: whether it is legal behavior, a bug in Darwin (inherited from FreeBSD or otherwise), or something else.

It seems you could work around it by double-checking an EPERM with a call to kill(child_pid, 0); if that returns ESRCH, there is no actual permissions problem. Of course, this looks pretty ugly in code:

 for child_pid in child_pids:
     try:
         killpg(child_pid, SIGTERM)
     except OSError as e:
         if e.errno != 3: # 3 == no such process
             if e.errno == 1:
                 try:
                     kill(child_pid, 0)
                 except OSError as e2:
                     if e2.errno != 3:
                         print "Error killing %s: %s" %(child_pid, e)
             else:
                 print "Error killing %s: %s" %(child_pid, e)
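The same workaround might read a little more cleanly as a helper that uses the named constants from the errno module instead of the literals 1 and 3. This is only a sketch in Python 3 syntax; the name killpg_lenient and its True/False return convention are made up for illustration. It mirrors the logic above: an EPERM counts as a real error only if the follow-up probe with signal 0 fails with something other than ESRCH.

```python
import errno
import os
import signal


def killpg_lenient(pgid, sig=signal.SIGTERM):
    """Signal a process group (hypothetical helper, not from the original post).

    Returns True if the signal was sent, and False if the group is gone
    or looks like a zombie-only group. Re-raises any other error.
    """
    try:
        os.killpg(pgid, sig)
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # no such process group
        if e.errno != errno.EPERM:
            raise
        killpg_error = e  # keep it; "e" is unbound after this block

    # EPERM: probe the group leader with signal 0, which only checks
    # existence and permission without delivering anything.
    try:
        os.kill(pgid, 0)
    except OSError as probe:
        if probe.errno != errno.ESRCH:
            raise killpg_error  # looks like a genuine permission problem
    return False
```

The cleanup loop then shrinks to calling killpg_lenient(child_pid) for each child and reporting only the exceptions that escape.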

+1

abarnert 20 sept '12 at 23:34

Steve Kehlet · Accepted Answer · 2012-09-20T23:51:50+0000

I added some debugging (slightly modified source). This happens when you try to kill a process group whose process has already exited and is in zombie state. Oh, and it is easily reproducible with just [fast, fast].

 $ python so.py
 spawned pgrp 6035
 spawned pgrp 6036
 Reaped pid: 6036, status: 0
 6035 6034 6035 Z (Python)
 6034 521 6034 S+ python so.py
 6037 6034 6034 S+ sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
 6039 6037 6034 R+ grep -i python
 killing pg 6035
 Error killing 6035: [Errno 1] Operation not permitted
 6035 6034 6035 Z (Python)
 6034 521 6034 S+ python so.py
 6040 6034 6034 S+ sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
 6042 6040 6034 S+ grep -i python
 killing pg 6036
 Error killing 6036: [Errno 3] No such process

Not sure how to deal with this. Perhaps you can put the waitpid in a while loop to reap all terminated child processes, and then proceed with killpg() on the rest.
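The reap-then-kill idea could be sketched roughly like this. It is written in Python 3 syntax and is not code from the original post; the helper names reap_exited_children and terminate_groups are made up for illustration. It assumes every child called setsid(), so a child's PID is also its process group ID. A child can still exit between the reaping loop and the killpg() call, so EPERM remains possible on Mac OS; the sketch reports it instead of hiding it.

```python
import errno
import os
import signal


def reap_exited_children():
    """Reap every child that has already exited, without blocking.

    Returns the list of reaped PIDs (hypothetical helper).
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children remain, but none has exited yet
        reaped.append(pid)
    return reaped


def terminate_groups(pgids, sig=signal.SIGTERM):
    """Signal each process group; ESRCH (group already gone) is ignored.

    Returns a dict mapping pgid to errno for every other failure.
    """
    failures = {}
    for pgid in pgids:
        try:
            os.killpg(pgid, sig)
        except OSError as e:
            if e.errno != errno.ESRCH:
                failures[pgid] = e.errno
    return failures
```

After the first blocking waitpid(-1, 0) returns, one would call reap_exited_children() and then terminate_groups(child_pids). Signalling every group, including those whose leader was just reaped, still reaches any processes a child forked into its group, such as the extra process created by slow().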

But the answer to your question is: you get EPERM because you are not allowed to kill a zombie process group leader (at least on Mac OS).

Also, this is verifiable outside of Python. If you put a sleep in there, find the pgrp of one of those zombies, and try to kill its process group, you also get EPERM:

 $ kill -TERM -6115
 -bash: kill: (-6115) - Operation not permitted

I confirmed that this does not happen on Linux.
