Another Take on Zombie Processes

nehc

Article link: https://blog.csdn.net/cenziboy/article/details/8437466

I. What Is a Zombie Process

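A zombie process, reported as zombie or defunct by the ps and top commands, is a process in the "zombie" state. Such a process is already dead, yet in a sense it lives on. It is dead because its resources (memory, connections to peripherals, and so on) have been released and it cannot and never will run again. It lives on because its process descriptor still exists in the system's process table.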

Here are the process states Linux defines, from include/linux/sched.h:

/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have two separate sets of flags: task->state
 * is about runnability, while task->exit_state are
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */
#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED          4
#define __TASK_TRACED           8
/* in tsk->exit_state */
#define EXIT_ZOMBIE             16
#define EXIT_DEAD              32
/* in tsk->state again */
#define TASK_DEAD               64
#define TASK_WAKEKILL           128
#define TASK_WAKING             256

Notice EXIT_ZOMBIE, with value 16. Being a zombie is simply one of the states a process can be in, so it is no surprise that zombie processes show up.

II. How Zombie Processes Arise

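So how do zombie processes come about? To answer that we have to look at how Linux creates and destroys processes. Creating a process is like a person being born: it is the same for everyone, and no birth is grander than another. Destruction is different, because somebody has to settle the affairs of the deceased. This too resembles human life, where the effort of settling someone's affairs varies greatly from person to person.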

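Destroying a process takes two stages:

1. The process terminates (main returns or the program calls exit) or is killed (by a signal such as SIGTERM or SIGKILL).

2. The parent of that process must call, or must already be blocked in, the wait4 (wait, waitpid) system call when the child terminates. This system call makes the kernel release the resources it has kept for the child.

A zombie appears only when 1 holds and 2 does not. In other words, from the moment a process terminates (1) until its process descriptor is removed from the process table, it is what we call a zombie process. A zombie can sit in the process table for as long as its parent stays alive without waiting for it, in the worst case until the system reboots.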

III. How to Avoid Zombie Processes

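First, a clarification: in the ordinary case every process (except init) passes through the zombie state when it dies. What we are discussing here is how to get zombies cleaned up as quickly as possible.

1. The parent calls the wait4() or waitpid() system call. There are other wait-family library functions as well, such as wait3() and wait(), but on Linux these library functions are implemented on top of the wait4() and waitpid() system calls.

Because calling wait() blocks the parent, the classic approach is for the parent to register a handler for SIGCHLD and call wait() inside that handler.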

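#include <signal.h>
#include <stddef.h>
#include <sys/wait.h>

void ouch(int sig)
{
    (void)sig;
    wait(NULL);  // reap the child inside the signal handler
}


int main(void)
{
    signal(SIGCHLD, ouch);
    // do what you want
    return 0;
}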

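2. Fork twice and take advantage of the orphan process

The parent first fork()s a child and blocks waiting for it. The child immediately fork()s a grandchild and then exits at once. The grandchild does the actual work. Because the child has already exited, the grandchild becomes an orphan and is adopted by init. And because the child exits right away, the parent is notified promptly and reaps the child properly.

See the relevant chapter of APUE for details.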

3. The parent ignores the SIGCHLD signal (this comes with limitations; it appears to work only on Linux 2.6 and later kernels)

    signal(SIGCHLD, SIG_IGN); // ignore SIGCHLD
    pid = fork(); // children created from here on will not become zombies

man sigaction has this to say:

POSIX.1-1990  disallowed setting the action for SIGCHLD to SIG_IGN.  POSIX.1-2001 allows this possibility,
so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)).  Nevertheless,  the
historical  BSD  and  System V behaviors for ignoring SIGCHLD differ, so that the only completely portable
method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and
perform a wait(2) or similar.

And man 2 wait says:

POSIX.1-2001  specifies  that  if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is
set for SIGCHLD (see sigaction(2)), then children that terminate do not  become  zombies  and  a  call  to
wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD.
(The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified.  Note that  even
though  the  default  disposition  of  SIGCHLD  is "ignore", explicitly setting the disposition to SIG_IGN
results in different treatment of zombie process children.)  Linux 2.6  conforms  to  this  specification.
However,  Linux  2.4  (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being
ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the  call  blocks  until
the next child terminates and then returns the process ID and status of that child.

What happens if the parent has ignored SIGCHLD and then calls wait4() or waitpid() anyway?

The function do_notify_parent(struct task_struct *tsk, int sig) in kernel/signal.c carries this comment:

/*
* We are exiting and our parent doesn't care.  POSIX.1 defines special semantics for setting SIGCHLD to SIG_IGN
* or setting the SA_NOCLDWAIT flag: we should be reaped automatically and not left for our parent's wait4 call.
* Rather than having the parent do it as a magic kind of signal handler, we just set this to tell do_exit that
* we can be cleaned up without becoming a zombie.  Note that we still call __wake_up_parent in this case,
* because a blocked sys_wait4 might now return -ECHILD.
*
* Whether we send SIGCHLD or not for SA_NOCLDWAIT is implementation-defined: we do (if you don't want
* it, just use SIG_IGN instead).
*/

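So why do children no longer become zombies once the parent has called signal(SIGCHLD, SIG_IGN)? Is it because the child inherits the parent's "ignore SIGCHLD" disposition?

My assumption was that when a child exits, exit() checks whether SIGCHLD is being ignored, and if so, the kernel cleans up all of the child's resources inside do_exit().

I ran a small experiment, and it proved me wrong: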

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int i=0;
    pid_t pid = 0;
    signal(SIGCHLD, SIG_IGN); // parent ignores SIGCHLD
    pid = fork();
    if(pid > 0)
    {
        puts("I am parent");
        for(i=0; i<60; i++)
        {
            printf("parent: %d\n",i);
            sleep(1);
        }
    }
    else if(pid == 0)
    {
        signal(SIGCHLD, SIG_DFL); // child restores the default SIGCHLD disposition
        puts("I am child");
        exit(0);
    }
    return 0;
}

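When the program ran, no zombie process appeared. So the parent's signal(SIGCHLD, SIG_IGN) does not take effect by changing the child's behavior; at the very least, the child's own SIGCHLD disposition plays no part when it calls exit().

So what is the real reason?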