Linux kernel hung task detection: mechanism and troubleshooting

mick.li

Last modified: 2023-11-10 09:36:51

First published: 2023-11-10 09:09:23

Article link: https://blog.csdn.net/kkqier/article/details/134320084

All code analysis in this article is based on linux-5.4.18.

1 hung task

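The Linux kernel's hung task mechanism checks whether any task has stayed in the D state (TASK_UNINTERRUPTIBLE, uninterruptible sleep) for a long time.

If a task in the D state is found not to have been scheduled for more than 120 s (the kernel default, which can be changed), the kernel treats it as a hung task and prints a warning. A critical task stuck in this state for a long time can make the system misbehave.

Hung tasks show up most often around I/O. For example, writing cached data in memory back to disk takes too long and a hung task is reported.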

1.1 How it works

Hung task detection in Linux is carried out by the kernel thread khungtaskd.

1. The khungtaskd kernel thread is created during kernel boot. The relevant code is:

kernel/hung_task.c
static int __init hung_task_init(void)
{
    atomic_notifier_chain_register(&panic_notifier_list, &panic_block);

    /* Disable hung task detector on suspend */
    pm_notifier(hungtask_pm_notify, 0);

    /* Create the hung task detection kernel thread khungtaskd; its body is the watchdog() function */
    watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

    return 0;
}
subsys_initcall(hung_task_init);

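2. watchdog(), the thread function of khungtaskd, does the following:

1) Read the hung task timeout and the check interval

2) Determine whether the time since the last check exceeds the check interval

3) If it does, run the hung task check

4) After the check and any follow-up actions, sleep on a timer until the next wakeup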

kernel/hung_task.c
static int watchdog(void *dummy)
{
    unsigned long hung_last_checked = jiffies;

    set_user_nice(current, 0);

    for ( ; ; ) {
        /*
         * Read the hung task timeout and the check interval.
         * sysctl_hung_task_timeout_secs is initialized to
         * CONFIG_DEFAULT_HUNG_TASK_TIMEOUT, a kernel build option that defaults to 120 s.
         * Userspace can read or change it through the sysctl kernel.hung_task_timeout_secs
         * or the matching file /proc/sys/kernel/hung_task_timeout_secs.
         * sysctl_hung_task_check_interval_secs has no explicit initializer,
         * so as a global variable it is zero-initialized.
         */
        unsigned long timeout = sysctl_hung_task_timeout_secs;
        unsigned long interval = sysctl_hung_task_check_interval_secs;
        long t;

        if (interval == 0)
            interval = timeout;
        /*
         * Use the smaller of interval and timeout.
         * With the default interval of 0 it ends up equal to timeout (120 s).
         * If the user changes either value, the smaller one wins.
         */
        interval = min_t(unsigned long, interval, timeout);
        /*
         * If timeout is 0, so is interval, and t becomes MAX_SCHEDULE_TIMEOUT (LONG_MAX):
         * the check below never runs, which effectively disables hung task detection.
         * Otherwise t = hung_last_checked + interval * HZ - jiffies,
         * the time left until the next check is due.
         */
        t = hung_timeout_jiffies(hung_last_checked, interval);
        /* t <= 0: at least interval (120 s by default) has passed since the last check */
        if (t <= 0) {
            if (!atomic_xchg(&reset_hung_task, 0) &&
                !hung_detector_suspended)
                /* Run the hung task check */
                check_hung_uninterruptible_tasks(timeout);
            /* Record the time of this check */
            hung_last_checked = jiffies;
            continue;
        }
        /*
         * Set the current task to TASK_INTERRUPTIBLE, arm a timer for t jiffies,
         * then call schedule() to give up the CPU. The task leaves the run queue
         * and is woken when the timer expires. In effect, a sleep.
         */
        schedule_timeout_interruptible(t);
    }

    return 0;
}
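
The interval and sleep-time arithmetic in watchdog() can be tried out in userspace. The following is a minimal sketch, not kernel code: HZ is an assumed value (the real one is a kernel build option), the current tick is passed in as the parameter now instead of being read from jiffies, and effective_interval() is a hypothetical helper name that only groups the two interval statements from the loop above.

```c
#include <limits.h>

#define HZ 250UL                       /* assumed tick rate, for illustration only */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX  /* same value the kernel uses */

/*
 * Userspace model of hung_timeout_jiffies(): ticks left until the next
 * check is due. `now` stands in for the kernel's jiffies counter.
 */
long hung_timeout_jiffies(unsigned long last_checked,
                          unsigned long timeout, unsigned long now)
{
    /* A timeout of 0 disables the watchdog. */
    return timeout ? (long)(last_checked - now + timeout * HZ)
                   : MAX_SCHEDULE_TIMEOUT;
}

/* Hypothetical helper: how watchdog() derives the interval it really uses. */
unsigned long effective_interval(unsigned long interval,
                                 unsigned long timeout)
{
    if (interval == 0)
        interval = timeout;
    return interval < timeout ? interval : timeout;
}
```

With the defaults (interval 0, timeout 120) the effective interval is 120 s, and the returned value drops to 0 or below once 120 * HZ ticks have passed since the last check, which is the condition that triggers check_hung_uninterruptible_tasks().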

3. The hung task detection function check_hung_uninterruptible_tasks(timeout)

check_hung_uninterruptible_tasks() walks all tasks in the system. For each task in the TASK_UNINTERRUPTIBLE state it calls check_hung_task() to check whether that task is hung.
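
A rough userspace analogue of this scan is to look at the state field of each /proc/<pid>/stat file. The sketch below is not what the kernel does internally; it only shows how the state character can be pulled out of one such line. The function names stat_state() and is_uninterruptible() are made up for this example, and procfs may report 'D' for some tasks that the kernel check skips, so treat the result as an approximation.

```c
#include <string.h>

/*
 * Return the state character from one /proc/<pid>/stat line, or '\0' if
 * the line is malformed. The layout is "pid (comm) state ...", and comm
 * may itself contain spaces or ')' characters, so search for the LAST ')'.
 */
char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}

/* 'D' is how procfs reports uninterruptible sleep. */
int is_uninterruptible(const char *line)
{
    return stat_state(line) == 'D';
}
```

A caller would read the first line of each /proc/<pid>/stat with fopen() and fgets() and pass it to is_uninterruptible() to list the candidate tasks.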

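4. The per-task check function check_hung_task()

It decides whether a task is hung by comparing the task's context-switch count across checks: a task whose count has not changed for longer than timeout is considered hung.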

kernel/hung_task.c
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
    /*
     * Compute task t's context-switch count (number of times it was scheduled) at this check.
     * nvcsw: voluntary context switches; nivcsw: involuntary context switches
     */
    unsigned long switch_count = t->nvcsw + t->nivcsw;

    /*
     * Ensure the task is not frozen.
     * Also, skip vfork and any other user process that freezer should skip.
     */
    if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
        return;

    /*
     * When a freshly created task is scheduled once, changes its state to
     * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
     * mustn't be checked.
     */
    if (unlikely(!switch_count))
        return;

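    /*
     * Hung task check: compare t's current switch count with the count recorded at the previous check.
     * If they differ, t was scheduled between the two checks and is not hung:
     * update last_switch_count and last_switch_time, then return.
     * If they match, the time test below decides: 1) timeout not yet reached, return without doing anything;
     * 2) otherwise t is treated as hung and the hung task handling runs.
     */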
    if (switch_count != t->last_switch_count) {
        t->last_switch_count = switch_count;
        t->last_switch_time = jiffies;
        return;
    }
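    /*
     * Is the current time still before last_switch_time + timeout * HZ?
     * (last_switch_time is when the switch count was last seen to change.)
     * If so, the timeout has not expired: return. Otherwise t is considered hung.
     */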
    if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
        return;
    /* Emit the sched_process_hang trace event (visible through ftrace) */
    trace_sched_process_hang(t);
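    /*
     * sysctl_hung_task_panic decides whether a hung task leads to a panic, with related information printed.
     * Its default comes from the kernel build option CONFIG_BOOTPARAM_HUNG_TASK_PANIC.
     * Userspace can read or change it through the sysctl kernel.hung_task_panic
     * or the matching file /proc/sys/kernel/hung_task_panic.
     * It can also be set with the boot parameter "hung_task_panic=", parsed when the kernel starts.
     */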
    if (sysctl_hung_task_panic) {
        console_verbose();
        hung_task_show_lock = true;
        hung_task_call_panic = true;
    }

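    /*
     * Print the hung task warning.
     * sysctl_hung_task_warnings defaults to 10, so by default 10 warnings are printed.
     * As the code below shows, only a positive value is decremented, so a negative value never runs out.
     * It can be read or changed through the sysctl kernel.hung_task_warnings
     * or the matching file /proc/sys/kernel/hung_task_warnings.
     */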
    if (sysctl_hung_task_warnings) {
        if (sysctl_hung_task_warnings > 0)
            sysctl_hung_task_warnings--;
        pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
                       t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
        pr_err("      %s %s %.*s\n",
                    print_tainted(), init_utsname()->release,
                    (int)strcspn(init_utsname()->version, " "),
                    init_utsname()->version);
        pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
                    " disables this message.\n");
        /* Dump task t's state and kernel stack */
        sched_show_task(t);
        hung_task_show_lock = true;
    }


    touch_nmi_watchdog();
}
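
The decision logic of check_hung_task() can be modeled outside the kernel to see when a task starts to count as hung. This is a simplified sketch under stated assumptions: struct task_model and looks_hung() are hypothetical names that stand in for the few task_struct fields involved, HZ is an assumed tick rate, the current tick is passed in as now, and the freezer flag test and all reporting are left out.

```c
#include <stdbool.h>

#define HZ 250UL   /* assumed tick rate, for illustration only */

/* Hypothetical stand-in for the task_struct fields the check reads. */
struct task_model {
    unsigned long nvcsw;              /* voluntary context switches */
    unsigned long nivcsw;             /* involuntary context switches */
    unsigned long last_switch_count;  /* count recorded at an earlier check */
    unsigned long last_switch_time;   /* tick at which the count last changed */
};

/* Returns true when the task would be reported as hung at tick `now`. */
bool looks_hung(struct task_model *t, unsigned long timeout, unsigned long now)
{
    unsigned long switch_count = t->nvcsw + t->nivcsw;

    if (switch_count == 0)            /* never switched out: skip it */
        return false;

    if (switch_count != t->last_switch_count) {
        /* The task ran since the previous check: restart its clock. */
        t->last_switch_count = switch_count;
        t->last_switch_time = now;
        return false;
    }

    /* Wraparound-safe form of "now < last_switch_time + timeout * HZ". */
    if ((long)(now - (t->last_switch_time + timeout * HZ)) < 0)
        return false;

    return true;
}
```

Note that the first call for a task only records its switch count and time. The task is reported on a later call, once the count has stayed the same for at least timeout * HZ ticks, which is why a task can be blocked for somewhat longer than the timeout before the warning appears.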