Linux Kernel Notes: watchdog

Latest recommended article published on 2025-04-03 14:21:28

羌俊恩

Tags: watchdog softlockup

Original link: https://blog.csdn.net/yhb1047818384/article/details/70833825/

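1. Overview

In short, a watchdog is a mechanism that keeps the system running normally, or lets it escape from abnormal states such as infinite loops and deadlocks.

Watchdogs come in hardware and software forms. A hardware watchdog uses a timer circuit whose output is wired to the reset line. The program must clear the timer within a fixed interval (commonly called "feeding the dog"), so while the program works normally the timer never overflows and no reset signal is generated. If the program fails and does not clear the timer within the interval, the watchdog timer overflows, raises the reset signal, and restarts the system. A software watchdog works on the same principle, except that the hardware timer circuit is replaced by one of the processor's internal timers. This simplifies the hardware design but is less reliable than a hardware timer: for example, a fault in the internal timer itself cannot be detected.

There are two kinds of software watchdog: the ordinary soft watchdog, which detects soft lockups (based on the timer interrupt), and the NMI watchdog, which detects hard lockups (based on the NMI).

Note 1: the timer interrupt has lower priority than the NMI.
Note 2: a lockup means that some piece of kernel code hogs the CPU and does not let go. In severe cases a lockup makes the whole system unresponsive.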

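The Linux kernel has a debug option for this, CONFIG_HARDLOCKUP_DETECTOR (in kernels of the vintage shown here it depends on the umbrella option CONFIG_LOCKUP_DETECTOR). Once enabled, soft lockup and hard lockup detection are active in the kernel. The implementation lives in kernel/watchdog.c and revolves around three things: a kernel thread, the timer interrupt, and the NMI (non-maskable interrupt). These three have different priorities, in increasing order: kernel thread < timer interrupt < NMI. It is precisely this difference in priority that makes it possible to catch two kinds of runtime problems:

The system is stuck in kernel mode for more than 20 s for some reason, so that tasks cannot run (soft lockup)
The system is stuck in kernel mode for more than 10 s for some reason, so that interrupts cannot be serviced (hard lockup)

The only difference between a soft lockup and a hard lockup is that a hard lockup happens while the CPU has interrupts masked.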

A ‘softlockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds (see “Implementation” below for details), without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system will stay locked up. Alternatively, the kernel can be configured to panic; a sysctl, “kernel.softlockup_panic”, a kernel parameter, “softlockup_panic” (see “Documentation/kernel-parameters.txt” for details), and a compile option, “BOOTPARAM_SOFTLOCKUP_PANIC”, are provided for this.

A ‘hardlockup’ is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds (see “Implementation” below for details), without letting other interrupts have a chance to run. Similarly to the softlockup case, the current stack trace is displayed upon detection and the system will stay locked up unless the default behavior is changed, which can be done through a sysctl, ‘hardlockup_panic’, a compile time knob, “BOOTPARAM_HARDLOCKUP_PANIC”, and a kernel parameter, “nmi_watchdog”.

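2. Description

Soft watchdog: each CPU checks whether its threads are being scheduled normally.

The normal flow of the soft watchdog is as follows (assuming a trigger time of 20 s):

Possible causes of a soft watchdog trigger:

1. Hard interrupts are handled so frequently that there is no time left for normal scheduling
2. Softirqs are processed for a long time
3. On a non-preemptible kernel, a thread runs for a long time without triggering a reschedule
4. Any combination of the above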

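NMI watchdog:

Each CPU checks whether interrupts can still be delivered normally
When a CPU has had interrupts disabled for long enough, it is judged to be in a hard lockup

NMI detection flow:

Possible causes of an NMI watchdog trigger:
1. A single hard interrupt is handled for a long time
2. Code runs for a long time with local interrupts disabled

The NMI watchdog is also fed by a per-CPU hrtimer. To detect a hard lockup promptly, the check runs in NMI context, which has higher priority than ordinary interrupts.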

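Hardware watchdog:
Checks whether the CPUs as a whole are running normally
Any CPU can feed the hardware watchdog; if no core feeds it within a given time, the hardware watchdog resets the system

Hardware watchdog detection flow:

Possible causes of a hardware watchdog trigger:
1. All CPUs hang in a way that triggers neither the soft watchdog nor the NMI watchdog
2. There are hardware dependencies between CPUs: one CPU hangs, and the others stall on resources shared with it at the software level

The rest of this section analyzes how soft lockup and hard lockup detection are implemented, based on the kernel source kernel/watchdog.c.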

soft lockup:

1) Every CPU has a watchdog thread (named watchdog/0, watchdog/1, ...):

static struct smp_hotplug_thread watchdog_threads = {
    .store             = &softlockup_watchdog,
    .thread_should_run = watchdog_should_run,
    .thread_fn         = watchdog,
    .thread_comm       = "watchdog/%u",
    .setup             = watchdog_enable,
    .park              = watchdog_disable,
    .unpark            = watchdog_enable,
};

2) The thread periodically calls the watchdog function

static void __touch_watchdog(void)
{
    /* update the timestamp of the watchdog's last run */
    __this_cpu_write(watchdog_touch_ts, get_timestamp());
}

static void watchdog(unsigned int cpu)
{
    /* record the interrupt count seen so far: soft_lockup_hrtimer_cnt = hrtimer_interrupts */
    __this_cpu_write(soft_lockup_hrtimer_cnt,
                     __this_cpu_read(hrtimer_interrupts));
    __touch_watchdog();
}

3) Timer interrupt

static void watchdog_enable(unsigned int cpu)
{
    struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);

    /* kick off the timer for the hardlockup detector */
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer->function = watchdog_timer_fn;

    /* done here because hrtimer_start can only pin to smp_processor_id() */
    hrtimer_start(hrtimer, ns_to_ktime(sample_period),
                  HRTIMER_MODE_REL_PINNED);
}

The main job of this function is to initialize a high-resolution timer, whose callback wakes the watchdog thread that feeds the dog.

The hrtimer's handler function is:

static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
    // timestamp of the watchdog thread's last run
    unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
    struct pt_regs *regs = get_irq_regs();
    int duration;

    // bump hrtimer_interrupts before waking the watchdog kthread, so that the kthread updates its timestamp
    watchdog_interrupt_count();
    // wake the watchdog kthread so that it runs at the same frequency as the timer
    wake_up_process(__this_cpu_read(softlockup_watchdog));
    // re-arm the hrtimer for the next period
    hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
    …

    // check whether a soft lockup has occurred
    duration = is_softlockup(touch_ts);
    if (unlikely(duration)) {
        printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
               smp_processor_id(), duration,
               current->comm, task_pid_nr(current));
        print_modules();
        print_irqtrace_events(current);
        // dump registers and the stack
        if (regs)
            show_regs(regs);
        else
            dump_stack();

        if (softlockup_panic)
            panic("softlockup: hung tasks");
    }
    return HRTIMER_RESTART;
}
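// Check how long preemption has been disabled.
// The watchdog kthread is woken from the watchdog timer's interrupt
// context; when the interrupt returns, the kthread preempts the task
// currently running on the CPU. If preemption is disabled, that does
// not happen and the watchdog cannot update its timestamp. Once
// preemption has been off for longer than the threshold, the kernel
// considers a soft lockup to have occurred.
// Note: the soft lockup threshold is watchdog_thresh * 2 (20 s).

static int is_softlockup(unsigned long touch_ts)
{
    // current timestamp
    unsigned long now = get_timestamp();

    // the watchdog has not been scheduled within watchdog_thresh * 2
    if (time_after(now, touch_ts + get_softlockup_thresh()))
        return now - touch_ts;

    return 0;
}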

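The main tasks of the timer callback:

(1) Get the timestamp of the watchdog thread's last run
(2) Increment the watchdog timer's run count
(3) Check the watchdog timestamp to see whether a soft lockup has occurred (if so, dump the stack and print diagnostics)
(4) Re-arm the timer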
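
The interplay between the timer callback and the watchdog thread can be mimicked in user space. What follows is a minimal sketch, not kernel code: `struct cpu_model`, `model_thread_run()` and `model_timer_tick()` are hypothetical names, time is a plain counter in seconds, and the threshold is assumed to be 2 * watchdog_thresh as in the comments above.

```c
/* Hypothetical user-space model of one CPU's soft lockup state. */
struct cpu_model {
    unsigned long touch_ts;        /* last time the watchdog thread fed the dog */
    unsigned long hrtimer_irqs;    /* number of timer callbacks seen so far */
    unsigned int  watchdog_thresh; /* seconds; the soft threshold is twice this */
};

/* Models the watchdog thread body: record the current time. */
static void model_thread_run(struct cpu_model *c, unsigned long now)
{
    c->touch_ts = now;
}

/*
 * Models the timer callback: count the interrupt, then return how long
 * the thread has been starved, or 0 if it is still within the threshold.
 */
static unsigned long model_timer_tick(struct cpu_model *c, unsigned long now)
{
    c->hrtimer_irqs++;
    if (now > c->touch_ts + 2UL * c->watchdog_thresh)
        return now - c->touch_ts;
    return 0;
}
```

If the thread last ran at t = 4 and is then starved, ticks up to and including t = 24 stay silent, and the tick at t = 28 reports 24 seconds without a feed.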

hard lockup:

Hard lockups are detected mainly in the NMI handler.

1) Initialize and enable hard lockup detection

static int watchdog_nmi_enable(unsigned int cpu)
{
    // hard lockup event
    struct perf_event_attr *wd_attr;
    struct perf_event *event = per_cpu(watchdog_ev, cpu);
    …
    wd_attr = &wd_hw_attr;
    // hard lockup detection period, 10 s
    wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
    // register the hard lockup detection event with performance monitoring
    event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
    …
    // enable hard lockup detection
    per_cpu(watchdog_ev, cpu) = event;
    perf_event_enable(per_cpu(watchdog_ev, cpu));
    return 0;
}

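perf_event_create_kernel_counter() registers a hardware event.
On x86 this hardware is called performance monitoring; one of its features is to raise an NMI after a given number of CPU clock cycles have elapsed.

2) After the CPU has run at full load for watchdog_thresh seconds (10 by default), an NMI is raised and handled by watchdog_overflow_callback.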

static void watchdog_overflow_callback(struct perf_event *event,
                                       struct perf_sample_data *data,
                                       struct pt_regs *regs)
{
    // check whether a hard lockup has occurred
    if (is_hardlockup()) {
        int this_cpu = smp_processor_id();

        // report the hard lockup
        if (hardlockup_panic)
            panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
        else
            WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);

        return;
    }
    return;
}

Checking for a hard lockup:

static int is_hardlockup(void)
{
    // number of times the watchdog timer has run
    unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

    // if the watchdog timer has not run within one hard lockup period,
    // interrupts on this CPU have been masked for longer than the threshold
    if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
        return 1;

    // remember how many times the watchdog timer has run
    __this_cpu_write(hrtimer_interrupts_saved, hrint);
    return 0;
}

Disabling hard lockup detection:

static void watchdog_nmi_disable(unsigned int cpu)
{
    struct perf_event *event = per_cpu(watchdog_ev, cpu);

    if (event) {
        // disable the hard lockup event in the performance monitoring subsystem
        perf_event_disable(event);
        // clear the per-CPU hard lockup event pointer
        per_cpu(watchdog_ev, cpu) = NULL;
        // release the hard lockup event
        perf_event_release_kernel(event);
    }
    return;
}

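3. How soft lockup and hard lockup detection work

Lockup detection involves three mechanisms: the kernel watchdog thread, a high-resolution timer (timer interrupt), and an NMI driven by a perf event on the PMU hardware. The PMU is an important means of data collection that can do things software alone cannot, such as obtaining low-level hardware information. Each perf event is assigned a corresponding perf_event structure, and all operations on the event revolve around it. The NMI (Non-Maskable Interrupt) is used to notify the CPU of "catastrophic" events such as power loss, memory read/write errors, or bus parity errors. If a CPU is too busy to feed the watchdog in time, the system prints CPU lockup messages.

soft lockup: preemption is disabled for so long that other tasks cannot be scheduled. A kernel bug makes the code loop in kernel mode for more than 67 s (the exact figure depends on the implementation and configuration; the code above uses 2 * watchdog_thresh, 20 s by default), and other tasks get no chance to run. With this kind of bug the system is not completely dead, but some tasks (or kernel threads) are stuck in a certain state, usually inside the kernel.

hard lockup: caused by interrupts being disabled for a long time. A hard lockup happens when all interrupts on a CPU have been disabled for longer than a certain time (a few seconds), so that interrupts from external devices cannot be handled; the kernel then considers a hard lockup to have occurred. One possible cause is code that holds a lock taken with spin_lock_irqsave for too long.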

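3.1 How soft lockup detection works

Soft lockup detection checks whether the tasks on a CPU have become unschedulable. It works by creating on each CPU a per-CPU kernel thread with real-time FIFO priority 99 (which can generally be regarded as the highest-priority task in the system), named watchdog. This task is woken periodically by a high-resolution timer (hrtimer); once woken, the watchdog thread "feeds the dog", that is, writes the current timestamp into the variable watchdog_touch_ts. The same hrtimer also periodically checks whether the dog has been fed on this CPU, by testing whether the difference between the current timestamp and watchdog_touch_ts exceeds a threshold. The kernel obtains this threshold from get_softlockup_thresh(); the default is 20 seconds. If the hrtimer finds that watchdog_touch_ts lies further in the past than the threshold, a soft lockup is declared.

1) Soft lockup detection first registers a kernel thread called watchdog for every CPU core; in other words, the Linux kernel starts a monitoring task, the watchdog, on each CPU. They show up in ps -ef | grep watchdog as [watchdog/0], [watchdog/1], [watchdog/2], ... Their main job is to write the current CPU timestamp to watchdog_touch_ts. watchdog_enable registers a high-resolution timer, and the watchdog functionality is implemented in its timer interrupt handler.
2) The high-resolution timer (hrtimer) set up above fires periodically; its callback is watchdog_timer_fn(), which does three main things:
a. watchdog_interrupt_count() updates the hrtimer_interrupts variable (used for hard lockup detection)
b. wake_up_process() wakes the watchdog thread (which updates the timestamp)
c. is_softlockup() checks whether a soft lockup has occurred

The soft lockup detector checks the timestamp; if it has not been updated for longer than the soft lockup threshold, [watchdog/x] has had no chance to run, which means the CPU is being hogged: a soft lockup has occurred.

Note that the kernel thread [watchdog/x] exists only to update the timestamp, and that timestamp is the thing being watched. The real watchdog is watchdog_timer_fn(), triggered by the timer interrupt. [watchdog/x] is run by the scheduler, whereas watchdog_timer_fn() is run by an interrupt.

Many scenarios can trigger a soft lockup: a priority-99 FIFO task spinning in an infinite loop on a CPU, a misbehaving softirq that keeps the CPU in softirq processing for a long time, an infinite loop inside the kernel, and so on. In short, soft lockup detection relies on two objects: the watchdog thread and the hrtimer.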

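1) The hrtimer: its period is held in the kernel variable sample_period and defaults to 4 seconds. sample_period is derived from another kernel variable, watchdog_thresh, as sample_period = watchdog_thresh * 2 * (NSEC_PER_SEC / 5), in nanoseconds. watchdog_thresh defaults to 10 s and can be set through /proc/sys/kernel/watchdog_thresh. For example, after echo 15 > /proc/sys/kernel/watchdog_thresh, sample_period, and hence the timer period, becomes 6 seconds.

When the timer expires, the hrtimer's handler watchdog_timer_fn() does the following:
(1) wakes the watchdog thread;
(2) checks whether a soft lockup has occurred;
(3) handles the soft lockup if there is one.
Step (2) checks whether the time since the watchdog thread last fed the dog exceeds the threshold softlockup_thresh, 20 seconds by default. It is computed as softlockup_thresh = watchdog_thresh * 2. As noted above, watchdog_thresh defaults to 10 and can be changed through the /proc interface.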
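
The two formulas above can be checked with a few lines of arithmetic. This is a small sketch under the assumption that both values depend only on watchdog_thresh, as the text states; the function names `softlockup_thresh_secs()` and `sample_period_ns()` are made up for illustration and are not kernel symbols.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Soft lockup threshold in seconds: twice watchdog_thresh. */
static unsigned int softlockup_thresh_secs(unsigned int watchdog_thresh)
{
    return watchdog_thresh * 2;
}

/* hrtimer period in nanoseconds: one fifth of the soft lockup threshold. */
static uint64_t sample_period_ns(unsigned int watchdog_thresh)
{
    return (uint64_t)softlockup_thresh_secs(watchdog_thresh) * (NSEC_PER_SEC / 5);
}
```

With watchdog_thresh = 10 this gives a 20 s threshold and a 4 s period; with 15 it gives 30 s and 6 s. In other words, the timer fires five times per threshold window, so the watchdog thread gets several chances to run before a lockup is reported.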

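2) The watchdog kernel thread: the watchdog thread is a per-CPU thread created by the kernel. After creation it sleeps and waits for the hrtimer to wake it periodically. Once woken, it feeds the dog, that is, writes the current timestamp into the per-CPU variable watchdog_touch_ts. The thread runs periodically (some descriptions say once a second; in the code above it is once per sample_period) and sleeps the rest of the time. Each run records a timestamp in a kernel data structure that belongs to that CPU. Many interrupt paths in the kernel call the soft lockup check, which compares the current timestamp with the one stored for that CPU. If the current timestamp exceeds the stored one by more than the configured threshold, the check assumes that the watchdog thread has not run for a considerable time, which means something has been occupying the CPU for too long. The watchdog catches this and raises a soft lockup error. A soft lockup ties up the CPU and can make the system unusable.

Normally the watchdog thread is woken once every sample_period. Because it runs under the real-time FIFO policy at priority 99, it is the highest-priority task in the system, so it can be expected to run immediately after being woken (unless another priority-99 FIFO task is already running) and update watchdog_touch_ts. Under normal conditions, then, watchdog_touch_ts advances once every sample_period, and the step size equals the hrtimer period sample_period.

There are exceptions, however:

(1) The watchdog is not woken in time. In this case the hrtimer did not fire on time. Since the wakeup is done in the high-resolution timer interrupt, this does not happen unless the timer interrupt itself has a problem. Most often the cause is that local interrupts were disabled for a long time, which is normally left to the hard lockup mechanism to detect;

(2) The system spends a long time inside a softirq handler. Although the watchdog thread has a very high priority, it is still a task-level execution flow and ranks below softirq processing. If a softirq deadlocks or loops forever for some reason, the watchdog thread cannot run, so watchdog_touch_ts is not updated in time.

(3) A priority-99 FIFO task occupies the CPU continuously. The real-time FIFO policy schedules strictly by priority, and tasks of equal priority run first come, first served until they can no longer run. So if a FIFO task at priority 99 is already occupying the CPU, the watchdog cannot run even after it has been woken.

These are the three typical situations in which the watchdog cannot get the CPU in time, that is, the typical soft lockup cases. In each of them, a task or module with priority equal to or higher than FIFO 99 is hogging the CPU.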

When a soft lockup occurs, the kernel usually prints messages like the following:

BUG: soft lockup - CPU#%d stuck for 22s! [loop:1023]
kernel:BUG: soft lockup - CPU#0 stuck for 38s! [kworker/0:1:25758]
kernel:BUG: soft lockup - CPU#7 stuck for 36s! [java:16182]
......

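In other words, soft lockup handling mainly consists of printing warnings so that someone is notified. The handling flow has the following parts:

(1) Use soft_watchdog_warn to check whether the warning has already been printed; if so, do nothing further;

(2) Print the warning together with information about the current context;

(3) If the softlockup_all_cpu_backtrace variable is enabled, print stack backtraces for all other CPUs as well; this variable can be enabled or disabled through /proc/sys/kernel/softlockup_all_cpu_backtrace;

(4) If the softlockup_panic variable is enabled, panic. softlockup_panic can be changed through /proc/sys/kernel/softlockup_panic.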
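
The four steps can be condensed into a small decision function. This is an illustrative sketch only: the action bitmask, `struct softlockup_cfg` and `softlockup_actions()` are hypothetical names, and the real handler prints and panics directly rather than returning a mask.

```c
#include <stdbool.h>

/* Actions the model decides to take, as a bitmask (hypothetical names). */
enum {
    ACT_PRINT_WARNING = 1 << 0,
    ACT_ALL_CPU_BT    = 1 << 1,
    ACT_PANIC         = 1 << 2,
};

struct softlockup_cfg {
    bool all_cpu_backtrace; /* mirrors softlockup_all_cpu_backtrace */
    bool panic_on_lockup;   /* mirrors softlockup_panic */
};

/*
 * already_warned plays the role of the soft_watchdog_warn flag:
 * once a lockup has been reported, later ticks stay silent.
 */
static int softlockup_actions(const struct softlockup_cfg *cfg, bool *already_warned)
{
    int actions;

    if (*already_warned)
        return 0;
    *already_warned = true;

    actions = ACT_PRINT_WARNING;
    if (cfg->all_cpu_backtrace)
        actions |= ACT_ALL_CPU_BT;
    if (cfg->panic_on_lockup)
        actions |= ACT_PANIC;
    return actions;
}
```

The first call for a stuck CPU returns the full set of actions; repeated calls return 0 until the flag is cleared, which is why a long lockup does not flood the log from the same CPU.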

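Causes: soft lockups generally come down to two things, deadlocks and infinite loops.

When a soft lockup is reported, check whether the code involved contains an infinite loop, or whether it uses locks incorrectly and deadlocks. The following problems can also lead to soft lockups:

The kernel is busy, that is, it spends a long time processing work and the watchdog does not get to run;

The server's power supply is insufficient and the CPU voltage is unstable, which can lock up the CPU; or overclocking is enabled in the BIOS and the voltage becomes unstable when overclocked, which also tends to lock up the CPU;

The host running the virtual machine has an overloaded CPU or very high disk I/O; high disk utilization (around 100%) can also cause soft lockups;

The virtual machine itself has an overloaded CPU or very high disk I/O. In virtualized environments another possibility is overcommit introduced by the virtualization layer (especially memory overcommit or other excessive virtualization overhead), for example the hypervisor not scheduling a virtual CPU in time;

Bugs that appear once KVM is enabled in the BIOS; disabling it resolves the issue, but the physical machine then no longer supports virtualization. Alternatively, KVM itself has a bug;

The VM's network card driver has a bug that locks up the CPU when handling high traffic volumes;

The Linux kernel has a bug;

The tsc clock source is unstable on CentOS and on cloud Linux running under Hyper-V virtualization; setting clocksource=jiffies can resolve this;

Intel C-State is enabled in the BIOS; disabling it resolves the issue;

Spread spectrum is enabled in the BIOS. Setting this item to disabled can improve system performance and stability; enable it only when electromagnetic interference is a concern. If the CPU is overclocked, this item must be disabled, because even a tiny drift in the clock pulse can lock up an overclocked CPU.

In virtual machines, I have seen several soft lockups stuck in the middle of an IPI request, in particular in smp_call_function_many: [exception RIP: smp_call_function_many+514]. This is very likely a problem caused by vCPU scheduling.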

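The number of vCPUs exceeds the number of physical CPU cores;

The kernel parameter kernel.watchdog_thresh (/proc/sys/kernel/watchdog_thresh) defaults to 10; a message is printed once 2 * 10 seconds are exceeded. Note that the value cannot be set higher than 60. Raising it lengthens the time allowed for feeding the dog, but it does not solve the underlying problem; it only delays the message, so the root cause still has to be found. To help locate it, panic on soft lockup can be turned on by changing /proc/sys/kernel/softlockup_panic from its default 0 to 1.

sysctl -w kernel.watchdog_thresh=30   # or add kernel.watchdog_thresh=30 to /etc/sysctl.conf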
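
Before changing the value from a program, it can help to read the current setting and check a candidate against the limit mentioned above. The sketch below assumes the accepted range is 0 to 60 as stated in the text; `watchdog_thresh_is_valid()` and `read_sysctl_long()` are hypothetical helper names, not part of any library.

```c
#include <stdio.h>

/* Upper bound taken from the text above; treated here as an assumption. */
#define WATCHDOG_THRESH_MAX 60

/* Returns 1 if the value falls inside the assumed 0..60 range, else 0. */
static int watchdog_thresh_is_valid(long value)
{
    return value >= 0 && value <= WATCHDOG_THRESH_MAX;
}

/*
 * Reads one integer from a sysctl-style file such as
 * /proc/sys/kernel/watchdog_thresh. Returns -1 if it cannot be read.
 */
static long read_sysctl_long(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller would typically do `long cur = read_sysctl_long("/proc/sys/kernel/watchdog_thresh");` and refuse to write any new value for which `watchdog_thresh_is_valid()` returns 0.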

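3.2 How hard lockup detection works

1) watchdog_enable calls watchdog_nmi_enable to register a perf event based on the PMU hardware; after watchdog_thresh (/proc/sys/kernel/watchdog_thresh) seconds it triggers an NMI;

2) The NMI handler decides whether a hard lockup has occurred by checking whether hrtimer_interrupts has changed between two consecutive NMIs (the previous value is kept in hrtimer_interrupts_saved);

3) It then saves the interrupt count: hrtimer_interrupts_saved = hrtimer_interrupts. The NMI itself comes from the hardware that x86 calls performance monitoring, which can raise an NMI after a given number of CPU clock cycles.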
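
The counter comparison in steps 2) and 3) is easy to exercise outside the kernel. The following is a user-space sketch with hypothetical names (`struct hard_model`, `model_hrtimer_irq()`, `model_nmi_check()`); it mirrors only the logic of the check, not the NMI delivery.

```c
/* Hypothetical user-space model of the per-CPU counters used by the check. */
struct hard_model {
    unsigned long hrtimer_interrupts;       /* bumped by the timer callback */
    unsigned long hrtimer_interrupts_saved; /* snapshot taken at the last NMI */
};

/* Models the timer interrupt; it cannot run while interrupts are masked. */
static void model_hrtimer_irq(struct hard_model *m)
{
    m->hrtimer_interrupts++;
}

/* Models the NMI callback: returns 1 if no timer interrupt ran since the last NMI. */
static int model_nmi_check(struct hard_model *m)
{
    if (m->hrtimer_interrupts_saved == m->hrtimer_interrupts)
        return 1;
    m->hrtimer_interrupts_saved = m->hrtimer_interrupts;
    return 0;
}
```

Two timer interrupts followed by an NMI report no lockup; a second NMI with no timer interrupt in between reports one, which corresponds to the CPU having kept interrupts masked for a whole NMI period.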

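The perf event kernel framework

Each event is assigned a corresponding perf_event structure, and all operations on the event revolve around it:

Once the perf_event_open system call has allocated a perf_event, it returns a file descriptor fd, so the perf_event can be operated on through the generic file interfaces read/write/ioctl/mmap.

perf_event provides two kinds of trace data: count and sample. count only records how many times the event occurred, while sample records much more information (for example IP, ADDR, TID, TIME, CPU, BT). To use sampling, ring buffer space must be allocated for the perf_event and mapped into user space with mmap.

perf gives every event its own perf_event data structure, with its own attr and pmu. That per-event bookkeeping has a cost, but perf was never designed to have hundreds or thousands of events in use at the same time; only a handful of events are picked out for focused debugging. perf organizes perf_event structures along the CPU dimension and the task dimension.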
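
From user space, the count mode described above can be tried with the perf_event_open system call. The sketch below is an assumption-laden illustration: it counts user-space CPU cycles for the calling thread rather than reproducing the kernel watchdog's event, the helper names `init_cycles_attr()` and `open_cycles_counter()` are made up, and the call may fail with -1 depending on the system's perf permissions.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fills attr for a plain CPU-cycle counter that starts disabled. */
static void init_cycles_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->disabled = 1;
    attr->exclude_kernel = 1; /* count user-space cycles only */
    attr->exclude_hv = 1;
}

/* glibc provides no wrapper for perf_event_open, so go through syscall(2). */
static long open_cycles_counter(struct perf_event_attr *attr)
{
    /* pid = 0, cpu = -1: the calling thread, on whichever CPU it runs. */
    return syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}
```

When the call succeeds, the returned descriptor is driven through the generic file interface mentioned above: ioctl with PERF_EVENT_IOC_RESET and PERF_EVENT_IOC_ENABLE starts counting, PERF_EVENT_IOC_DISABLE stops it, and read() of a 64-bit value returns the count.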
