Watchdog timer (overall)

[Figure: a Mars Exploration Rover. Watchdog timers are essential in remote, automated systems such as this one.]
A watchdog timer (sometimes called a computer operating properly or COP timer, or simply a watchdog) is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer to prevent it from elapsing, or “timing out”. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective action or actions. The corrective actions typically include placing the computer system in a safe state and restoring normal system operation.
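
The kick-or-timeout cycle described above can be modelled in a few lines. This is only a software sketch under stated assumptions, not real watchdog hardware: the class name `SimulatedWatchdog`, its methods, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are all invented for illustration.

```python
class SimulatedWatchdog:
    """Software model of a single-stage watchdog timer (illustrative names)."""

    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s    # interval before the timer elapses
        self.on_timeout = on_timeout  # corrective action run on timeout
        self.deadline = None          # absolute time at which the timer elapses
        self.expired = False

    def start(self, now):
        self.deadline = now + self.timeout_s
        self.expired = False

    def kick(self, now):
        """Restart the count; ignored once the timer has already elapsed."""
        if self.deadline is not None and not self.expired:
            self.deadline = now + self.timeout_s

    def tick(self, now):
        """Advance simulated time; fire the timeout signal once if overdue."""
        if self.deadline is not None and not self.expired and now >= self.deadline:
            self.expired = True
            self.on_timeout()


events = []
wdt = SimulatedWatchdog(timeout_s=1.0, on_timeout=lambda: events.append("timeout"))
wdt.start(now=0.0)
for t in (0.5, 1.0, 1.5):  # healthy program: kicks every 0.5 s
    wdt.kick(now=t)
    wdt.tick(now=t)
# The program hangs after t = 1.5 and stops kicking.
wdt.tick(now=2.0)  # deadline is 2.5, so nothing happens yet
wdt.tick(now=2.6)  # deadline has passed, the timeout action runs
```

After the last call, `events` holds a single `"timeout"` entry: regular kicks kept pushing the deadline forward, and the timeout fired only once the kicks stopped.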

Watchdog timers are commonly found in embedded systems and other computer-controlled equipment where humans cannot easily access the equipment or would be unable to react to faults in a timely manner. In such systems, the computer cannot depend on a human to invoke a reboot if it hangs; it must be self-reliant. For example, remote embedded systems such as space probes are not physically accessible to human operators; these could become permanently disabled if they were unable to autonomously recover from faults. A watchdog timer is usually employed in cases like these. Watchdog timers may also be used when running untrusted code in a sandbox, to limit the CPU time available to the code and thus prevent some types of denial-of-service attacks.[1]

1. Watchdog reset
In computers that are running operating systems, watchdog resets are usually invoked through a device driver. For example, in the Linux operating system, a user space program will kick the watchdog by interacting with the watchdog device driver, typically by writing a zero character to /dev/watchdog. The device driver, which serves to abstract the watchdog hardware from user space programs, is also used to configure the time-out period and start and stop the timer.
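
A user space kick loop on Linux might look roughly like the sketch below. The function name `kick_loop` and its parameters are made up for illustration, and the 10-second interval is an assumption that must be shorter than whatever time-out the driver is configured with. Note that on a real system, opening /dev/watchdog starts the timer, so this should only be run where a reset is acceptable.

```python
import time


def kick_loop(device_path="/dev/watchdog", interval_s=10.0, kicks=3, sleep=time.sleep):
    """Open the watchdog device and kick it `kicks` times, `interval_s` apart.

    All parameter names are illustrative. `sleep` is injectable so the loop
    can be exercised without actually waiting.
    """
    # buffering=0 so that every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(kicks):
            dev.write(b"\0")  # a write to the device counts as a kick
            sleep(interval_s)
        # "Magic close": on drivers that support it, writing 'V' just before
        # closing asks the driver to stop the timer rather than leave it running.
        dev.write(b"V")
```

A long-running daemon would loop indefinitely instead of a fixed number of times; the bounded loop here just keeps the sketch easy to try against an ordinary file.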

Single-stage watchdog
[Figure: single-stage watchdog diagram]

Multistage watchdog
[Figure: multistage watchdog diagram]

Time intervals
Watchdog timers may have either fixed or programmable time intervals. Some watchdog timers allow the time interval to be programmed by selecting from among a few selectable, discrete values. In others, the interval can be programmed to arbitrary values. Typically, watchdog time intervals range from ten milliseconds to a minute or more. In a multistage watchdog, each timer may have its own, unique time interval.
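
For a watchdog that only offers a few discrete intervals, software has to map the interval it wants onto one the hardware supports. The sketch below rounds up to the next supported value, which is one possible design choice (it never makes the timer elapse sooner than requested). The function name and the table of supported values are hypothetical and do not come from any particular chip.

```python
def select_interval(requested_ms, supported_ms):
    """Return the shortest supported interval that is >= requested_ms.

    Raises ValueError if the request is longer than every supported value.
    """
    candidates = [v for v in sorted(supported_ms) if v >= requested_ms]
    if not candidates:
        raise ValueError(f"no supported interval >= {requested_ms} ms")
    return candidates[0]


# Hypothetical table of selectable intervals, in milliseconds.
SUPPORTED_MS = (16, 32, 64, 125, 250, 500, 1000, 2000)

chosen = select_interval(100, SUPPORTED_MS)  # picks 125
```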

Corrective actions (very useful)
A watchdog timer may initiate any of several types of corrective action, including maskable interrupt, non-maskable interrupt, processor reset, fail-safe state activation, power cycling, or combinations of these. Depending on its architecture, the type of corrective action or actions that a watchdog can trigger may be fixed or programmable. Some computers (e.g., PC compatibles) require a pulsed signal to invoke a processor reset. In such cases, the watchdog typically triggers a processor reset by activating an internal or external pulse generator, which in turn creates the required reset pulses.[3]

In embedded systems and control systems, watchdog timers are often used to activate fail-safe circuitry. When activated, the fail-safe circuitry forces all control outputs to safe states (e.g., turns off motors, heaters, and high voltages) to prevent injuries and equipment damage while the fault persists. In a two-stage watchdog, the first timer is often used to activate fail-safe outputs and start the second timer stage; the second stage will reset the computer if the fault cannot be corrected before the timer elapses.
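
The two-stage behaviour can be sketched as a small state machine. Everything here is illustrative: the class and callback names are invented, time is passed in explicitly rather than read from a clock, and treating a kick during stage 2 as "fault corrected, return to stage 1" is an assumption about one reasonable design, not a description of any specific device.

```python
class TwoStageWatchdog:
    """Model of a two-stage watchdog: fail-safe first, reset second."""

    IDLE, STAGE1, STAGE2, RESET = range(4)

    def __init__(self, stage1_s, stage2_s, enter_safe_state, reset_computer):
        self.stage1_s = stage1_s
        self.stage2_s = stage2_s
        self.enter_safe_state = enter_safe_state  # e.g. turn off motors, heaters
        self.reset_computer = reset_computer
        self.stage = self.IDLE
        self.deadline = None

    def start(self, now):
        self.stage = self.STAGE1
        self.deadline = now + self.stage1_s

    def kick(self, now):
        # Assumption: a kick in either stage means the system has recovered.
        if self.stage in (self.STAGE1, self.STAGE2):
            self.stage = self.STAGE1
            self.deadline = now + self.stage1_s

    def tick(self, now):
        if self.stage == self.STAGE1 and now >= self.deadline:
            self.enter_safe_state()
            self.stage = self.STAGE2
            # The second timer starts when the first one elapses.
            self.deadline = self.deadline + self.stage2_s
        if self.stage == self.STAGE2 and now >= self.deadline:
            self.reset_computer()
            self.stage = self.RESET


actions = []
wdt2 = TwoStageWatchdog(1.0, 2.0,
                        enter_safe_state=lambda: actions.append("safe_state"),
                        reset_computer=lambda: actions.append("reset"))
wdt2.start(now=0.0)
wdt2.tick(now=1.2)  # stage 1 elapsed at 1.0: outputs forced to safe states
wdt2.tick(now=2.0)  # stage 2 deadline is 3.0, still waiting
wdt2.tick(now=3.1)  # fault never corrected: the computer is reset
```

If a kick had arrived between 1.2 and 3.0, the model would have returned to stage 1 and no reset would have been issued.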

Watchdog timers are sometimes used to trigger the recording of system state information (which may be useful during fault recovery[3]) or debug information (which may be useful for determining the cause of the fault) onto a persistent medium. In such cases, a second timer, started when the first timer elapses, is typically used to reset the computer later, after allowing sufficient time for data recording to complete. This allows time for the information to be saved, but ensures that the computer will be reset even if the recording process fails.
Fault detection
A computer system is typically designed so that its watchdog timer will be kicked only if the computer deems the system functional. The computer determines whether the system is functional by conducting one or more fault detection tests and it will kick the watchdog only if all tests have passed. In computers that are running an operating system and multiple processes, a single, simple test may be insufficient to guarantee normal operation, as it could fail to detect a subtle fault condition and therefore allow the watchdog to be kicked even though a fault condition exists.
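
The "kick only if every test passes" rule can be sketched as follows. The function name, the shape of the `checks` mapping, and the choice to treat a crashing check as a failure are assumptions made for illustration.

```python
def kick_if_healthy(checks, kick):
    """Run every check and kick the watchdog only if all of them pass.

    `checks` maps a name to a zero-argument callable returning True or False.
    Returns the names of the checks that failed (empty list means kicked).
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes is treated as a failed test
        if not ok:
            failed.append(name)
    if not failed:
        kick()
    return failed


kicks = []
healthy = {"memory": lambda: True, "heartbeat": lambda: True}
faulty = {"memory": lambda: True, "heartbeat": lambda: False}

first = kick_if_healthy(healthy, lambda: kicks.append("kick"))   # kicks
second = kick_if_healthy(faulty, lambda: kicks.append("kick"))   # withholds the kick
```

Withholding the kick is all the software has to do on failure; the timer then elapses by itself and triggers the corrective action.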

For example, in the case of the Linux operating system, a user-space watchdog daemon may simply kick the watchdog periodically without performing any tests. As long as the daemon runs normally, the system will be protected against serious system crashes such as a kernel panic. To detect less severe faults, the daemon[4] can be configured to perform tests that cover resource availability (e.g., sufficient memory and file handles, reasonable CPU time), evidence of expected process activity (e.g., system daemons running, specific files being present or updated), overheating, and network activity, and system-specific test scripts or programs may also be run.[5]

Upon discovery of a failed test, the Linux watchdog daemon may attempt to perform a software-initiated restart, which can be preferable to a hardware reset as the file systems will be safely unmounted and fault information will be logged. However, it is essential to have the insurance of the hardware timer, as a software restart can fail under a number of fault conditions. In effect, this is a dual-stage watchdog with the software restart comprising the first stage and the hardware reset the second stage.

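Problems:

  1. The WDT timeout is too long, so the state at the moment the fault first occurred cannot be captured.
  2. The CPU is stuck, so the software watchdog never takes effect.
  3. The hardware WDT fires, but the cache was not flushed, so the dumped data is unreliable.
  4. The clock is unstable, so normal workloads trigger the WDT.

Common WDT triggers:
1. A task occupies the CPU for too long:
deadlock, infinite loop, more data than expected, etc.
2. A hardware fault leaves the CPU stuck, which triggers the hardware WDT.

Maintenance:
1. Keep a record of task switches and task run times.
2. Keep a record of watchdog kicks.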
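
A kick record can be as simple as a small ring buffer of timestamps. The sketch below is one hypothetical way to keep it; the class name `KickLog`, the capacity, and the `(time, task)` entry layout are all assumptions. After a watchdog reset, the last entries show when kicking stopped and which task kicked last.

```python
from collections import deque


class KickLog:
    """Keeps the most recent watchdog kicks for post-mortem analysis."""

    def __init__(self, capacity=8):
        # deque with maxlen drops the oldest entry automatically when full.
        self.entries = deque(maxlen=capacity)

    def record(self, now, task):
        self.entries.append((now, task))

    def longest_gap(self):
        """Largest interval between consecutive recorded kicks (0.0 if < 2)."""
        times = [t for t, _ in self.entries]
        return max((b - a for a, b in zip(times, times[1:])), default=0.0)


log = KickLog(capacity=3)
for t in (0.0, 1.0, 1.5, 4.0):
    log.record(now=t, task="main")
gap = log.longest_gap()  # the 1.5 -> 4.0 gap stands out
```

Comparing the longest gap against the configured timeout gives a quick indication of how close normal operation runs to a spurious reset.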

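Kicking the watchdog (resetting the watchdog timer) is the usual way to keep the watchdog timer from timing out and resetting the system. While the system is running normally, kicking the watchdog periodically restarts the timer's count. The exact method varies between systems and hardware, but there are two general approaches:
1. Hardware: some systems and microcontrollers provide a dedicated pin or register for kicking the watchdog. Writing the register, or driving the pin to a specific state, resets the watchdog timer. See the documentation for the chip or board support package in use.
2. Software: where there is no such hardware support, the programming environment usually provides an API or function that resets the watchdog timer. Call it at suitable points so that the timer does not expire. See the documentation for the language and development environment in use.

The frequency and placement of kicks matter for system stability. Kicking too often can reduce system performance, while kicking too rarely may fail to prevent a timeout. Adjust both according to how the system runs and how the watchdog timer is configured.

Kicking the watchdog only prevents a watchdog reset. If some other problem is causing watchdog resets, the root cause still has to be investigated and fixed.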
