Avoiding HP Server Resets due to IPMI Watchdog Timeout (Doc ID 2379256.1)

APPLIES TO:

BNS Platform Hardware - Version UDR 10.2 and later
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 10.2 and later
Information in this document applies to any platform.

GOAL

Help customers and support teams increase the IPMI watchdog timeout value on HP Gen8 and Gen9 servers so that they can avoid system resets associated with iLO congestion.

SOLUTION

TPD implements a hardware watchdog by setting up two timers. Only the hardware watchdog is running; the software timer exists solely to keep the hardware timer from expiring.

  • A hardware timer in the BMC (Baseboard Management Controller) with a timeout of 120 seconds
  • A software timer (watchdog daemon) which will write to the /dev/watchdog device every 5 seconds

The TKLCwatchdog service initializes these timers by loading the ipmi_watchdog kernel module with a timeout of 120 seconds and then starting the software watchdog daemon, which writes to the /dev/watchdog device every 5 seconds.

On HP servers, the IPMI BMC functionality is emulated by the iLO. It is the iLO that resets the server when the 120-second timer expires.

Each write to /dev/watchdog results in IPMI command 22 (Reset Watchdog Timer, as defined in the IPMI specification) being issued to the iLO.

HP Server Resets due to IPMI watchdog timeout

HP servers have been reported to reset at random due to watchdog timeouts. It is believed that the IPMI Reset Watchdog Timer commands are not processed because of congestion on the iLO, which leads to expiration of the IPMI timer and a subsequent reboot of the server.

Logs from the servers show the following messages that point to the IPMI watchdog reset commands not being serviced:

  • kernel: IPMI Watchdog: response: Error c0 on cmd 22 (Completion code c0 indicates 'Node Busy. Command could not be processed because command processing resources are temporarily unavailable.')
  • kernel: IPMI Watchdog: response: Error ff on cmd 22 (Completion code ff indicates 'Unspecified error.')
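To check whether a server is affected, the messages above can be counted per completion code. The following is a minimal sketch, not a tool shipped with TPD: the function name count_watchdog_errors is hypothetical, and /var/log/messages is only an assumed default location for kernel messages.

```shell
#!/bin/sh
# Hypothetical helper: count IPMI watchdog reset failures per completion code.
# $1 is the log file to scan; /var/log/messages is an assumed default.
count_watchdog_errors() {
    logfile="${1:-/var/log/messages}"
    # Extract the hex completion code from lines of the form
    # "IPMI Watchdog: response: Error <code> on cmd 22",
    # then print one "<code> <count>" pair per line.
    sed -n 's/.*IPMI Watchdog: response: Error \([0-9a-fA-F][0-9a-fA-F]*\) on cmd 22.*/\1/p' "$logfile" \
        | sort | uniq -c | awk '{ printf "%s %s\n", $2, $1 }'
}
```

Calling count_watchdog_errors with a log file prints one line per completion code seen, with the code first and the number of occurrences second. No output means no matching messages were found in that file.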

Investigations point to temporary congestion on the iLO: PCI devices (NICs, for example) send an overload of sensor data, which slows the iLO down to the point where the watchdog timer resets are not processed.

HP recommends a longer IPMI timeout which will allow the congestion to be cleared so that the watchdog resets will be serviced. Alternatively, the HP Advanced Service Recovery Daemon can be used in place of the IPMI hardware watchdog, as this mechanism does not use IPMI.  However, the use of this Recovery Daemon is not compatible with TPD versions below 7.6.
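The benefit of a longer timeout can be put in rough numbers. With the daemon writing every 5 seconds, the timeout divided by the write interval gives approximately how many consecutive reset attempts fit inside one timeout window, and therefore how many can fail before the timer expires. The sketch below only does this arithmetic; the function name is hypothetical, and 300 seconds is the value this note verifies at the end of the procedure.

```shell
#!/bin/sh
# Hypothetical helper: approximate number of consecutive reset attempts
# that fit inside one watchdog timeout window.
# $1 = hardware timeout in seconds, $2 = write interval in seconds (default 5).
missed_resets_tolerated() {
    timeout_sec="$1"
    interval_sec="${2:-5}"
    echo $(( timeout_sec / interval_sec ))
}
```

For example, missed_resets_tolerated 120 prints 24 and missed_resets_tolerated 300 prints 60, so the longer timeout gives the iLO roughly two and a half times as many chances to service a reset before the server is rebooted.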

Steps to extend IPMI watchdog timeout

  1. Determine the hardware ID of the server by executing the following command:

         # hardwareInfo | grep ID

  2. Create a new watchdogSetup file for this hardware ID from the G7 watchdogSetup file, which has a longer timeout value defined, by executing the following command:

         # cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-<hardwareID>

     Example for hardware ID ProLiantBL460cGen8:

         # cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL460cGen8

  3. Restart the TKLCwatchdog service by executing the following command:

         # service TKLCwatchdog restart

  4. Validate the watchdog timer value after boot by executing the following command and verifying that the output indicates 'Initial Countdown: 300 sec':

         # ipmitool mc watchdog get

Example output:
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 300 sec
Present Countdown: 298 sec
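
The check in step 4 can also be scripted. The following is a minimal sketch that reads the output of ipmitool mc watchdog get on standard input and compares the Initial Countdown value with the expected one. The function name check_initial_countdown is hypothetical, and the parsing assumes the output format shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper: verify the Initial Countdown reported by
# `ipmitool mc watchdog get` (read from stdin).
# $1 = expected value in seconds (default 300).
check_initial_countdown() {
    expected="${1:-300}"
    # Pull the number out of a line such as "Initial Countdown: 300 sec".
    actual=$(sed -n 's/^[[:space:]]*Initial Countdown:[[:space:]]*\([0-9][0-9]*\) sec.*/\1/p')
    if [ "$actual" = "$expected" ]; then
        echo "OK: Initial Countdown is ${actual} sec"
    else
        echo "MISMATCH: expected ${expected} sec, got '${actual}'" >&2
        return 1
    fi
}
```

A typical invocation would be ipmitool mc watchdog get | check_initial_countdown 300. The function returns a non-zero status when the value differs or when no Initial Countdown line is found.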

Notes:

  1. The watchdog timer default for ProLiantBL685cG7 is already set to 300 seconds, so the timer does not need to be changed on that hardware.

  2. The new file will survive upgrades.

  3. If a server is replaced due to a system fault, this procedure must be run again after the new system is restored.

  4. This procedure applies to TPD releases 6.5.1-82.28.0 through 7.5.0.0.0.0-88.46.0.
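
Because the procedure has to be repeated after a server replacement, the copy in step 2 may be worth wrapping in a small helper. The following is a sketch under stated assumptions, not part of TPD: the function name install_watchdog_setup and the WATCHDOG_DIR variable are hypothetical, and the hardware ID is passed in by hand because the exact output format of hardwareInfo is not shown in this note.

```shell
#!/bin/sh
# Hypothetical helper wrapping the copy from step 2.
# WATCHDOG_DIR may be overridden for testing; the default is the path used in this note.
WATCHDOG_DIR="${WATCHDOG_DIR:-/usr/TKLC/plat/etc/hardware/watchdog}"
G7_ID="ProLiantBL685cG7"

# $1 = hardware ID as reported by `hardwareInfo | grep ID`, for example ProLiantBL460cGen8.
install_watchdog_setup() {
    hw_id="$1"
    if [ -z "$hw_id" ]; then
        echo "usage: install_watchdog_setup <hardwareID>" >&2
        return 2
    fi
    src="${WATCHDOG_DIR}/watchdogSetup-${G7_ID}"
    dst="${WATCHDOG_DIR}/watchdogSetup-${hw_id}"
    # Per note 1, the G7 hardware already uses the 300-second value.
    if [ "$hw_id" = "$G7_ID" ]; then
        echo "skip: ${hw_id} already uses the 300-second default"
        return 0
    fi
    if [ ! -f "$src" ]; then
        echo "missing source file: $src" >&2
        return 1
    fi
    cp "$src" "$dst" && echo "created $dst"
}
```

After install_watchdog_setup ProLiantBL460cGen8 succeeds, steps 3 and 4 (restarting TKLCwatchdog and validating the timer) still need to be carried out as described above.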

Reposted from https://support.oracle.com/epmos/faces/DocumentDisplay?id=2379256.1&_adf.ctrl-state=n9vi664ae_4&_afrLoop=361229577471965
