The hangcheck-timer I/O Fencing Module for Oracle 9i/10g/11gR1 on Linux

Original post, 2012-07-23 00:16:22




Linux: Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11gR1 RAC [ID 726833.1]


The hangcheck-timer module is required to run a supported configuration in Oracle Real Application Clusters environments on Linux with Oracle releases 9i, 10g, or 11gR1 RAC. This note identifies and outlines the requirements for configuring hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.

Note: hangcheck-timer is not required starting with Oracle Clusterware 11gR2, so it no longer needs to be configured for 11gR2 RAC.


Starting in release [...] and later, Oracle RAC environments required a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.


Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the delay exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.



Hangcheck-timer requires three configuration parameters:


(1)    hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.

(2)    hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.


(3)    hangcheck_reboot - determines whether the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version: in the 2.4 kernel the default is 1, and in 2.6 kernels the default is 0.


When hangcheck_reboot=1 and the following condition holds, hangcheck-timer reboots the system: system hang time > (hangcheck_tick + hangcheck_margin)
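The rule above can be sketched as a small shell function. This is only an illustration of the decision as described in this note, not the kernel module's code; the function name hangcheck_outcome and its argument order are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of hangcheck-timer itself). It mirrors the rule:
#   reboot if hangcheck_reboot >= 1 and hang time > hangcheck_tick + hangcheck_margin
# Arguments: tick margin reboot hang_seconds (whole seconds).
hangcheck_outcome() (
    tick=$1 margin=$2 reboot=$3 hang=$4
    limit=$((tick + margin))
    if [ "$hang" -le "$limit" ]; then
        echo "ok"          # delay is within the allowed window
    elif [ "$reboot" -ge 1 ]; then
        echo "reboot"      # hang detected and rebooting is enabled
    else
        echo "log-only"    # hang detected, but hangcheck_reboot=0
    fi
)

hangcheck_outcome 1 10 1 15    # prints: reboot
```

With tick=1 and margin=10 the window is 11 seconds, so a 15-second hang is past the limit, while an 11-second delay would still count as healthy.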



All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows:



(1) 9i: Assuming the default setting of "oracm misscount" is 220 seconds:

hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1

(2) 10g/11gR1: Assuming the default setting of "CSS misscount" is either 30 or 60 seconds:

hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1


You must always ensure that the cluster misscount setting is greater than the sum of hangcheck_tick + hangcheck_margin.
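A quick way to sanity-check a planned configuration against this rule is a one-line comparison. The sketch below uses a hypothetical function name, misscount_ok, and takes the values as plain arguments; it does not read them from the cluster.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if misscount > hangcheck_tick + hangcheck_margin.
# Arguments: misscount tick margin (whole seconds).
misscount_ok() (
    misscount=$1 tick=$2 margin=$3
    [ "$misscount" -gt $((tick + margin)) ]
)

# The two recommended combinations from this note both pass.
if misscount_ok 220 30 180; then echo "9i (220 vs 30+180): ok"; fi
if misscount_ok 30 1 10; then echo "10g/11gR1 (30 vs 1+10): ok"; fi

# Reusing the 9i module settings with a 30-second misscount would not.
if ! misscount_ok 30 30 180; then echo "30 vs 30+180: misscount too small"; fi
```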


When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node, because the module provides the I/O fencing that ensures no stray writes occur from an evicted node in a RAC cluster. To verify that the hangcheck-timer module is running on a node, execute the following as the root or oracle user:


# /sbin/lsmod | grep hangcheck

hangcheck-timer         2672   0


If the hangcheck-timer module is loaded (running), you will see output similar to the above. When hangcheck-timer is not loaded, no output is generated and the command prompt is returned to the user.
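For scripted checks, the same test can be wrapped in a function that reads lsmod output. This is a minimal sketch with a hypothetical function name, hangcheck_loaded. It accepts the module name with either a hyphen or an underscore, since both spellings appear in the lsmod output shown in this post.

```shell
#!/bin/sh
# Reads lsmod output on stdin; succeeds if a hangcheck-timer entry is present.
# The module name may be printed with a hyphen or an underscore.
hangcheck_loaded() {
    grep -Eq '^hangcheck[-_]timer[[:space:]]'
}

# A sample line from this post stands in for a live /sbin/lsmod call.
if printf 'hangcheck-timer         2672   0\n' | hangcheck_loaded; then
    echo "hangcheck-timer is loaded"
else
    echo "hangcheck-timer is NOT loaded"
fi
```

On a real node the input would come from the command itself, for example: /sbin/lsmod | hangcheck_loaded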


In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:


# modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1


To ensure the module is loaded at boot time, you should also place the same command in the appropriate local command execution file (e.g. /etc/rc.d/rc.local or /etc/init.d/boot.local). In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release-specific documentation to determine which initialization method is required.
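When this step is scripted, it helps to add the command only if it is not already present, so repeated runs do not duplicate it. The sketch below uses a hypothetical helper, ensure_line, and demonstrates it on a scratch file rather than the real rc.local.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not already there.
ensure_line() (
    file=$1 line=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
)

# Demonstrated on a scratch file. On a real node the target would be
# /etc/rc.d/rc.local or /etc/init.d/boot.local, edited as root.
rc_file=$(mktemp)
cmd='modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1'
ensure_line "$rc_file" "$cmd"
ensure_line "$rc_file" "$cmd"    # second call finds the line and does nothing
cat "$rc_file"
rm -f "$rc_file"
```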


Hangcheck-timer logs to the system messages log when a failure is detected and a node restart is initiated by the module:


(1)    When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages. The module's startup messages are recorded in the same log.

(2)    If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed because hangcheck_reboot was not set to 1. In that case you must reload the hangcheck module as described earlier in this note, with hangcheck_reboot set to 1.
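To see at a glance which of these two messages a node has logged, the log can be filtered with awk. This is a minimal sketch; the function name hangcheck_log_summary is made up, and the two input lines are invented samples built around the message texts quoted above, not real log output.

```shell
#!/bin/sh
# Reads syslog lines on stdin and counts the two Hangcheck messages quoted above.
hangcheck_log_summary() {
    awk '
        /Hangcheck: hangcheck is restarting the machine/ { restarts++ }
        /Hangcheck: hangcheck value past margin!/        { missed++ }
        END { printf "restarts=%d missed_reboots=%d\n", restarts, missed }
    '
}

# Invented sample lines (hypothetical timestamps) in place of /var/log/messages.
printf '%s\n' \
    'Sep  7 20:01:10 rac1 kernel: Hangcheck: hangcheck value past margin!' \
    'Sep  7 20:05:42 rac1 kernel: Hangcheck: hangcheck is restarting the machine.' |
    hangcheck_log_summary
# prints: restarts=1 missed_reboots=1
```

A non-zero missed_reboots count points at the hangcheck_reboot setting, as item (2) explains.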



Known issue: Bug 6125546 can prevent hangcheck-timer from rebooting the node on RHEL4 (fixed in [...] or RHEL4.6).




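Hangcheck-timer is a kernel-level I/O fencing module provided by Linux. It monitors the state of the Linux kernel and automatically reboots the system if the kernel hangs for a long time. The module runs in kernel space, so it is not affected by system load. It uses the CPU's Time Stamp Counter (TSC) register, whose value increases on every clock cycle; because this is hardware time, it is more precise.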

Two parameters control the timing of this module: hangcheck_tick and hangcheck_margin.


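hangcheck_tick defines how often the check runs; the note above gives its default as 60 seconds. Because the kernel itself may be busy and delay the check, the module also lets you define an upper bound on that delay, hangcheck_margin, whose default is given as 180 seconds.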


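The hangcheck-timer module checks the kernel periodically according to hangcheck_tick. As long as the interval between two checks is less than hangcheck_tick + hangcheck_margin, the kernel is considered healthy. Otherwise the kernel is considered to be misbehaving and the module reboots the system (provided hangcheck_reboot is set to 1).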


CRS itself has a MissCount parameter, which can be viewed with the command crsctl get css misscount.


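When heartbeats between RAC nodes are lost, Clusterware must make sure the failed node is really dead before it reconfigures the cluster. Otherwise, if a node only lost its heartbeat because of a temporary load spike and the other nodes start reconfiguring while that node has not been rebooted, the database can be corrupted. For this reason MissCount must be greater than the sum of hangcheck_tick + hangcheck_margin.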


2.1 Checking that the hangcheck-timer.ko module is installed

hangcheck-timer is installed by default in Linux kernel 2.4.9-e.12 and later. Use the following command to check whether it is installed.


[root@rac1 ~]# find /lib/modules -name "hangcheck-timer.ko"





2.2 Configuring the hangcheck-timer module

Set the hangcheck-timer parameters by adding a line to /etc/modprobe.conf. The values depend on the database release.

(1) 9i: assuming the default "oracm misscount" of 220 seconds, use hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1

(2) 10g/11gR1: assuming a "CSS misscount" of 30 or 60 seconds, use hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1



[root@rac1 ~]# vi /etc/modprobe.conf

options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
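The options line can also be generated from the release name, which avoids typing mistakes such as a missing space between parameters. The sketch below uses the values recommended earlier in this post; the function name hangcheck_options is hypothetical, and it always sets hangcheck_reboot=1 because the 2.6 kernel default is 0.

```shell
#!/bin/sh
# Hypothetical helper: print the /etc/modprobe.conf line for an Oracle release,
# using the values recommended in this post.
hangcheck_options() (
    case $1 in
        9i)         tick=30 margin=180 ;;
        10g|11gR1)  tick=1  margin=10  ;;
        *)          echo "unsupported release: $1" >&2; exit 1 ;;
    esac
    echo "options hangcheck-timer hangcheck_tick=$tick hangcheck_margin=$margin hangcheck_reboot=1"
)

hangcheck_options 10g
# prints: options hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
```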


2.3 Loading the module automatically at boot

Load the module now, then add the same command to /etc/rc.d/rc.local:

[root@rac1 ~]# modprobe hangcheck-timer

[root@rac1 ~]# vi /etc/rc.d/rc.local

modprobe hangcheck-timer



[root@rac1 ~]# grep Hangcheck /var/log/messages | tail -2

Sep  7 19:53:03 rac1 kernel: Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).

Sep  7 19:53:03 rac1 kernel: Hangcheck: Using monotonic_clock().
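The startup message is a convenient place to confirm which values the module actually started with. The sketch below extracts them with sed; the function name hangcheck_effective is made up for this example. In the sample output above the reported values (tick 180, margin 60) differ from the 30/180 written to /etc/modprobe.conf in section 2.2, which is exactly the kind of mismatch this check is meant to surface.

```shell
#!/bin/sh
# Reads log lines on stdin; prints "tick=<n> margin=<m>" for each
# hangcheck-timer startup message found.
hangcheck_effective() {
    sed -n 's/.*Hangcheck: starting hangcheck timer.*(tick is \([0-9][0-9]*\) seconds, margin is \([0-9][0-9]*\) *seconds).*/tick=\1 margin=\2/p'
}

# The startup line from this post, in place of a live grep on /var/log/messages.
echo 'Sep  7 19:53:03 rac1 kernel: Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).' |
    hangcheck_effective
# prints: tick=180 margin=60
```

On a real node the input could be: grep Hangcheck /var/log/messages | hangcheck_effective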



[root@rac2 ~]# /sbin/lsmod |grep hangcheck
hangcheck_timer         7897  0























