Oracle CSS parameter settings: configuring heartbeat timeouts

cuizhu0832

Introduction to Oracle Database 10g Release 2 CSS parameters (2009-08-03 14:55:05)

Tags: oracle css it

Category: oracle

Different patch sets of Oracle Database 10g Release 2 use different timeout parameters for CSS access to storage data. This document covers the following Oracle Database 10g Release 2 patch-set versions:

1. Oracle Database 10.2.0.1

2. Oracle Database 10.2.0.1 + Patch for Bug 4896338

3. Oracle Database 10.2.0.2

4. Oracle Database 10.2.0.3

1. Oracle Database 10.2.0.1

Only one CSS parameter is available in this version of Oracle: misscount. It represents both the maximum time in seconds that a heartbeat can be missed before the cluster enters reconfiguration to evict the node, and the maximum time allowed for a voting file I/O to complete.

The default value for misscount is 60 seconds.

2. Oracle Database 10.2.0.1 + Patch 4896338 and Oracle Database 10.2.0.2

Bug 4896338 in Oracle Database 10.2.0.1 is a placeholder bug for the PCW 10.2.0.1 merge for very low brownout. Please refer to www.metalink.oracle.com for more details.

Oracle Database 10.2.0.2 has a fix for this bug.

Three CSS parameters are available in 10.2.0.2 and in 10.2.0.1 with the patch for bug 4896338:

a) misscount - The maximum time in seconds that a heartbeat can be missed before the cluster enters reconfiguration to evict the node.

b) disktimeout - It is the maximum amount of time allowed for a voting file I/O to complete; if this time is exceeded the voting disk will be marked as offline.

c) reboottime - It is the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.

Default values for these parameters are as follows:

misscount = 60 seconds

disktimeout = 200 seconds

reboottime = 3 seconds

Using "crsctl get css disktimeout / reboottime" will not show the parameter value unless you have modified it explicitly. You can check the parameter values in ocssd.log under the $CRS_HOME directory.

CRS internally calculates two parameters, diskshorttimeout and disklongtimeout (both can be checked in ocssd.log), where

a) diskshorttimeout = misscount - reboottime : This value is used during reconfiguration and initial cluster formation as a timeout for voting file I/O to complete.

b) disklongtimeout = disktimeout : This value is used during normal operation of RAC as a timeout for voting file I/O to complete.
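The two derived values can be worked through with a few lines of Python. This is only an illustrative sketch of the arithmetic stated above; the function name derived_timeouts is made up for this example and is not part of any Oracle tool.

```python
def derived_timeouts(misscount=60, disktimeout=200, reboottime=3):
    """Return (diskshorttimeout, disklongtimeout) in seconds.

    Hypothetical helper: it only restates the 10.2.0.2 formulas from the
    text, using the documented defaults as default arguments.
    """
    # Used during reconfiguration and initial cluster formation.
    diskshorttimeout = misscount - reboottime
    # Used during normal operation of RAC.
    disklongtimeout = disktimeout
    return diskshorttimeout, disklongtimeout


short, long_ = derived_timeouts()
print(short, long_)  # 57 200
```

With the defaults, the voting file I/O timeout is 57 seconds during reconfiguration and 200 seconds during normal operation.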

3. Oracle Database 10.2.0.3

This version has the same parameters as Oracle Database 10.2.0.2, with the same default values. There is a slight difference in the internal calculation of these parameter values: if disktimeout is less than the misscount value, then misscount - reboottime is used as the disk timeout both during cluster formation and throughout cluster operation, and the modified disktimeout parameter is ignored.

That is, in Oracle Database 10.2.0.3, diskshorttimeout = disklongtimeout if the css disktimeout parameter is less than css misscount.
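The 10.2.0.3 rule can be sketched as follows. The function name is hypothetical and the code simply restates the rule described above; it is not taken from Oracle's implementation.

```python
def derived_timeouts_10203(misscount=60, disktimeout=200, reboottime=3):
    """Return (diskshorttimeout, disklongtimeout) in seconds for 10.2.0.3.

    Hypothetical helper that restates the rule from the text.
    """
    diskshorttimeout = misscount - reboottime
    if disktimeout < misscount:
        # The configured disktimeout is ignored; the short timeout
        # applies throughout cluster operation as well.
        disklongtimeout = diskshorttimeout
    else:
        disklongtimeout = disktimeout
    return diskshorttimeout, disklongtimeout


print(derived_timeouts_10203())                # defaults: (57, 200)
print(derived_timeouts_10203(disktimeout=30))  # disktimeout < misscount: (57, 57)
```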

4. Recommendations for Oracle Database 10g Release 2 CSS parameter values to be used with NetApp storage:

Because diskshorttimeout = misscount - reboottime, leaving misscount and reboottime at their defaults (60 seconds and 3 seconds respectively) means CSS allows 57 seconds for a voting file access. If a reconfiguration happens during a NetApp storage takeover or giveback, a CRS reboot may therefore occur. The following CSS timeout values are recommended so that Oracle Database 10g Release 2 RAC works smoothly during NetApp storage takeover and giveback.
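To make the reasoning concrete, the sketch below compares a voting file I/O stall against the window misscount - reboottime. The 90-second stall is a made-up number chosen only for illustration, not a measured NetApp takeover time, and the function names are hypothetical. Treating a stall equal to the window as not tolerated is also an assumption.

```python
def voting_io_window(misscount, reboottime=3):
    """Seconds allowed for voting file I/O during reconfiguration."""
    return misscount - reboottime


def stall_tolerated(stall_seconds, misscount, reboottime=3):
    """True if a stall of stall_seconds fits inside the window.

    Assumption: a stall exactly equal to the window is not tolerated.
    """
    return stall_seconds < voting_io_window(misscount, reboottime)


hypothetical_stall = 90  # illustrative value only
print(stall_tolerated(hypothetical_stall, misscount=60))   # False (window is 57 s)
print(stall_tolerated(hypothetical_stall, misscount=120))  # True (window is 117 s)
```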

1. Oracle Database 10.2.0.1

misscount = 120 seconds (default is 60 seconds)

2. Oracle Database 10.2.0.1 + Patch for Bug 4896338

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

3. Oracle Database 10.2.0.2

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

4. Oracle Database 10.2.0.3

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

All of the above recommendations are for the Linux operating system.

Note: Stock versions of Oracle Database 10g Release 2 lower than 10.2.0.2 do not provide all of the configurable CSS parameters, so it is advisable to upgrade Oracle Database to 10.2.0.2 or higher.

Appendix

Commands to check / modify CSS parameters:

1. crsctl get css misscount ---------- to check misscount value

2. crsctl get css disktimeout --------- to check disktimeout value

3. crsctl get css reboottime ---------- to check reboottime value

4. crsctl set css misscount 120 --------- to set misscount to 120 seconds

5. crsctl set css disktimeout 200 ------- to set disktimeout to 200 seconds

6. crsctl set css reboottime 3 ----------- to set reboottime to 3 seconds
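When the same values have to be applied on several clusters, it can help to generate the command lines rather than type them. The following is a small sketch that only builds and prints the "crsctl set css" commands listed above; it does not run them. The names RECOMMENDED and crsctl_set_commands are invented for this example.

```python
# Recommended values from this document (seconds).
RECOMMENDED = {"misscount": 120, "disktimeout": 200, "reboottime": 3}


def crsctl_set_commands(values):
    """Build 'crsctl set css <parameter> <seconds>' command strings.

    Hypothetical helper: it formats text only and never calls crsctl.
    """
    return [f"crsctl set css {name} {seconds}" for name, seconds in values.items()]


for line in crsctl_set_commands(RECOMMENDED):
    print(line)
```

Printing the commands first leaves room to review them before running them on a node.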

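Translation

The CSS MISSCOUNT parameter defines how long, in seconds, a node's heartbeat can be missing before the cluster reconfigures and evicts that node. The default MISSCOUNT on each platform is as follows:
60 seconds on Linux, and 30 seconds on Unix, VMS, and Windows. With clusterware not supplied by Oracle, the default MISSCOUNT is 600 seconds, which gives the other cluster vendor enough time to resolve split-brain in any scenario.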

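CSS heartbeat mechanisms and how they relate to each other:

The synchronization services component of Oracle Clusterware maintains two heartbeat mechanisms:
1. The disk heartbeat to the voting device
2. The network heartbeat across the private interconnect, which verifies the validity of cluster membership.
The two heartbeat mechanisms have an associated timeout. The disk heartbeat has an internal I/O timeout interval (IOT for short), in seconds, within which an I/O to the voting disk must complete. The MISSCOUNT parameter (MC for short) is the length of time, in seconds, for which network heartbeats can be missed. The disk heartbeat I/O timeout interval is directly tied to the MISSCOUNT setting. The relationship is as follows: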
1. In 9.x, MISSCOUNT does not have the meaning described above
2. In 10.1.0.2, these settings do not exist
3. In 10.1.0.3, IOT = MC - 15 seconds
4. In 10.1.0.4, IOT = MC - 15 seconds
5. In 10.1.0.4 with the patch for bug 3306964, IOT = MC - 3 seconds
6. In 10.1.0.4 with the CRS II Merge patch, IOT = Disktimeout (defaults to 200 seconds) normally, or Misscount seconds only during initial cluster formation or slightly before reconfiguration
7. In 10.1.0.5, IOT = MC - 3 seconds
8. In 10.2.0.1 with the patch for bug 4896338, IOT = Disktimeout (defaults to 200 seconds) normally, or Misscount seconds only during initial cluster formation or slightly before reconfiguration
9. In 10.2.0.2, IOT = Disktimeout (defaults to 200 seconds) normally, or Misscount seconds only during initial cluster formation or slightly before reconfiguration
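The version list above can be condensed into a small lookup, sketched here in Python. The version labels used as set members (for example "10.2.0.1+4896338") and the function name io_timeout are made up for this illustration. The sketch follows this list literally, so for the patched versions it returns MISSCOUNT itself during cluster formation or reconfiguration.

```python
# Hypothetical labels for the versions named in the list above.
MC_MINUS_15 = {"10.1.0.3", "10.1.0.4"}
MC_MINUS_3 = {"10.1.0.4+3306964", "10.1.0.5"}
DISKTIMEOUT_BASED = {"10.1.0.4+CRS II Merge", "10.2.0.1+4896338", "10.2.0.2"}


def io_timeout(version, misscount, disktimeout=200, forming_or_reconfiguring=False):
    """Return the voting disk I/O timeout (IOT) in seconds for a version label."""
    if version in MC_MINUS_15:
        return misscount - 15
    if version in MC_MINUS_3:
        return misscount - 3
    if version in DISKTIMEOUT_BASED:
        # Disktimeout normally; Misscount only during initial cluster
        # formation or slightly before reconfiguration.
        return misscount if forming_or_reconfiguring else disktimeout
    raise ValueError(f"no IOT rule listed for version {version!r}")


print(io_timeout("10.1.0.3", misscount=60))   # 45
print(io_timeout("10.1.0.5", misscount=60))   # 57
print(io_timeout("10.2.0.2", misscount=60))   # 200
```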

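Long response times for the voting disk:
If the I/O response time to the voting disk exceeds the IOT, the cluster evicts the node from CSS. Whether this happens depends first on the CRS version, second on whether the merge patch has been applied, and third on the state of the cluster.

There are many possible causes for this problem. The list below covers most cases:
A QLOGIC HBA link that is down for longer than MC
Damaged SAN or storage cabling that causes I/O timeouts
A SAN switch failover that takes longer than the MC setting
An EMC Clariion Array trespassing the SP to the backup SP that takes longer than the MC setting
EMC PowerPath path errors where re-driving and redirecting the I/O takes longer than the MC setting
A NETAPP cluster failover that takes longer than MC
Sustained high CPU load that keeps the CSSD process from pinging the disk
A poor SAN network configuration that makes response times on the I/O path exceed MC
Most cases are related to multipathing software, where reconfiguration after an I/O path failover takes too long

MISSCOUNT should not be modified in the situations listed above.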

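Behavior of 10.2.0.1 with the patch for Bug 4896338 applied
With the patch for bug 4896338 applied to 10.2.0.1, CSS does not evict a node from the cluster when an I/O to the voting disk takes longer than the MC setting, unless this happens during cluster initialization or reconfiguration. So in a cluster of N member nodes, if one node's access to the voting disk exceeds MC, the node is not evicted as long as the disk access completes within DISKTIMEOUT. With this patch applied, there is therefore no need to increase the MISSCOUNT setting. The patch introduces the DISKTIMEOUT parameter, which is the maximum time that an access to the voting disk is allowed to take (a default is provided).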

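The conditions under which eviction occurs are as follows:
1. Neither ping times out: no eviction
2. The network ping completes within MC and the disk ping exceeds MC but completes within DISKTIMEOUT: no eviction
3. The network ping completes within MC and the disk ping exceeds DISKTIMEOUT: the node is evicted
4. The network ping exceeds the MC setting and the disk ping completes within MC: the node is evicted
By default, MC is less than the DISKTIMEOUT setting.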
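The four cases above amount to a simple decision rule, sketched below in Python for normal cluster operation (not initial formation or reconfiguration). The function name is_evicted is hypothetical, and treating a ping that takes exactly the limit as still in time is an assumption of this sketch.

```python
def is_evicted(network_ping, disk_ping, misscount=60, disktimeout=200):
    """Return True if the node is evicted, given ping times in seconds.

    Hypothetical helper that encodes the four cases from the text.
    Assumption: a ping taking exactly the limit counts as in time.
    """
    if network_ping > misscount:
        return True   # case 4: the network heartbeat was missed for too long
    if disk_ping > disktimeout:
        return True   # case 3: the voting disk I/O exceeded DISKTIMEOUT
    return False      # cases 1 and 2: disk I/O finished within DISKTIMEOUT


print(is_evicted(network_ping=1, disk_ping=1))    # case 1: False
print(is_evicted(network_ping=1, disk_ping=90))   # case 2: False
print(is_evicted(network_ping=1, disk_ping=250))  # case 3: True
print(is_evicted(network_ping=70, disk_ping=1))   # case 4: True
```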

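MISSCOUNT drives the reconfiguration of cluster membership and directly affects cluster availability. In most cases the default MC setting is acceptable. Changing the default MISSCOUNT affects not only the timeout for voting disk I/O but also the timeout for the network heartbeat over the private interconnect. Points to consider when changing the default MISSCOUNT value:
Increasing MISSCOUNT to compensate for slow I/O response directly lengthens the reconfiguration time after a network failure. The network heartbeat is the primary indicator of connectivity between the nodes of the cluster, and MISSCOUNT is how many missed "check ins" are tolerated before a cluster reconfiguration is triggered. Increasing MISSCOUNT prolongs the time needed to detect a network failure, which directly affects cluster availability.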

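If MISSCOUNT was changed because of response-time problems in the underlying disks, set it back to the default as soon as the underlying disk problem is resolved.
If the cluster is built on clusterware supplied by a third party, do not change the default MISSCOUNT. Changing it in such an environment leads to more overhead and potential risk.
MISSCOUNT should not be changed in the following cases:
1. To avoid timeouts caused by underlying configuration or hardware problems
2. Where a very high MISSCOUNT setting would directly affect cluster and database availability.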

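Oracle RAC 10g Release 2 allows multiple voting disks to be configured, so you do not have to rely on the storage vendor's multipathing to deal with disk access problems. You can configure up to 32 voting disks.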

The MISSCOUNT setting can be changed as follows:
1. First stop CRS:
You can stop it with a script.
The script location on each platform is as follows:
* For Solaris, the scripts are in /etc/init.d/
* For HP, the scripts are in /sbin/init.d
* For AIX, the scripts are in /etc
* For Linux, the scripts are in /etc/init.d
In the path above, run init.crs stop to stop CRS and init.crs start to start it.
In 10g Release 2 you can use crsctl stop crs to stop CRS and crsctl start crs to start it.

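2. Set MISSCOUNT to n
1) Shut down CRS on all nodes except one
2) Run $ORA_CRS_HOME/bin/crsctl set css misscount n
where n is the voting disk response time + 1
3) Restart the node on which the MISSCOUNT parameter was changed (whether this is required needs to be tested)
4) Start CRS on all other nodes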

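10.2.0.1 with the patch for bug 4896338 has two additional parameters that can be tuned. This change is included directly in 10.2.0.2 and 10.1.0.6.
1) reboottime (default 3 seconds) - The interval from the node being evicted by CSS to the start of the reboot, that is, the time the machine needs to shut down completely when you reboot it. (Is this interval perhaps meant to let other services shut down cleanly?)
2) disktimeout (default 200 seconds) - The maximum time allowed for a voting disk I/O to complete. If this time is reached, the voting disk is marked OFFLINE.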
Note that this is also the amount of time that will be required for initial cluster formation, i.e. when no nodes have previously been up and in a cluster.

$CRS_HOME/bin/crsctl set css reboottime [-force] ( is seconds)
$CRS_HOME/bin/crsctl set css disktimeout [-force] ( is seco

The ocrdump command can be used to view the MISSCOUNT setting, along with information such as OCR backup times and paths and the OCR disks.

From the ITPUB blog, link: http://blog.itpub.net/90618/viewspace-668717/. If you repost this article, please credit the source; otherwise legal liability will be pursued.

Reposted from: http://blog.itpub.net/90618/viewspace-668717/