Oracle RAC CSS timeout computation and the misscount and disktimeout parameters

ciyangliln703681

http://blog.csdn.net/tianlesoftware/article/details/6728885

1. Overview

An earlier article:

Some conceptual and theoretical background on RAC

http://blog.csdn.net/tianlesoftware/article/details/5331067

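That article noted that OCSSD is the most critical Clusterware process: if it fails, the node reboots. OCSSD provides the CSS (Cluster Synchronization Service) service, which monitors cluster state in real time through several heartbeat mechanisms and provides basic cluster services such as split-brain protection.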

CSS has two heartbeat mechanisms: the Network Heartbeat over the private network, and the Disk Heartbeat through the Voting Disk.

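Each heartbeat has a maximum allowed delay. For the Disk Heartbeat it is called IOT (I/O Timeout); for the Network Heartbeat it is called MC (Misscount). Both parameters are measured in seconds, and by default IOT is greater than MC. By default Oracle determines both values automatically, and changing them is not recommended.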

The parameter values can be checked with the following commands:

$ crsctl get css disktimeout

$ crsctl get css misscount

For example:

[oracle@rac1 ~]$ crsctl get css disktimeout

200

[oracle@rac1 ~]$ crsctl get css misscount

These are the default values of the two parameters.
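The relationship described above (IOT greater than MC by default) can be checked with a small script. This is only a sketch: `is_uint` and `check_css_timeouts` are hypothetical helper names, and the values are passed in as arguments so the check can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Oracle Clusterware.

# is_uint VALUE: succeeds only if VALUE is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# check_css_timeouts DISKTIMEOUT MISSCOUNT (both in seconds)
# Reports whether the disk heartbeat timeout (IOT) is larger than the
# network heartbeat timeout (MC), which is the default relationship.
check_css_timeouts() {
    if ! is_uint "$1" || ! is_uint "$2"; then
        echo "error: expected two integer values in seconds"
        return 2
    fi
    if [ "$1" -gt "$2" ]; then
        echo "ok: disktimeout ($1s) > misscount ($2s)"
    else
        echo "warn: disktimeout ($1s) <= misscount ($2s)"
        return 1
    fi
}
```

On a cluster node the arguments could come from the two commands above, for example `check_css_timeouts "$(crsctl get css disktimeout)" "$(crsctl get css misscount)"`, assuming crsctl prints a bare number as in the sample output; releases that print a longer message would need the number extracted first.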

2. Related articles on MOS

How to start/stop the 10g CRS ClusterWare [ID 309542.1]

10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout [ID 284752.1]

CSS Timeout Computation in Oracle Clusterware [ID 294430.1]

RAC Assurance Support Team: RAC and Oracle Clusterware Starter Kit and Best Practices (Generic) [ID 810394.1]

2.1 Steps to modify CSS misscount:

1)Shut down CRS on all but one node. For exact steps use Note 309542.1

2)Execute crsctl as root to modify the misscount:

$ORA_CRS_HOME/bin/crsctl set css misscount <n>

where <n> is the maximum i/o latency to the voting disk + 1 second

3)Reboot the node where adjustment was made

4)Start all other nodes shutdown in step 1
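The rule in step 2 (maximum voting disk i/o latency plus one second) is simple arithmetic. The sketch below works it through and prints the resulting command without running it. The function names are hypothetical, and the latency is assumed to be a whole number of seconds that has already been measured by some other means.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; nothing here calls crsctl.

# misscount_from_latency MAX_IO_LATENCY_SECONDS
# Applies the rule from step 2: misscount = max voting disk i/o latency + 1.
misscount_from_latency() {
    case "$1" in
        ''|*[!0-9]*)
            echo "error: latency must be a whole number of seconds" >&2
            return 2 ;;
    esac
    echo $(( $1 + 1 ))
}

# print_set_misscount MAX_IO_LATENCY_SECONDS
# Prints (does not execute) the command from step 2 with the computed value.
print_set_misscount() {
    mc=$(misscount_from_latency "$1") || return 2
    printf '%s\n' "\$ORA_CRS_HOME/bin/crsctl set css misscount $mc"
}
```

For example, `print_set_misscount 29` prints the command with a misscount of 30. Printing instead of executing is a deliberate choice here, since the real change must be made as root with CRS shut down on all but one node, as the steps above describe.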

With the Patch:4896338 for 10.2.0.1 there are two additional settings that can be tuned. This change is incorporated into the 10.2.0.2 and 10.1.0.6 patchsets.

The following are only relevant on 10.2.0.1 with Patch:4896338. In addition to MissCount, CSS now has two more parameters:

1)reboottime (default 3 seconds) - the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted (i.e. how long it takes for the machine to completely shut down when you do a reboot).

2)disktimeout (default 200 seconds) - the maximum amount of time allowed for a voting file I/O to complete; if this time is exceeded the voting disk will be marked as offline. Note that this is also the amount of time that will be required for initial cluster formation, i.e. when no nodes have previously been up and in a cluster.

$CRS_HOME/bin/crsctl set css reboottime <n> [-force] (<n> is seconds)

$CRS_HOME/bin/crsctl set css disktimeout <n> [-force] (<n> is seconds)

Confirm the new css misscount setting via ocrdump

2.2 CSS Timeout Computation in Oracle Clusterware

2.2.1 MISSCOUNT DEFINITION AND DEFAULT VALUES
The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before entering into a cluster reconfiguration to evict the node. The following are the default values for the misscount parameter and their respective versions when using Oracle Clusterware* in seconds:

*CSS misscount default value when using vendor (non-Oracle) clusterware is 600 seconds. This is to allow the vendor clusterware ample time to resolve any possible split brain scenarios.

On AIX platforms with HACMP starting with 10.2.0.3 BP#1, the misscount is 30. This is documented in Note 551658.1

2.2.2 CSS HEARTBEAT MECHANISMS AND THEIR INTERRELATIONSHIP
The synchronization services component (CSS) of the Oracle Clusterware maintains two heartbeat mechanisms which establish and confirm valid node membership in the cluster:

1.) the disk heartbeat to the voting device and

2.) the network heartbeat across the interconnect.

Both of these heartbeat mechanisms have an associated timeout value. The disk heartbeat has an internal i/o timeout interval (DTO, Disk TimeOut), in seconds, within which an i/o to the voting disk must complete. The misscount parameter (MC), as stated above, is the maximum time, in seconds, that a network heartbeat can be missed. The disk heartbeat i/o timeout interval is directly related to the misscount parameter setting. There has been some variation in this relationship between versions as described below:

9.x.x.x	NOTE, MISSCOUNT WAS A DIFFERENT ENTITY IN THIS RELEASE
10.1.0.2	No one should be on this version
10.1.0.3	DTO = MC - 15 seconds
10.1.0.4	DTO = MC - 15 seconds
10.1.0.4 + unpublished Bug 3306964	DTO = MC - 3 seconds
10.1.0.4 with CRS II Merge patch	DTO = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
10.1.0.5	IOT = MC - 3 seconds
10.2.0.1 + fix for unpublished Bug 4896338	IOT = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
10.2.0.2	Same as above (10.2.0.1 with Patch Bug:4896338)
10.1 - 11.1	During node join and leave (reconfiguration) in a cluster we need to reconfigure, in that particular case we use Short Disk TimeOut (SDTO) which is in all versions SDTO = MC â