If the MTU size of the heartbeat (cluster interconnect) network cards differs between cluster nodes, the RAC instances may fail to start.
APPLIES TO:
Oracle Server - Enterprise Edition - Version 9.0.1.0 to 11.2.0.3 [Release 9.0.1 to 11.2]
Information in this document applies to any platform.
***Checked for relevance on 07-Jan-2010***
SYMPTOMS
If the MTU size on the network cards used for the interconnect differs across the cluster member nodes, the RAC instance(s) will not start.
CHANGES
Network configuration
CAUSE
The MTU size is set on the private network interface (the heartbeat interface used for the interconnect). For example, the interfaces on two cluster member nodes:
node 1
eth0 Link encap:Ethernet HWaddr 00:0E:0C:08:4B:D5
inet addr: xxx.x.x.x Bcast:xxx.x.x.x Mask:255.255.255.0
inet6 addr: fe80::20e:cff:fe08:4bd5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
node 2
eth0 Link encap:Ethernet HWaddr 00:0E:0C:08:03:59
inet addr: xxx.x.x.x Bcast:xxx.x.x.x Mask:255.255.255.0
inet6 addr: fe80::20e:cff:fe08:359/64 Scope:Link
UP BROADCAST RUNNING MULTICAST *MTU:1500* Metric:1
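The mismatch above can also be spotted mechanically. As a minimal sketch (assuming ifconfig-style output; the two sample lines below mirror the node 1 / node 2 listing above), a small shell function can extract the MTU field so the values collected on each node can be compared:

```shell
# Sketch: pull the MTU value out of an ifconfig status line so the
# values gathered from each cluster node can be compared. The sample
# lines copy the node 1 / node 2 output shown above.
node1_line='UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1'
node2_line='UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1'

get_mtu() {
  # Print the number that follows "MTU:"
  printf '%s\n' "$1" | sed -n 's/.*MTU:\([0-9][0-9]*\).*/\1/p'
}

mtu1=$(get_mtu "$node1_line")
mtu2=$(get_mtu "$node2_line")

if [ "$mtu1" != "$mtu2" ]; then
  # This is exactly the configuration that makes startup hang
  echo "MTU mismatch: node1=$mtu1 node2=$mtu2"
else
  echo "MTU consistent: $mtu1"
fi
```

In practice the same function would be fed the live output of `/sbin/ifconfig eth0` on each node rather than hard-coded sample lines.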
If different MTU sizes are configured, instance startup will hang with the following errors in the alert log:
Tue Mar 1 01:50:35 2005
lmon registered with NM - instance id 2 (internal mem no 1)
Tue Mar 1 01:50:36 2005
Reconfiguration started (old inc 0, new inc 2)
List of nodes:
0 1
Global Resource Directory frozen
Update rdomain variables
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Tue Mar 1 01:55:44 2005
IPC Send timeout to 0.0 inc 9 for msg type 53 from opid 5
Tue Mar 1 01:59:25 2005
Trace dumping is performing id=[cdmp_20050301095925]
Tue Mar 1 01:59:31 2005
Reconfiguration started (old inc 2, new inc 3)
List of nodes:
1
Typically you see timeouts in the alert log and in the traces of the background processes (LMD and LMON).
SOLUTION
- Identify the interface being used by Oracle RAC with oradebug ipc (see Metalink Note 181489.1).
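A hedged sketch of that first step (setmypid and ipc are real oradebug subcommands; the exact trace-file contents and location vary by release and are assumptions here):

```sql
SQL> oradebug setmypid     -- attach oradebug to the current session
SQL> oradebug ipc          -- dump IPC information to a trace file
-- Inspect the resulting trace file (under user_dump_dest in these
-- releases); it records the IP address/interface the instance is
-- actually using for the interconnect.
```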
- Check the network configuration of that interface, for example with ifconfig: /sbin/ifconfig eth0
- Ping the IP address of the interconnect network card with a packet size that fits all interfaces. Use the -M do switch to avoid packet fragmentation, for example:
ping <nodename> -s <biggest-size-that fits> -M do
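The <biggest-size-that fits> value follows from the smallest MTU on the path: the IPv4 header (20 bytes) and the ICMP header (8 bytes) are subtracted from the MTU to get the largest ping payload that travels unfragmented. A sketch of that arithmetic, using the 1500-byte MTU from the node 2 output above:

```shell
# Largest ICMP payload that fits without fragmentation:
# MTU minus the 20-byte IPv4 header minus the 8-byte ICMP header.
MTU=1500
PAYLOAD=$((MTU - 20 - 8))
echo "largest non-fragmenting payload for MTU $MTU: $PAYLOAD"
# For MTU 1500 this yields 1472, i.e. ping <nodename> -s 1472 -M do
```

If a ping with that payload succeeds on one node's interface but fails on another's, the failing interface has the smaller MTU.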
- Configure the cluster interconnect interfaces to have the same MTU size on all cluster member nodes.
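On Linux the fix can be applied both immediately and persistently. A sketch, assuming a Red Hat-style system where eth0 is the interconnect interface and 9000 is the MTU agreed for all nodes (both values are assumptions to adapt to your environment):

```shell
# Immediate change (lost on reboot) - run as root on every node:
/sbin/ifconfig eth0 mtu 9000

# Persistent change on Red Hat-style systems: add MTU=9000 to the
# interface configuration file, then restart networking:
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0
```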
PS: The official documentation (http://docs.oracle.com/cd/B19306_01/server.102/b14237/initparams025.htm#REFRN10017) adds the following:
The CLUSTER_INTERCONNECTS parameter specifies a private network for the cluster; it influences which network interfaces the GCS and GES services use.
The parameter serves two main purposes:
1. Overriding the default interconnect.
2. Adding bandwidth when a single network cannot meet the bandwidth requirements of a RAC database.
CLUSTER_INTERCONNECTS explicitly overrides:
1. The network classifications stored in the OCR (viewable with the oifcfg command).
2. The default interconnect chosen by Oracle.
The default value is null; the parameter can contain one or more IP addresses separated by colons.
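For completeness, a hedged example of setting the parameter per instance (the IP addresses and SID names are placeholders, not values from this note):

```sql
-- Pin each instance's interconnect explicitly; one colon-separated
-- list of addresses per instance (hypothetical addresses shown):
ALTER SYSTEM SET cluster_interconnects = '10.0.0.1'
  SCOPE = SPFILE SID = 'rac1';
ALTER SYSTEM SET cluster_interconnects = '10.0.0.2'
  SCOPE = SPFILE SID = 'rac2';
```

A restart is required for the SPFILE-scoped change to take effect.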