一次RAC单实例DOWN机的诊断

最新推荐文章于 2024-09-15 08:03:51 发布

ckawt40802

最新推荐文章于 2024-09-15 08:03:51 发布

阅读量108

点赞数

文章标签：数据库操作系统

客户的10.2.0.4 RAC环境前几天出现了单节点DOWN机，帮客户诊断问题所在。

首先检查数据库的告警日志：

Fri Jun 17 18:32:17 2011
ospid 14177: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmon_14175.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms0_14179.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms3_14191.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms4_14195.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms6_14203.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms5_14199.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms2_14187.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms7_14207.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms1_14183.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmd0_14177.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_asmb_14255.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_diag_14171.trc

可以看到，告警日志中出现了很多ORA-29702错误，导致这个错误的原因显然是CLUSTER层出现了问题，再看最后出现的ORA-15064和ORA-3113错误，说明ASM实例也出现了问题，导致数据库实例崩溃。

此外，错误一开始的时候出现了网络地址不可用的问题，以前偶尔在其他RAC环境的告警日志中也看到过类似的错误，一般是出现了这个错误后，1分钟后再次检测的时候，网络地址就恢复了正常，而类似这种单独出现的很少见。问题是否会与这个地址告警有关还不好说，不过既然现有错误提示和ASM以及CLUSTER有关系，那么首先检查ASM上的错误日志：

Fri Jun 17 18:32:08 2011
ospid 13967: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/+ASM/bdump/+asm2_lmon_13964.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/+ASM/bdump/+asm2_diag_13955.trc
Fri Jun 17 18:32:24 2011
Trace dumping is performing id=[cdmp_20110617183224]

ASM的错误日志并无新奇之处，错误和数据库实例相仿，包含了网络地址不可用的信息，以及ORA-29702错误，ASM的问题是CLUSTER上的错误导致的。

继续检查CLUSTER日志：

2011-06-17 18:31:52.706
[cssd(12435)]CRS-1612:node tgpsndb1 (1) at 50% heartbeat fatal, eviction in 29.094 seconds
2011-06-17 18:32:07.706
[cssd(12435)]CRS-1611:node tgpsndb1 (1) at 75% heartbeat fatal, eviction in 14.094 seconds
2011-06-17 18:32:16.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 5.084 seconds
2011-06-17 18:32:17.707
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 4.094 seconds
2011-06-17 18:32:18.709
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 3.094 seconds
2011-06-17 18:32:19.711
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 2.094 seconds
2011-06-17 18:32:20.713
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 1.094 seconds
2011-06-17 18:32:21.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 0.094 seconds
[cssd(12435)]Invalid msg id [1609]
2011-06-17 18:36:21.314
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.

很显然，这时由于网络心跳超时导致的当前节点被踢出CLUSTER。

那么造成网络超时的原因是否和前面实例以及ASM错误日志中的IP地址不可用有关呢，继续检查操作系统上的日志：

Jun 17 18:31:23 tgpsndb2 kernel: igb: eth1 NIC Link is Down
Jun 17 18:31:23 tgpsndb2 kernel: igb: eth0 NIC Link is Down
Jun 17 18:32:22 tgpsndb2 logger: Oracle CSSD failure 134.
Jun 17 18:32:22 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:22 tgpsndb2 kernel: SysRq : Resetting
Jun 17 18:32:22 tgpsndb2 logger: Oracle clsomon failed with fatal status 12.
Jun 17 18:32:23 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:23 tgpsndb2 logger: Oracle init script. ceding reboot to sibling 11929.
Jun 17 18:32:24 tgpsndb2 logger: Cluster Ready Services completed waiting on dependencies.

在错误发生时刻，操作系统上两个网卡都DOWN掉了，而此后就是CLUSTER的错误以及准备重启的信息，显然这次实例自动重启的真正原因就在于此。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/4227/viewspace-701570/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-701570/