1. Host Environment
OS: Red Hat Enterprise Linux 6.6
2. Symptom
A node-eviction fault occurred on a 4+6 engineered-system RAC cluster. Starting at 10:17 on October 16, 2016, the Grid Infrastructure alert log on host yxfkdb2 reported the following:
2016-10-16 10:17:18.871:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.540 seconds
2016-10-16 10:17:18.872:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb3 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.020 seconds
2016-10-16 10:17:18.872:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb4 (4) missing for 50% of timeout interval. Removal of this node from cluster in 14.700 seconds
2016-10-16 10:17:25.875:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb3 (3) missing for 75% of timeout interval. Removal of this node from cluster in 7.010 seconds
2016-10-16 10:17:26.876:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.530 seconds
2016-10-16 10:17:26.877:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb4 (4) missing for 75% of timeout interval. Removal of this node from cluster in 6.690 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.530 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb3 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.010 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds
The log shows that node 2 has lost network heartbeats with the other three nodes and will be evicted once the timeout expires. This pattern usually points to a RAC private-interconnect problem, yet analysis of the OSWatcher data shows no anomaly on the private network.
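The warning percentages map directly onto the CSS misscount, the network-heartbeat timeout (30 seconds by default on Linux; on a live node it can be read with `crsctl get css misscount`). A minimal sketch of the arithmetic behind the CRS-1612/1611/1610 messages, assuming the default value:

```shell
#!/bin/sh
# Time remaining before eviction at each CSS warning threshold.
# MISSCOUNT=30 is the assumed Linux default; confirm on the cluster with
# "crsctl get css misscount".
MISSCOUNT=30

remaining_at() {
  # seconds left once the given percentage of the timeout has elapsed
  echo $(( MISSCOUNT - MISSCOUNT * $1 / 100 ))
}

echo "CRS-1612 (50%): ~$(remaining_at 50)s left"
echo "CRS-1611 (75%): ~$(remaining_at 75)s left"
echo "CRS-1610 (90%): ~$(remaining_at 90)s left"
```

The ~14.5 s, ~7 s and ~2.5 s countdowns in the log above are consistent with a 30-second misscount.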
3. Root Cause Analysis
Node 2 was evicted because its private-network heartbeats were lost:
The cluster alert log alertyxfkdb2.log shows the heartbeat loss: node 2 could not communicate with nodes 1, 3 and 4, and after the roughly 30-second timeout expired, node 2 was evicted (shortly after 2016-10-16 10:17:30.878).
The common causes of network heartbeat loss are:
1) a private-network problem;
2) the OCSSD process or its threads cannot be scheduled;
3) the database is hung;
4) an OS bug.
1) Private-network heartbeat: OSWatcher (OSW) had been deployed, and its traceroute samples around the incident are shown below.
zzz ***Sun Oct 16 10:10:12 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.055 ms  0.047 ms
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.073 ms  0.087 ms  0.102 ms
traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets
 1  yxfkdb4-priv2 (172.16.131.39)  0.071 ms  0.060 ms  0.071 ms
zzz ***Sun Oct 16 10:10:17 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.060 ms  0.071 ms
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.107 ms  0.096 ms  0.089 ms
traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets
 1  yxfkdb4-priv2 (172.16.131.39)  0.097 ms  0.092 ms  0.090 ms
zzz ***Sun Oct 16 10:10:22 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.075 ms  0.061 ms  0.062 ms
traceroute to 172.16.130.37 (172.16.130.37), 30 hops max, 60 byte packets
connect: Network is unreachable
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.121 ms  0.092 ms  0.101 ms
traceroute to 172.16.130.39 (172.16.130.39), 30 hops max, 60 byte packets
The samples show that private-network communication on the 172.16.131.x interconnect was normal, with no packet loss.
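Rather than eyeballing every sample, the OSW traceroute archive can be scanned for error strings. A small sketch (the file name is an assumption; point it at the actual OSW network output file):

```shell
#!/bin/sh
# Scan an OSWatcher traceroute archive for connectivity errors, printing
# each error with the two preceding lines (the "zzz" timestamp for context).
scan_osw() {
  grep -B2 -E 'unreachable|100% packet loss' "$1" || echo "no errors found"
}

# usage: scan_osw oswtraceroute.dat   (hypothetical file name)
```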
2) OCSSD process or its threads cannot be scheduled
Neither alertyxfkdb2.log nor ocssd.log contains any "not scheduled for" messages, so CPU-scheduling starvation can be ruled out.
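This check can be scripted. When OCSSD threads are starved of CPU, Oracle typically logs lines containing "not scheduled for" (for example from clssscMonitorThreads); a sketch of the check, with the log path left as a placeholder:

```shell
#!/bin/sh
# Check ocssd.log for CPU-starvation messages; their absence points away
# from a scheduling problem.
check_sched() {
  if grep -q 'not scheduled for' "$1"; then
    echo "scheduling starvation found"
  else
    echo "no scheduling messages"
  fi
}

# usage: check_sched $GRID_HOME/log/<node>/cssd/ocssd.log
```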
3) Database hang
AWR shows low CPU consumption, normal wait events with average waits mostly at 0 ms, and an AAS (Average Active Sessions) below 2; both CPU and AAS figures are well within normal ranges.
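As a sanity check, AAS can be recomputed from the AWR header figures as DB time divided by elapsed time (the numbers below are hypothetical, for illustration only):

```shell
#!/bin/sh
# Average Active Sessions from AWR header values.
# Pass the snapshot's "DB time" and "Elapsed" in the same unit (e.g. minutes);
# the figures used here are made up for illustration.
aas() {
  awk -v dbt="$1" -v el="$2" 'BEGIN { printf "%.1f\n", dbt / el }'
}

aas 110 60   # 110 min DB time over a 60-min snapshot
```

An AAS well below the host's CPU count is consistent with the "database not overloaded" reading of the AWR report.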
The DB alert and GI alert logs show that the node eviction happened first, and the instance restart followed.
DB alert log:
Sun Oct 16 10:17:35 2016
NOTE: ASMB terminating
Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1770 Serial number: 7
Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1770 Serial number: 7
ASMB (ospid: 37372): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 37372
Sun Oct 16 10:19:05 2016
Starting ORACLE instance (normal)
GI alert log:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds
2016-10-16 10:17:33.861:
[cssd(18931)]CRS-1608:This node was evicted by node 3,
A database hang can therefore be ruled out.
4) OS bug
The RHEL 6.6 kernel 2.6.32-504.el6.x86_64 contains a bug (the futex_wait missed-wakeup bug) that makes processes hang for no apparent reason. The published description reads:
“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”
RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix. [May 13, 2015]
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).
In short, the bug makes user processes deadlock or hang in seemingly impossible situations; the affected releases are 6.6 and 7.1, and the fix is already included in the 6.6.z kernels.
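A quick way to flag whether a host runs one of the affected base kernels, as a sketch (z-stream kernels are reported separately, since whether a given z-stream contains the fix must be confirmed against Red Hat's errata):

```shell
#!/bin/sh
# Flag base kernels known to carry the futex_wait bug:
# RHEL 6.6 -> 2.6.32-504, RHEL 7.1 -> 3.10.0-229.
# z-stream kernels (e.g. 2.6.32-504.16.2) are reported as "check errata",
# since the fix landed in a later 6.6.z update.
futex_bug_affected() {
  case "$1" in
    2.6.32-504.el6*|3.10.0-229.el7*) echo yes ;;
    2.6.32-504.*|3.10.0-229.*)       echo "check errata" ;;
    *)                               echo no ;;
  esac
}

# on a live host: futex_bug_affected "$(uname -r)"
```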
In summary: this node eviction is attributed to the OS kernel bug, and upgrading the kernel is recommended to resolve it.
4. Solution
Upgrade the OS kernel to a release that contains the fix (a 6.6.z kernel).
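After `yum update kernel` and a reboot, the running kernel can be checked against the required z-stream level. The threshold below (504.16) is an assumption about the first fixed 6.6.z release and must be confirmed against the relevant Red Hat erratum before relying on it:

```shell
#!/bin/sh
# Verify (after reboot) that the running 6.6.z kernel is at or above an
# assumed fixed z-stream level. FIXED_Z=16 is an assumption; confirm the
# real first-fixed release in Red Hat's errata.
FIXED_Z=16

kernel_fixed() {
  # extract the first z-stream number after "2.6.32-504."
  z=$(echo "$1" | sed -n 's/^2\.6\.32-504\.\([0-9][0-9]*\)\..*/\1/p')
  if [ -n "$z" ] && [ "$z" -ge "$FIXED_Z" ]; then
    echo fixed
  else
    echo "check errata"
  fi
}

# on a live host: kernel_fixed "$(uname -r)"
```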
Source: ITPUB blog, http://blog.itpub.net/31134212/viewspace-2127487/ (please credit the source when republishing).