Node eviction

I. Host environment

OS: Red Hat Enterprise Linux 6.6

 

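II. Symptoms

 

A 4+6 all-in-one appliance cluster hit a node eviction that took an instance down:

 

On host yxfkdb2, the Grid alert log showed the following warnings at 10:17 on 2016-10-16: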

 

2016-10-16 10:17:18.871:

[cssd(18931)]CRS-1612:Network communication with node yxfkdb1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.540 seconds

2016-10-16 10:17:18.872:

[cssd(18931)]CRS-1612:Network communication with node yxfkdb3 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.020 seconds

2016-10-16 10:17:18.872:

[cssd(18931)]CRS-1612:Network communication with node yxfkdb4 (4) missing for 50% of timeout interval. Removal of this node from cluster in 14.700 seconds

2016-10-16 10:17:25.875:

[cssd(18931)]CRS-1611:Network communication with node yxfkdb3 (3) missing for 75% of timeout interval. Removal of this node from cluster in 7.010 seconds

2016-10-16 10:17:26.876:

[cssd(18931)]CRS-1611:Network communication with node yxfkdb1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.530 seconds

2016-10-16 10:17:26.877:

[cssd(18931)]CRS-1611:Network communication with node yxfkdb4 (4) missing for 75% of timeout interval. Removal of this node from cluster in 6.690 seconds

2016-10-16 10:17:30.878:

[cssd(18931)]CRS-1610:Network communication with node yxfkdb1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.530 seconds

2016-10-16 10:17:30.878:

[cssd(18931)]CRS-1610:Network communication with node yxfkdb3 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.010 seconds

2016-10-16 10:17:30.878:

[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds

 

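The log shows that node 2 lost its network heartbeat with the other three nodes and would be evicted once the timeout expired. This is most often caused by a problem on the RAC private interconnect, but the OSWatcher data showed no abnormality on the private network.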

 

III. Root cause analysis

Loss of the private network heartbeat led to the eviction of node 2.

 

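The cluster alert log alertyxfkdb2.log records the heartbeat loss: node 2 could not communicate with nodes 1, 3, and 4. Once the 30-second timeout expired, node 2 was evicted. The last warning was logged at 2016-10-16 10:17:30.878, and the eviction itself (CRS-1608) at 10:17:33.861.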
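Each CRS-1612/1611/1610 warning carries a timestamp and the time remaining before removal, so adding the two gives the moment at which CSSD intends to drop the peer. The following is a minimal Python sketch of that arithmetic. The function and variable names are illustrative, the sample is limited to the yxfkdb1 entries quoted in this article, and the 30-second timeout used to estimate when the last heartbeat arrived is taken from the text above, not read from the cluster configuration.

```python
import re
from datetime import datetime, timedelta

# Timestamp lines look like "2016-10-16 10:17:18.871:"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):?\s*$")
# Warning lines CRS-1610 / CRS-1611 / CRS-1612
MSG_RE = re.compile(
    r"CRS-161[0-2]:Network communication with node (\S+) \((\d+)\) "
    r"missing for (\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in ([\d.]+) seconds"
)

def projected_deadlines(log_text):
    """Return {node_name: [projected removal time, ...]} from alert log text."""
    deadlines = {}
    stamp = None
    for raw in log_text.splitlines():
        line = raw.strip()
        ts = TS_RE.match(line)
        if ts:
            stamp = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        msg = MSG_RE.search(line)
        if msg and stamp is not None:
            remaining = timedelta(seconds=float(msg.group(4)))
            deadlines.setdefault(msg.group(1), []).append(stamp + remaining)
    return deadlines

SAMPLE = """\
2016-10-16 10:17:18.871:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.540 seconds
2016-10-16 10:17:26.876:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.530 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.530 seconds
"""

ASSUMED_TIMEOUT = timedelta(seconds=30)  # assumption: the 30 s quoted in the text

for node, times in projected_deadlines(SAMPLE).items():
    for deadline in times:
        last_heartbeat = deadline - ASSUMED_TIMEOUT
        print(node, "deadline", deadline.time(), "last heartbeat about", last_heartbeat.time())
```

For yxfkdb1 the three warnings project to 10:17:33.411, 10:17:33.406, and 10:17:33.408, which agree with one another and fall just before the CRS-1608 eviction message at 10:17:33.861. Under the 30-second assumption, the last heartbeat from that node would have arrived at about 10:17:03.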

 

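Network heartbeat loss generally has one of the following causes:

1. Private network problems
2. The OCSSD process or its child processes cannot be scheduled
3. Database hang
4. OS bug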

 

1. Private network heartbeat: OSWatcher (OSW) was deployed, and its private network traceroute output is as follows:

zzz ***Sun Oct 16 10:10:12 CST 2016

traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets

 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.055 ms  0.047 ms

traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets

 1  yxfkdb3-priv2 (172.16.131.37)  0.073 ms  0.087 ms  0.102 ms

traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets

 1  yxfkdb4-priv2 (172.16.131.39)  0.071 ms  0.060 ms  0.071 ms

zzz ***Sun Oct 16 10:10:17 CST 2016

traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets

 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.060 ms  0.071 ms

traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets

 1  yxfkdb3-priv2 (172.16.131.37)  0.107 ms  0.096 ms  0.089 ms

traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets

 1  yxfkdb4-priv2 (172.16.131.39)  0.097 ms  0.092 ms  0.090 ms

zzz ***Sun Oct 16 10:10:22 CST 2016

traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets

 1  yxfkdb1-priv2 (172.16.131.33)  0.075 ms  0.061 ms  0.062 ms

traceroute to 172.16.130.37 (172.16.130.37), 30 hops max, 60 byte packets

connect: Network is unreachable

traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets

 1  yxfkdb3-priv2 (172.16.131.37)  0.121 ms  0.092 ms  0.101 ms

traceroute to 172.16.130.39 (172.16.130.39), 30 hops max, 60 byte packets

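As the output shows, private network communication is normal: every probe on the 172.16.131.x interconnect returns in roughly 0.05 to 0.12 ms, and no packets were lost.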
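Reading many OSWatcher snapshots by eye is error-prone, so a small script can flag anything unusual. The Python sketch below scans traceroute output for targets that report "unreachable" or whose round-trip time exceeds a threshold. The function name and the 1 ms threshold are assumptions chosen for illustration, and the sample is the 10:10:22 snapshot quoted above.

```python
import re

# Tolerates both "traceroute to 1.2.3.4" and the space-stripped "tracerouteto1.2.3.4"
TARGET_RE = re.compile(r"traceroute\s*to\s*(\d+(?:\.\d+){3})")
# Round-trip times such as "0.067 ms" or "0.067ms"
RTT_RE = re.compile(r"(\d+\.\d+)\s*ms")

def check_traceroute(text, threshold_ms=1.0):
    """Return (target, problem) pairs for unreachable or slow targets."""
    problems = []
    target = None
    for line in text.splitlines():
        hit = TARGET_RE.search(line)
        if hit:
            target = hit.group(1)
            continue
        if target is None:
            continue  # snapshot header or text before the first probe
        if "unreachable" in line.lower():
            problems.append((target, "unreachable"))
            continue
        rtts = [float(v) for v in RTT_RE.findall(line)]
        if rtts and max(rtts) > threshold_ms:
            problems.append((target, f"slow: {max(rtts)} ms"))
    return problems

SNAPSHOT = """\
zzz ***Sun Oct 16 10:10:22 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.075 ms  0.061 ms  0.062 ms
traceroute to 172.16.130.37 (172.16.130.37), 30 hops max, 60 byte packets
connect: Network is unreachable
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.121 ms  0.092 ms  0.101 ms
"""

for target, problem in check_traceroute(SNAPSHOT):
    print(target, problem)
```

On this snapshot the sketch reports nothing for the 172.16.131.x addresses and flags only the "Network is unreachable" reply for 172.16.130.37, which is worth checking against the interface configuration before treating it as a fault.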

 

2. The OCSSD process or its child processes cannot be scheduled

Neither alertyxfkdb2.log nor ocssd.log contains messages such as "not scheduled for".
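The same check can be repeated across rotated log files with a short script. The Python sketch below returns the line numbers that contain the phrase "not scheduled for". Only that phrase comes from this article: the function name is illustrative, and the second sample line is a made-up placeholder, not real ocssd.log output.

```python
def find_scheduling_warnings(lines, needle="not scheduled for"):
    """Return (line_number, text) for every line containing the needle."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line.lower():
            hits.append((number, line.rstrip()))
    return hits

SAMPLE_LINES = [
    # Real line from the alert log quoted in this article
    "[cssd(18931)]CRS-1612:Network communication with node yxfkdb1 (1) missing for 50% of timeout interval.",
    # Placeholder only: invented text to exercise the match
    "EXAMPLE ONLY: some thread not scheduled for N msecs",
]

for number, text in find_scheduling_warnings(SAMPLE_LINES):
    print(number, text)

# Against a real file, pass the open file object (the path is site-specific):
# with open("/path/to/ocssd.log", errors="replace") as fh:
#     print(find_scheduling_warnings(fh))
```

An empty result for both logs corresponds to the finding above, namely that CSSD did not report being starved of CPU time.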

 

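3. Database hang

The AWR report shows very low CPU consumption and normal wait events, most of them with an average wait time of about 0 ms. AAS (average active sessions) is below 2, so database CPU, AAS, and the other AWR metrics are all normal.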


The DB alert log and the GI alert log show that the node eviction happened first and the instance restart followed.

DB ALERT

Sun Oct 16 10:17:35 2016

NOTE: ASMB terminating

Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 1770 Serial number: 7

Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 1770 Serial number: 7

ASMB (ospid: 37372): terminating the instance due to error 15064

Instance terminated by ASMB, pid = 37372

Sun Oct 16 10:19:05 2016

Starting ORACLE instance (normal)

 

GI ALERT

[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds

2016-10-16 10:17:33.861:

[cssd(18931)]CRS-1608:This node was evicted by node 3,

A database hang can therefore be ruled out.
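The ordering argument depends on comparing timestamps written in two different formats. The Python sketch below parses the three timestamps quoted above and sorts them. It assumes both logs are stamped from the same clock and time zone and that the process runs in an English or C locale (needed for `%a` and `%b`); the labels and the function name are illustrative.

```python
from datetime import datetime

GI_FORMAT = "%Y-%m-%d %H:%M:%S.%f"   # e.g. 2016-10-16 10:17:33.861
DB_FORMAT = "%a %b %d %H:%M:%S %Y"   # e.g. Sun Oct 16 10:17:35 2016

# (label, timestamp as quoted in this article, format)
EVENTS = [
    ("DB alert: ASMB terminating", "Sun Oct 16 10:17:35 2016", DB_FORMAT),
    ("GI alert: CRS-1608 node evicted", "2016-10-16 10:17:33.861", GI_FORMAT),
    ("DB alert: Starting ORACLE instance", "Sun Oct 16 10:19:05 2016", DB_FORMAT),
]

def ordered(events):
    """Return (datetime, label) pairs sorted chronologically."""
    parsed = [(datetime.strptime(ts, fmt), label) for label, ts, fmt in events]
    return sorted(parsed)

for when, label in ordered(EVENTS):
    print(when.isoformat(sep=" ", timespec="milliseconds"), label)
```

The DB alert log only has one-second resolution, so the gap of about one second between the eviction and the ASMB termination is approximate, but the order is unambiguous.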

 

4. OS bug

The RHEL 6.6 kernel 2.6.32-504.el6.x86_64 contains a bug that can cause processes to hang for no apparent reason. The bug is described as follows:

“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”

 

RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.

RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday. there does not yet appear to be a 7.x fix.  [May 13, 2015]

RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).

 

 

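According to this description, user processes can deadlock or hang in seemingly impossible situations. The affected versions are 6.6 and 7.1.

The bug is fixed in 6.6.z.

In summary, with the private network, OCSSD scheduling, and a database hang ruled out, this node eviction is attributed to the OS bug, and upgrading the kernel is recommended to resolve it.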

 

IV. Solution

Upgrade the OS kernel to a release that contains the fix (6.6.z) to resolve the OS bug.
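Before and after the upgrade it helps to confirm which kernel each node is actually running. The Python sketch below compares the running release with the one build this article names as affected. It is deliberately narrow: the article does not state the build number of the fixed kernel, so the check only recognises the exact string 2.6.32-504.el6.x86_64, and the function name is illustrative.

```python
import platform

# The only affected build named in this article
REPORTED_KERNEL = "2.6.32-504.el6.x86_64"

def runs_reported_kernel(release=None):
    """True if the given (or current) kernel release is the build named above."""
    if release is None:
        release = platform.release()  # same string as `uname -r` on Linux
    return release == REPORTED_KERNEL

if runs_reported_kernel():
    print("This host runs", REPORTED_KERNEL, "and should be scheduled for a kernel upgrade")
else:
    print("Running kernel:", platform.release())
```

A negative result does not prove that a host is unaffected. It only shows that the host is not on the specific build discussed here, so the vendor errata remain the authority on which builds carry the fix.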

 

 

Source: ITPUB Blog, link: http://blog.itpub.net/31134212/viewspace-2127487/. If you repost this article, please credit the source; otherwise legal action may be pursued.
