1. Host Environment
OS: Red Hat Enterprise Linux 6.6
2. Symptom
A node-eviction fault occurred on a 4+6 engineered-system RAC cluster. Starting at 10:17 on October 16, 2016, the Grid Infrastructure alert log on host yxfkdb2 reported the following:
2016-10-16 10:17:18.871:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.540 seconds
2016-10-16 10:17:18.872:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb3 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.020 seconds
2016-10-16 10:17:18.872:
[cssd(18931)]CRS-1612:Network communication with node yxfkdb4 (4) missing for 50% of timeout interval. Removal of this node from cluster in 14.700 seconds
2016-10-16 10:17:25.875:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb3 (3) missing for 75% of timeout interval. Removal of this node from cluster in 7.010 seconds
2016-10-16 10:17:26.876:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.530 seconds
2016-10-16 10:17:26.877:
[cssd(18931)]CRS-1611:Network communication with node yxfkdb4 (4) missing for 75% of timeout interval. Removal of this node from cluster in 6.690 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.530 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb3 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.010 seconds
2016-10-16 10:17:30.878:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds
The log shows that node 2 has lost network heartbeats with the other three nodes and will be evicted once the timeout expires. This pattern usually points to a RAC private-interconnect problem, yet analysis of the OSWatcher data shows no anomaly on the private network.
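The warning percentages map directly onto the CSS misscount, the network-heartbeat timeout (30 seconds by default on Linux; on a live node it can be read with `crsctl get css misscount`). A minimal sketch of the arithmetic behind the CRS-1612/1611/1610 messages, assuming the default value:

```shell
#!/bin/sh
# Time remaining before eviction at each CSS warning threshold.
# MISSCOUNT=30 is the assumed Linux default; confirm on the cluster with
# "crsctl get css misscount".
MISSCOUNT=30

remaining_at() {
  # seconds left once the given percentage of the timeout has elapsed
  echo $(( MISSCOUNT - MISSCOUNT * $1 / 100 ))
}

echo "CRS-1612 (50%): ~$(remaining_at 50)s left"
echo "CRS-1611 (75%): ~$(remaining_at 75)s left"
echo "CRS-1610 (90%): ~$(remaining_at 90)s left"
```

The ~14.5 s, ~7 s and ~2.5 s countdowns in the log above are consistent with a 30-second misscount.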
3. Root Cause Analysis
Node 2 was evicted because its private-network heartbeats were lost:
The cluster alert log alertyxfkdb2.log shows the heartbeat loss: node 2 could not communicate with nodes 1, 3 and 4, and after the roughly 30-second timeout expired, node 2 was evicted (shortly after 2016-10-16 10:17:30.878).
The common causes of network heartbeat loss are:
1) a private-network problem;
2) the OCSSD process or its threads cannot be scheduled;
3) the database is hung;
4) an OS bug.
1) Private-network heartbeat: OSWatcher (OSW) had been deployed, and its traceroute samples around the incident are shown below.
zzz ***Sun Oct 16 10:10:12 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.055 ms  0.047 ms
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.073 ms  0.087 ms  0.102 ms
traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets
 1  yxfkdb4-priv2 (172.16.131.39)  0.071 ms  0.060 ms  0.071 ms
zzz ***Sun Oct 16 10:10:17 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.067 ms  0.060 ms  0.071 ms
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.107 ms  0.096 ms  0.089 ms
traceroute to 172.16.131.39 (172.16.131.39), 30 hops max, 60 byte packets
 1  yxfkdb4-priv2 (172.16.131.39)  0.097 ms  0.092 ms  0.090 ms
zzz ***Sun Oct 16 10:10:22 CST 2016
traceroute to 172.16.131.33 (172.16.131.33), 30 hops max, 60 byte packets
 1  yxfkdb1-priv2 (172.16.131.33)  0.075 ms  0.061 ms  0.062 ms
traceroute to 172.16.130.37 (172.16.130.37), 30 hops max, 60 byte packets
connect: Network is unreachable
traceroute to 172.16.131.37 (172.16.131.37), 30 hops max, 60 byte packets
 1  yxfkdb3-priv2 (172.16.131.37)  0.121 ms  0.092 ms  0.101 ms
traceroute to 172.16.130.39 (172.16.130.39), 30 hops max, 60 byte packets
The samples show that private-network communication on the 172.16.131.x interconnect was normal, with no packet loss.
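Rather than eyeballing every sample, the OSW traceroute archive can be scanned for error strings. A small sketch (the file name is an assumption; point it at the actual OSW network output file):

```shell
#!/bin/sh
# Scan an OSWatcher traceroute archive for connectivity errors, printing
# each error with the two preceding lines (the "zzz" timestamp for context).
scan_osw() {
  grep -B2 -E 'unreachable|100% packet loss' "$1" || echo "no errors found"
}

# usage: scan_osw oswtraceroute.dat   (hypothetical file name)
```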
2) OCSSD process or its threads cannot be scheduled
Neither alertyxfkdb2.log nor ocssd.log contains any "not scheduled for" messages, so CPU-scheduling starvation can be ruled out.
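This check can be scripted. When OCSSD threads are starved of CPU, Oracle typically logs lines containing "not scheduled for" (for example from clssscMonitorThreads); a sketch of the check, with the log path left as a placeholder:

```shell
#!/bin/sh
# Check ocssd.log for CPU-starvation messages; their absence points away
# from a scheduling problem.
check_sched() {
  if grep -q 'not scheduled for' "$1"; then
    echo "scheduling starvation found"
  else
    echo "no scheduling messages"
  fi
}

# usage: check_sched $GRID_HOME/log/<node>/cssd/ocssd.log
```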
3) Database hang
AWR shows low CPU consumption, normal wait events with average waits mostly at 0 ms, and an AAS (Average Active Sessions) below 2; both CPU and AAS figures are well within normal ranges.
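As a sanity check, AAS can be recomputed from the AWR header figures as DB time divided by elapsed time (the numbers below are hypothetical, for illustration only):

```shell
#!/bin/sh
# Average Active Sessions from AWR header values.
# Pass the snapshot's "DB time" and "Elapsed" in the same unit (e.g. minutes);
# the figures used here are made up for illustration.
aas() {
  awk -v dbt="$1" -v el="$2" 'BEGIN { printf "%.1f\n", dbt / el }'
}

aas 110 60   # 110 min DB time over a 60-min snapshot
```

An AAS well below the host's CPU count is consistent with the "database not overloaded" reading of the AWR report.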
The DB alert and GI alert logs show that the node eviction happened first, and the instance restart followed.
DB alert log:
Sun Oct 16 10:17:35 2016
NOTE: ASMB terminating
Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1770 Serial number: 7
Errors in file /opt/oracle/diag/rdbms/sbycfkdb/sbycfkdb2/trace/sbycfkdb2_asmb_37372.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1770 Serial number: 7
ASMB (ospid: 37372): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 37372
Sun Oct 16 10:19:05 2016
Starting ORACLE instance (normal)
GI alert log:
[cssd(18931)]CRS-1610:Network communication with node yxfkdb4 (4) missing for 90% of timeout interval. Removal of this node from cluster in 2.690 seconds
2016-10-16 10:17:33.861:
[cssd(18931)]CRS-1608:This node was evicted by node 3,
A database hang can therefore be ruled out.
4) OS bug
The RHEL 6.6 kernel 2.6.32-504.el6.x86_64 contains a bug (the futex_wait missed-wakeup bug) that makes processes hang for no apparent reason. The published description reads:
“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”
RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix. [May 13, 2015]
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).
In short, the bug makes user processes deadlock or hang in seemingly impossible situations; the affected releases are 6.6 and 7.1, and the fix is already included in the 6.6.z kernels.
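A quick way to flag whether a host runs one of the affected base kernels, as a sketch (z-stream kernels are reported separately, since whether a given z-stream contains the fix must be confirmed against Red Hat's errata):

```shell
#!/bin/sh
# Flag base kernels known to carry the futex_wait bug:
# RHEL 6.6 -> 2.6.32-504, RHEL 7.1 -> 3.10.0-229.
# z-stream kernels (e.g. 2.6.32-504.16.2) are reported as "check errata",
# since the fix landed in a later 6.6.z update.
futex_bug_affected() {
  case "$1" in
    2.6.32-504.el6*|3.10.0-229.el7*) echo yes ;;
    2.6.32-504.*|3.10.0-229.*)       echo "check errata" ;;
    *)                               echo no ;;
  esac
}

# on a live host: futex_bug_affected "$(uname -r)"
```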
In summary: this node eviction is attributed to the OS kernel bug, and upgrading the kernel is recommended to resolve it.
4. Solution
Upgrade the OS kernel to a release that contains the fix (a 6.6.z kernel).
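After `yum update kernel` and a reboot, the running kernel can be checked against the required z-stream level. The threshold below (504.16) is an assumption about the first fixed 6.6.z release and must be confirmed against the relevant Red Hat erratum before relying on it:

```shell
#!/bin/sh
# Verify (after reboot) that the running 6.6.z kernel is at or above an
# assumed fixed z-stream level. FIXED_Z=16 is an assumption; confirm the
# real first-fixed release in Red Hat's errata.
FIXED_Z=16

kernel_fixed() {
  # extract the first z-stream number after "2.6.32-504."
  z=$(echo "$1" | sed -n 's/^2\.6\.32-504\.\([0-9][0-9]*\)\..*/\1/p')
  if [ -n "$z" ] && [ "$z" -ge "$FIXED_Z" ]; then
    echo fixed
  else
    echo "check errata"
  fi
}

# on a live host: kernel_fixed "$(uname -r)"
```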
Source: ITPUB blog, http://blog.itpub.net/31134212/viewspace-2127487/ (please credit the source when republishing).