Oracle node eviction: why a node gets evicted in Oracle RAC 10.1.0 - 11.1.0.7

There are four possible causes of a node eviction:

Kernel hang or extreme load on the system (OPROCD and/or hangcheck timer).

Lost heartbeat over the interconnect.

Lost heartbeat to the voting disk.

OCLSMON detects a CSSD hang.

The title speaks of causes, but a node eviction is a symptom of another problem, not the cause itself. Always keep this in mind when investigating why a node was evicted.

How a kernel hang is detected depends on the operating system. On Windows or Linux it is detected by the hangcheck timer; on other Unix environments OPROCD is started. From Oracle 10.2.0.4 onward OPROCD is also active on Linux (still install the hangcheck timer). To determine whether the hangcheck timer or OPROCD caused the node eviction, check the OS logfiles for hangcheck timer messages and, for OPROCD, the OPROCD logfile.

Starting with the 10.2.0.3 patchset, a node eviction can also be triggered by OCLSMON. This Clusterware process checks whether there is a problem with CSSD; if so, it kills the CSSD daemon, which leads to the eviction. When this occurs, check the oclsmon logfile and contact Oracle Support. This note does not focus on these causes, but on lost heartbeats.

Below are two examples of a lost-heartbeat symptom. The OCSSD background process takes care of the heartbeats, and the cssd.log file contains detailed information about the node eviction. After an eviction, check the cssd.log files on all nodes in the cluster, starting with the evicted node. The exact log messages can change between patchsets and Oracle releases.
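When comparing cssd.log files from several nodes, it can help to merge their entries into one timeline. The following is a minimal sketch, assuming the entry layout seen in the excerpts of this note (`[    CSSD]<timestamp> [<thread>] ><LEVEL>: <message>`) and assuming the node clocks are reasonably in sync. The function names and the node labels passed in are hypothetical, and other releases may log a different layout.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts in this note; not an official format definition.
ENTRY = re.compile(
    r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] >(\w+):\s*(.*)"
)

def parse_entries(node_name, log_text):
    """Return (timestamp, node_name, level, message) tuples for one node's log text."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip continuation lines and anything in another layout
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        entries.append((stamp, node_name, match.group(3), match.group(4)))
    return entries

def merged_timeline(logs_by_node):
    """Merge the entries of all nodes and sort them by timestamp."""
    timeline = []
    for node_name, log_text in logs_by_node.items():
        timeline.extend(parse_entries(node_name, log_text))
    return sorted(timeline)
```

Reading the merged timeline around the eviction time, starting from the entries of the evicted node, shows which node noticed the missing heartbeat first.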

Node eviction due to lost interconnect heartbeat.

Oracle 11g

[    CSSD]2008-11-20 10:59:36.510 [1220598112] >TRACE:  clssnmCheckDskSleepTime: Node 3, dbq0223, dead, last DHB (1227175136, 73583764) after NHB (1227175121, 73568724), but LATS - current (39090) > DTO (27000)

[    CSSD]2008-11-20 10:59:36.512 [1147169120] >TRACE:  clssnmReadDskHeartbeat: node 1, dbq0123,has a disk HB, but no network HB, DHB has rcfg 122475875, wrtcnt, 164452, LATS 58728604, lastSeqNo 164452, timestamp 1227175122/73251784

[    CSSD]2008-11-20 10:59:37.513 [1199618400] >WARNING: clssnmPollingThread: node dbq0227 (5) at 90% heartbeat fatal, eviction in 1.660 seconds

[    CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE:  clssnmSendSync: syncSeqNo(122475875)

[    CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE:  clssnm_print_syncacklist: syncacklist (4)
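The clssnmPollingThread WARNING in the 11g excerpt above names the node that is about to be evicted and the time left. A minimal sketch for pulling those values out of a cssd.log line follows; the pattern is modeled only on the single line quoted in this note, and the function name is hypothetical.

```python
import re

# Pattern modeled on the one WARNING line quoted in this note (an assumption);
# other patchsets may word the message differently.
POLL_WARNING = re.compile(
    r"clssnmPollingThread: node (?P<node>\S+) \((?P<num>\d+)\) at\s+"
    r"(?P<pct>\d+)% heartbeat fatal, eviction in (?P<secs>[\d.]+) seconds"
)

def parse_poll_warning(line):
    """Return the node, node number, percentage and seconds left, or None."""
    match = POLL_WARNING.search(line)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "node_number": int(match.group("num")),
        "percent": int(match.group("pct")),
        "seconds_left": float(match.group("secs")),
    }
```

Applied to every line of a cssd.log, this gives a quick list of which nodes were reported as missing network heartbeats and when.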

Oracle 10g

[    CSSD]2006-10-18 23:49:06.199 [3600] >TRACE:   clssnmCheckDskInfo: Checking disk info...

[    CSSD]2006-10-18 23:49:06.199 [3600] >TRACE:   clssnmCheckDskInfo: node(2) timeout(172) state_network(0) state_disk(3) missCount(30)

[    CSSD]2006-10-18 23:49:06.226 [1] >USER:    NMEVENT_SUSPEND [00][00][00][06]

[    CSSD]2006-10-18 23:49:07.028 [1030] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634353) LATS(2345204583) Disk lastSeqNo(634353)

[    CSSD]2006-10-18 23:49:07.199 [3600] >TRACE:   clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(31)

[    CSSD]2006-10-18 23:49:08.032 [1030] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634354) LATS(2345205587) Disk lastSeqNo(634354)

[    CSSD]2006-10-18 23:49:08.199 [3600] >TRACE:   clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(32)

[    CSSD]2006-10-18 23:49:09.199 [3600] >TRACE:   clssnmCheckDskInfo: node(2) timeout(1167) state_network(0) state_disk(3) missCount(33)

[    CSSD]2006-10-18 23:49:10.199 [3600] >TRACE:   clssnmCheckDskInfo: node(2) timeout(2167) state_network(0) state_disk(3) missCount(33)

…….

[    CSSD]2006-10-18 23:49:18.571 [3086] >WARNING: clssnmPollingThread: state(0) clusterState(2) exit

[    CSSD]2006-10-18 23:49:18.572 [1287] >ERROR:   clssnmvDiskKillCheck: Evicted by node 1, sync 23, stamp -1949751541,

[    CSSD]2006-10-18 23:49:18.698 [3600] >TRACE:   0x110013a80 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Here clssnmvDiskKillCheck reports that this node was evicted by node 1.

Because the interconnect is lost, the disk kill check is done using poison packets sent through the voting disk.

Possible action: check the availability of the network adapters, look for heavy network load or port scans, and check the OS logfiles for reported errors related to the interconnect.
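The 10g excerpt above shows two things worth extracting: the rising missCount per node in the clssnmCheckDskInfo lines and the final "Evicted by node" message. The following sketch collects both. The patterns are assumptions based only on the lines quoted in this note, and the function name is hypothetical.

```python
import re

# Both patterns are modeled on the 10g lines quoted in this note (an assumption).
MISSCOUNT = re.compile(r"clssnmCheckDskInfo: node\((\d+)\).*?missCount\((\d+)\)")
EVICTED = re.compile(r"clssnmvDiskKillCheck: Evicted by node (\d+)")

def summarize_interconnect_loss(log_text):
    """Return ({node: highest missCount seen}, evicting node number or None)."""
    highest_miss = {}
    for node, count in MISSCOUNT.findall(log_text):
        node_id = int(node)
        highest_miss[node_id] = max(highest_miss.get(node_id, 0), int(count))
    match = EVICTED.search(log_text)
    evicted_by = int(match.group(1)) if match else None
    return highest_miss, evicted_by
```

A missCount that keeps climbing for one node, followed by an eviction message, matches the lost-interconnect pattern shown in the excerpt.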

Node eviction due to lost voting disk heartbeat.

Below is an example where we lose the heartbeat to the voting disk.

[    CSSD]2006-10-11 00:35:33.658 [1801] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[alligator] seq[9] sync[15]

[    CSSD]2006-10-11 00:35:36.956 [1801] >TRACE:   clssnmHandleSync: diskTimeout set to (27000)ms

[    CSSD]2006-10-11 00:35:36.957 [1801] >WARNING: CLSSNMCTX_NODEDB_UNLOCK: lock held for 3300 ms

[    CSSD]2006-10-11 00:35:36.956 [1544] >TRACE:   clssnmDiskPMT: stale disk (32490 ms) (0//dev/rora_vote_raw)

[    CSSD]2006-10-11 00:35:36.966 [1544] >ERROR:   clssnmDiskPMT: 1 of 1 voting disks unavailable (0/0/1)

[    CSSD]2006-10-11 00:35:37.043 [2058] >TRACE:   clssgmClientConnectMsg: Connect from con(112a8a9f0) proc(112a8f9d0) pid(480150) proto(10:2:1:1)

[    CSSD]2006-10-11 00:35:37.960 [3343] >TRACE:   clscsendx: (11145a3f0) Physical connection (111459b30) not active

[    CSSD]2006-10-11 00:35:37.051 [1] >USER:    NMEVENT_SUSPEND [00][00][00]06]

Possible action: check the availability of the disk subsystem and the OS logfiles for reported errors related to the voting disk.
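In the excerpt above the voting disk is reported stale for 32490 ms while diskTimeout is set to 27000 ms, and the next entry declares the voting disk unavailable. The sketch below extracts both numbers and compares them. Treating "stale time greater than diskTimeout" as the trigger is a reading of this one excerpt, not documented behaviour, and the patterns and function name are hypothetical.

```python
import re

# Patterns modeled on the lines quoted in this note (an assumption).
DISK_TIMEOUT = re.compile(r"diskTimeout set to \((\d+)\)ms")
STALE_DISK = re.compile(r"clssnmDiskPMT: stale disk \((\d+) ms\) \(([^)]*)\)")

def stale_voting_disks(log_text):
    """List stale-disk reports and whether each exceeds the first logged diskTimeout."""
    timeout_match = DISK_TIMEOUT.search(log_text)
    timeout_ms = int(timeout_match.group(1)) if timeout_match else None
    findings = []
    for match in STALE_DISK.finditer(log_text):
        stale_ms = int(match.group(1))
        findings.append({
            "disk": match.group(2),
            "stale_ms": stale_ms,
            "timeout_ms": timeout_ms,
            # Only meaningful when a diskTimeout value was found in the log.
            "exceeded": timeout_ms is not None and stale_ms > timeout_ms,
        })
    return findings
```

For the excerpt above this flags the single voting disk as exceeding the timeout, which fits the "1 of 1 voting disks unavailable" error that follows it.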

Trace the heartbeat: if needed, you can enable a higher level of tracing to debug the heartbeat. The first command below enables level 5 tracing; the second sets the level back to 0, which disables the extra tracing again. Keep in mind that this makes cssd.log grow quickly (about 4 lines are added every second).

crsctl debug log css CSSD:5

crsctl debug log css CSSD:0
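To get a feel for how much level 5 tracing adds, the rate of 4 lines per second can be turned into a rough estimate. This is only a back-of-the-envelope sketch: the average line length of 150 bytes is an assumption chosen for illustration, not a measured value, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def extra_log_lines(lines_per_second, seconds):
    """Number of extra trace lines written over the given period."""
    return lines_per_second * seconds

def extra_log_megabytes(lines, avg_line_bytes=150):
    """Approximate size in MiB; avg_line_bytes is an assumed, not measured, value."""
    return lines * avg_line_bytes / (1024 * 1024)

# One day of level 5 tracing at the rate mentioned in this note.
lines_per_day = extra_log_lines(4, SECONDS_PER_DAY)
megabytes_per_day = extra_log_megabytes(lines_per_day)
```

At 4 lines per second this comes to 345600 extra lines per day, so it is worth setting the level back to 0 as soon as the problem has been captured.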

NOTICE: A node eviction is a symptom of another problem!
