Oracle connection kicks: causes of node eviction in Oracle RAC 10.1.0 - 11.1.0.7

Latest recommended article published on 2021-04-13 11:15:25

MaxWhut2017

There are four possible causes of a node eviction:

Kernel hang or extreme load on the system (OPROCD and/or hangcheck timer).

Lost heartbeat on the interconnect.

Lost heartbeat to the voting disk.

OCLSMON detects a CSSD hang.

The title speaks of causes, but a node eviction is itself a symptom of another problem, not the cause. Always keep this in mind when investigating why a node was evicted.

How a kernel hang is detected depends on the operating system. On Windows and Linux it is based on the hangcheck timer; on other Unix platforms OPROCD is started. From Oracle 10.2.0.4 onward OPROCD is also active on Linux (still install the hangcheck timer). To determine whether the hangcheck timer or OPROCD caused the node eviction, check the OS log files for hangcheck timer messages and, for OPROCD, check the OPROCD log file.

Another possible trigger is OCLSMON, starting with the 10.2.0.3 patchset. This Clusterware process checks whether there is a problem with CSSD. If there is, it kills the CSSD daemon, which leads to the eviction. When this occurs, check the oclsmon log file and contact Oracle Support. This note does not focus on these causes, but on lost heartbeats.

Below are two examples of a lost-heartbeat symptom. The OCSSD background process takes care of the heartbeats, and the cssd.log file contains detailed information about the node eviction. After an eviction, check the cssd.log files on all nodes in the cluster, starting with the evicted node. The information that is logged can change between patchsets and Oracle releases.
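Comparing the cssd.log files of several nodes is easier when their entries are interleaved by time. The following is a minimal sketch of that idea, not an Oracle tool. It assumes the `[ CSSD]<date> <time>` prefix seen in the excerpts of this note and assumes the node clocks are synchronized. The function name `merge_node_logs` and the node names `nodeA` and `nodeB` are hypothetical, and the two sample lines are borrowed from the 10g excerpt purely to show the interleaving.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the cssd.log excerpts of this note.
TS_RE = re.compile(r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def merge_node_logs(logs_by_node):
    """Interleave log lines from several nodes by timestamp.

    logs_by_node maps a node name to a list of cssd.log lines.
    Lines without a recognizable timestamp are skipped.
    """
    merged = []
    for node, lines in logs_by_node.items():
        for line in lines:
            m = TS_RE.search(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
                merged.append((ts, node, line.strip()))
    # Sort on the timestamp only; the sort is stable, so entries with
    # equal timestamps keep the order in which they were read.
    merged.sort(key=lambda item: item[0])
    return merged

# Hypothetical node names; the lines come from the 10g excerpt in this note.
logs = {
    "nodeA": ["[ CSSD]2006-10-18 23:49:07.199 [3600] >TRACE: clssnmCheckDskInfo: "
              "node(2) disk HB found, network state 0, disk state(3) missCount(31)"],
    "nodeB": ["[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: "
              "Checking disk info..."],
}
for ts, node, line in merge_node_logs(logs):
    print(ts.time(), node, line)
```

A full sort is used instead of a streaming merge because, as the excerpts in this note show, entries within one cssd.log are not always in strict timestamp order.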

Node eviction due to a lost interconnect heartbeat.

Oracle 11g

[ CSSD]2008-11-20 10:59:36.510 [1220598112] >TRACE: clssnmCheckDskSleepTime: Node 3, dbq0223, dead, last DHB (1227175136, 73583764) after NHB (1227175121, 73568724), but LATS - current (39090) > DTO (27000)

[ CSSD]2008-11-20 10:59:36.512 [1147169120] >TRACE: clssnmReadDskHeartbeat: node 1, dbq0123,has a disk HB, but no network HB, DHB has rcfg 122475875, wrtcnt, 164452, LATS 58728604, lastSeqNo 164452, timestamp 1227175122/73251784

[ CSSD]2008-11-20 10:59:37.513 [1199618400] >WARNING: clssnmPollingThread: node dbq0227 (5) at 90% heartbeat fatal, eviction in 1.660 seconds

[ CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE: clssnmSendSync: syncSeqNo(122475875)

[ CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE: clssnm_print_syncacklist: syncacklist (4)
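When scanning a large cssd.log for entries like these, it can help to split each entry into its fields and pick out the polling-thread warning that announces an imminent eviction. The sketch below is illustrative only. Both regular expressions are derived solely from the sample lines in this note, so they may need adjusting for other patchsets, and the function names `parse_entry` and `pending_eviction` are hypothetical.

```python
import re
from datetime import datetime

# One entry: "[ CSSD]<date> <time> [<thread>] ><LEVEL>: <message>"
# (layout assumed from the excerpts in this note).
ENTRY_RE = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<thread>\d+)\] >(?P<level>[A-Z]+):\s*(?P<msg>.*)"
)
# The "heartbeat fatal" warning as it appears in the 11g excerpt.
EVICT_RE = re.compile(
    r"clssnmPollingThread: node (?P<node>\S+) \((?P<num>\d+)\) at "
    r"(?P<pct>\d+)% heartbeat fatal, eviction in (?P<secs>[\d.]+) seconds"
)

def parse_entry(line):
    """Split one log line into time, thread, level and message, or return None."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "thread": int(m.group("thread")),
        "level": m.group("level"),
        "message": m.group("msg"),
    }

def pending_eviction(entry):
    """Return (node name, node number, percent, seconds left) or None."""
    m = EVICT_RE.search(entry["message"])
    if m is None:
        return None
    return (m.group("node"), int(m.group("num")),
            int(m.group("pct")), float(m.group("secs")))

sample = ("[ CSSD]2008-11-20 10:59:37.513 [1199618400] >WARNING: "
          "clssnmPollingThread: node dbq0227 (5) at 90% heartbeat fatal, "
          "eviction in 1.660 seconds")
entry = parse_entry(sample)
print(entry["level"], pending_eviction(entry))
```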

Oracle 10g

[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: Checking disk info...

[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(172) state_network(0) state_disk(3) missCount(30)

[ CSSD]2006-10-18 23:49:06.226 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]

[ CSSD]2006-10-18 23:49:07.028 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634353) LATS(2345204583) Disk lastSeqNo(634353)

[ CSSD]2006-10-18 23:49:07.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(31)

[ CSSD]2006-10-18 23:49:08.032 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634354) LATS(2345205587) Disk lastSeqNo(634354)

[ CSSD]2006-10-18 23:49:08.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(32)

[ CSSD]2006-10-18 23:49:09.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(1167) state_network(0) state_disk(3) missCount(33)

[ CSSD]2006-10-18 23:49:10.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(2167) state_network(0) state_disk(3) missCount(33)

…….

[ CSSD]2006-10-18 23:49:18.571 [3086] >WARNING: clssnmPollingThread: state(0) clusterState(2) exit

[ CSSD]2006-10-18 23:49:18.572 [1287] >ERROR: clssnmvDiskKillCheck: Evicted by node 1, sync 23, stamp -1949751541,

[ CSSD]2006-10-18 23:49:18.698 [3600] >TRACE: 0x110013a80 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Here clssnmvDiskKillCheck reports that this node was evicted by node 1.

Because the interconnect is lost, the kill check is delivered as a poison packet through the voting disk.
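The 10g excerpt shows two signals worth extracting: the rising missCount for a node and the final clssnmvDiskKillCheck entry naming the evicting node. The following is a rough sketch of how they could be pulled out of a list of log lines. The patterns are taken only from the sample above, and the function name `summarize` is hypothetical.

```python
import re

# Patterns based on the 10g sample lines in this note.
MISS_RE = re.compile(r"clssnmCheckDskInfo: node\((\d+)\).*missCount\((\d+)\)")
KILL_RE = re.compile(r"clssnmvDiskKillCheck: Evicted by node (\d+)")

def summarize(lines):
    """Return (highest missCount seen per node, evicting node number or None)."""
    miss = {}
    evicted_by = None
    for line in lines:
        m = MISS_RE.search(line)
        if m:
            node, count = int(m.group(1)), int(m.group(2))
            miss[node] = max(count, miss.get(node, 0))
        k = KILL_RE.search(line)
        if k:
            evicted_by = int(k.group(1))
    return miss, evicted_by

sample = [
    "[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: "
    "node(2) timeout(172) state_network(0) state_disk(3) missCount(30)",
    "[ CSSD]2006-10-18 23:49:07.199 [3600] >TRACE: clssnmCheckDskInfo: "
    "node(2) disk HB found, network state 0, disk state(3) missCount(31)",
    "[ CSSD]2006-10-18 23:49:09.199 [3600] >TRACE: clssnmCheckDskInfo: "
    "node(2) timeout(1167) state_network(0) state_disk(3) missCount(33)",
    "[ CSSD]2006-10-18 23:49:18.572 [1287] >ERROR: clssnmvDiskKillCheck: "
    "Evicted by node 1, sync 23, stamp -1949751541,",
]
miss_counts, evicted_by = summarize(sample)
print(miss_counts, evicted_by)
```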

Possible action: check the availability of the network adapters, look for heavy network load or port scans, and check the OS log files for reported errors related to the interconnect.

Node eviction due to a lost voting disk heartbeat.

Below is an example where the heartbeat to the voting disk is lost.

[ CSSD]2006-10-11 00:35:33.658 [1801] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[alligator] seq[9] sync[15]

[ CSSD]2006-10-11 00:35:36.956 [1801] >TRACE: clssnmHandleSync: diskTimeout set to (27000)ms

[ CSSD]2006-10-11 00:35:36.957 [1801] >WARNING: CLSSNMCTX_NODEDB_UNLOCK: lock held for 3300 ms

[ CSSD]2006-10-11 00:35:36.956 [1544] >TRACE: clssnmDiskPMT: stale disk (32490 ms) (0//dev/rora_vote_raw)

[ CSSD]2006-10-11 00:35:36.966 [1544] >ERROR: clssnmDiskPMT: 1 of 1 voting disks unavailable (0/0/1)

[ CSSD]2006-10-11 00:35:37.043 [2058] >TRACE: clssgmClientConnectMsg: Connect from con(112a8a9f0) proc(112a8f9d0) pid(480150) proto(10:2:1:1)

[ CSSD]2006-10-11 00:35:37.960 [3343] >TRACE: clscsendx: (11145a3f0) Physical connection (111459b30) not active

[ CSSD]2006-10-11 00:35:37.051 [1] >USER: NMEVENT_SUSPEND [00][00][00]06]

Possible action: check the availability of the disk subsystem and the OS log files for reported errors related to the voting disk.
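In the voting disk excerpt, the stale-disk age (32490 ms) exceeds the logged diskTimeout (27000 ms), and the next entry reports the voting disk as unavailable. The sketch below extracts those values and compares them. Reading "stale age greater than diskTimeout" as the trigger is an interpretation of this one sample, not a documented rule. The patterns come only from the lines above, and the function name `voting_disk_findings` is hypothetical.

```python
import re

# Patterns based on the voting disk sample lines in this note.
TIMEOUT_RE = re.compile(r"diskTimeout set to \((\d+)\)ms")
STALE_RE = re.compile(r"clssnmDiskPMT: stale disk \((\d+) ms\) \((.*?)\)")
UNAVAIL_RE = re.compile(r"clssnmDiskPMT: (\d+) of (\d+) voting disks unavailable")

def voting_disk_findings(lines):
    """Collect stale-disk and unavailable-disk findings from log lines."""
    timeout_ms = None
    findings = []
    for line in lines:
        t = TIMEOUT_RE.search(line)
        if t:
            timeout_ms = int(t.group(1))
        s = STALE_RE.search(line)
        if s:
            age_ms = int(s.group(1))
            # Assumption: compare against the most recently logged diskTimeout.
            over = timeout_ms is not None and age_ms > timeout_ms
            findings.append(("stale", s.group(2), age_ms, over))
        u = UNAVAIL_RE.search(line)
        if u:
            findings.append(("unavailable", int(u.group(1)), int(u.group(2))))
    return findings

sample = [
    "[ CSSD]2006-10-11 00:35:36.956 [1801] >TRACE: clssnmHandleSync: "
    "diskTimeout set to (27000)ms",
    "[ CSSD]2006-10-11 00:35:36.956 [1544] >TRACE: clssnmDiskPMT: "
    "stale disk (32490 ms) (0//dev/rora_vote_raw)",
    "[ CSSD]2006-10-11 00:35:36.966 [1544] >ERROR: clssnmDiskPMT: "
    "1 of 1 voting disks unavailable (0/0/1)",
]
findings = voting_disk_findings(sample)
for finding in findings:
    print(finding)
```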

Trace the heartbeat: if needed, you can enable a higher level of tracing to debug the heartbeat handling. The first command below enables level 5 tracing; the second, with level 0, disables the extra tracing again. Keep in mind that this makes cssd.log grow quickly (about 4 lines are added every second).

crsctl debug log css CSSD:5

crsctl debug log css CSSD:0
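To get a feel for what 4 extra lines per second means while level 5 tracing stays enabled, the arithmetic can be sketched as follows. The rate of 4 lines per second comes from this note. The average line length of 120 bytes is purely an assumption for illustration, and both function names are hypothetical.

```python
def extra_log_lines(seconds, lines_per_second=4):
    """Extra cssd.log lines written while tracing is on (rate from this note)."""
    return seconds * lines_per_second

def estimated_bytes(lines, avg_line_bytes=120):
    """Rough size estimate; avg_line_bytes is an assumed value, not measured."""
    return lines * avg_line_bytes

per_hour = extra_log_lines(3600)
per_day = extra_log_lines(24 * 3600)
print(per_hour, per_day, estimated_bytes(per_day))
```

Under these assumptions, one day of level 5 tracing adds 345,600 lines, so it is worth switching back to level 0 as soon as the heartbeat problem has been captured.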

NOTICE: A node eviction is a symptom of another problem!
