开始检测voting disk上的信息
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: node 1, vrh1, state 5 with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2
发现其他子集群,包含1号节点且1号节点为该子集群的leader,为最小子集群;3号与2号节点组成最大子集群,2号节点为leader节点
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Start
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Evicting node 1, vrh1, birth 9, death 10,
impendingrcfg 1, stateflags 0x40d
发起对1号节点的驱逐
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmSendShutdown: req to node 1, kill time 443294
[ CSSD]2011-04-23 17:13:18.339 [3032460176] >TRACE: clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7eaf220), probe((nil))
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: node 1, vrh1, undead 1
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmCheckKillStatus: Node 1, vrh1, down, LATS(443144),timeout(150)
clssnmCheckKillStatus检查1号节点是否down了
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(10)
[ CSSD]2011-04-23 17:13:18.341 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(2)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 1, state (5/0) unique (1303592601/1303592601) prevConuni(1303592601) birth (9/9) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmDeactivateNode: node 1 (vrh1) left cluster
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new)
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: SYNC(10) from node(3) completed
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:13:18.342 [89033616] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2011-04-23 17:13:18.347 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15)
[ CSSD]2011-04-23 17:13:18.348 [3032460176] >TRACE: clssnmDoSyncUpdate: Sync 10 complete!
[ CSSD]2011-04-23 17:13:18.350 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (10)
[ CSSD]2011-04-23 17:13:18.350 [3021970320] >USER: NMEVENT_RECONFIG [00][00][00][0c]
[ CSSD]2011-04-23 17:13:18.351 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2011-04-23 17:13:18.352 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9)
[ CSSD]2011-04-23 17:13:18.353 [3063929744] >TRACE: clssgmDispatchCMXMSG(): got message type(7) src(2) incarn(10) during incarn(9/9)
[ CSSD]2011-04-23 17:13:18.354 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
.........................省略若干行.........................
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9)
[ CSSD]2011-04-23 17:13:18.357 [3021970320] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 10
[ CSSD]2011-04-23 17:13:18.366 [3063929744] >TRACE: clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa
[ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(68) incarn 10
[ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmPeerListener: connects done (2/2)
[ CSSD]2011-04-23 17:13:18.369 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2)
更新阶段
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes
[ CSSD]CLSS-3001: local node number 3, master node number 2
[ CSSD]2011-04-23 17:13:18.372 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(10), with status(1)
另一场景为1号节点未加入集群,2号节点的网络失败,因2号节点的member number较小故其通过voting disk向3号节点发起驱逐
以下为2号节点的ocssd.log日志
[ CSSD]2011-04-23 17:41:48.643 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 29.890 seconds
[ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30110
[ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:41:50.132 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 28.890 seconds
[ CSSD]2011-04-23 17:42:10.533 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 14.860 seconds
[ CSSD]2011-04-23 17:42:11.962 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 13.860 seconds
[ CSSD]2011-04-23 17:42:23.523 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 5.840 seconds
[ CSSD]2011-04-23 17:42:24.989 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 4.840 seconds
[ CSSD]2011-04-23 17:42:26.423 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 3.840 seconds
[ CSSD]2011-04-23 17:42:27.890 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 2.840 seconds
[ CSSD]2011-04-23 17:42:29.382 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 1.840 seconds
[ CSSD]2011-04-23 17:42:30.832 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 0.830 seconds
[ CSSD]2011-04-23 17:42:32.020 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh3 (3), flags 0x040d, state 3, wt4c 0
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[13] sync[13]
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 3, vrh3, misstime(60000) state(5)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:42:32.021 [5073104] >USER: NMEVENT_SUSPEND [00][00][00][0c]
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2011-04-23 17:42:32.022 [117574544] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2
检查voting disk后发现子集群3为最小"子集群"(3号节点的node number较2号大);2号节点为最大子集群
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Start [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmSendShutdown: req to node 3, kill time 1643084
发起对3号节点的驱逐和shutdown request
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmDiscHelper: vrh3, node(3) connection failed, con (0xb7e79bb0), probe((nil))
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: node 3, vrh3, undead 1
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmCheckKillStatus: Node 3, vrh3, down, LATS(1642874),timeout(210)
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 1, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 3, state (5/0) unique (1303591326/1303591326) prevConuni(1303591326) birth (3/3) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmDeactivateNode: node 3 (vrh3) left cluster
[ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: SYNC(13) from node(2) completed
[ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15)
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmDoSyncUpdate: Sync 13 complete!
[ CSSD]2011-04-23 17:42:32.024 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (13)
[ CSSD]2011-04-23 17:42:32.024 [3021970320] >USER: NMEVENT_RECONFIG [00][00][00][04]
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(crs_version) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
................省略若干行..............
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(#CSS_CLSSOMON) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmEstablishConnections: 1 nodes in cluster incarn 13
[ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerDeactivate: node 3 (vrh3), death 13, state 0x0 connstate 0xf
[ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerListener: connects done (1/1)
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 13 is node(2) birth(2)
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 13 with 1 nodes
[ CSSD]CLSS-3001: local node number 2, master node number 2
完成reconfiguration
[ CSSD]2011-04-23 17:42:32.027 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(13), with status(1)
以下为3号节点的ocssd.log日志:
[ CSSD]2011-04-23 17:42:33.204 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 1.867300e-268artbeat fatal, eviction in 29.360 seconds
[ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1039, misstime 30640
[ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:42:55.168 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 1.867300e-268artbeat fatal, eviction in 14.330 seconds
[ CSSD]2011-04-23 17:43:08.182 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 5.310 seconds
[ CSSD]2011-04-23 17:43:09.661 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 4.300 seconds
[ CSSD]2011-04-23 17:43:11.144 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 3.300 seconds
[ CSSD]2011-04-23 17:43:12.634 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 2.300 seconds
[ CSSD]2011-04-23 17:43:14.053 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 1.300 seconds
[ CSSD]2011-04-23 17:43:15.467 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 0.300 seconds
[ CSSD]2011-04-23 17:43:15.911 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh2 (2), flags 0x040f, state 3, wt4c 0
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ALIVE
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[29] sync[13]
[ CSSD]2011-04-23 17:43:15.912 [8136912] >USER: NMEVENT_SUSPEND [00][00][00][0c]
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 2, vrh2, misstime(60010) state(5)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE
[ CSSD]2011-04-23 17:43:15.913 [89033616] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(13)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain. [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: : my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1)