(1) ocssd.log (10gR2 version)
Node 1:
[CSSD]2013-11-12 12:53:08.286 [3053439888] >TRACE: clssnmSendingThread: sending status msg to all nodes
[CSSD]2013-11-12 12:53:08.287 [3053439888] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[CSSD]2013-11-12 12:53:08.437 [3063929744] >WARNING: clssnmPollingThread: node *** (2) at 50% heartbeat fatal, eviction in 29.410 seconds seedhbimpd 0
[CSSD]2013-11-12 12:53:08.437 [3063929744] >TRACE: clssnmPollingThread: node *** (2) is impending reconfig, flag 1037, misstime 30590
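This part of the log clearly shows the clssnmSendingThread thread sending network heartbeats to all nodes.
About 30 s after node 1's private network failed, the clssnmPollingThread thread found that node 2 had been missing network heartbeats continuously for a period of time (50% of misscount). If the problem persists, a node eviction will take place 29.410 s later.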
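The 29.410 s countdown can be reproduced from the misstime value in the same log entries. The following is a minimal sketch, assuming that misscount is 60 s (twice the roughly 30 s at which the 50% warning appears) and that misstime is reported in milliseconds. The function names are hypothetical helpers, not part of any Oracle tool.

```python
MISSCOUNT_S = 60  # assumption: misscount of this test cluster, in seconds


def remaining_ms(misscount_s: int, misstime_ms: int) -> int:
    """Milliseconds left before eviction, assuming misstime is in ms."""
    return misscount_s * 1000 - misstime_ms


def percent_missed(misscount_s: int, misstime_ms: int) -> float:
    """Share of misscount that has already elapsed without a heartbeat."""
    return 100.0 * misstime_ms / (misscount_s * 1000)


if __name__ == "__main__":
    # misstime 30590 is the value logged on node 1 for node 2.
    left = remaining_ms(MISSCOUNT_S, 30590)
    print(f"eviction in {left / 1000:.3f} seconds")  # eviction in 29.410 seconds
    print(f"{percent_missed(MISSCOUNT_S, 30590):.1f}% of misscount elapsed")
```

Under these assumptions, 60000 ms minus 30590 ms leaves 29410 ms, which matches the "eviction in 29.410 seconds" text in the warning.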
Node 2:
[CSSD]2013-11-12 12:53:03.941 [3063929744] >TRACE: clssnmSendingThread: sending status msg to all nodes
[CSSD]2013-11-12 12:53:03.941 [3063929744] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[CSSD]2013-11-12 12:53:06.241 [3074419600] >WARNING: clssnmPollingThread: node *** (1) at 50% heartbeat fatal, eviction in 29.240 seconds seedhbimpd 0
[CSSD]2013-11-12 12:53:06.241 [3074419600] >TRACE: clssnmPollingThread: node *** (1) is impending reconfig, flag 1037, misstime 30760
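As shown above, the messages on node 2 closely resemble those on node 1. This is expected: in a two-node cluster, when the private network of one node fails, each node ends up complaining that the other has no network heartbeat. Once the network heartbeat has been missing for the full misscount period (60 s in this test cluster), deciding which node to evict requires a further look at the ocssd.log on both nodes.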
Node 1:
[CSSD]2013-11-12 12:53:23.437 [3063929744] >WARNING: clssnmPollingThread: node *** (2) at 75% heartbeat fatal, eviction in 14.410 seconds *******
[CSSD]2013-11-12 12:53:37.436 [3063929744] >WARNING: clssnmPollingThread: node *** (2) at 90% heartbeat fatal, eviction in 0.410 seconds *****
[CSSD]2013-11-12 12:53:37.857 [3063929744] >TRACE: clssnmPollingThread: Eviction started for node *** (2), flags 0x040d, state 3, wt4c 0 seedhbimpd
Node 2:
[CSSD]2013-11-12 12:53:21.251 [3074419600] >WARNING: clssnmPollingThread: node *** (1) at 75% heartbeat fatal, eviction in 14.240 seconds *******
[CSSD]2013-11-12 12:53:35.251 [3074419600] >WARNING: clssnmPollingThread: node *** (1) at 90% heartbeat fatal, eviction in 0.240 seconds *******
[CSSD]2013-11-12 12:53:35.501 [3074419600] >TRACE: clssnmPollingThread: Eviction started for node *** (1), flags 0x040d, state 3, wt4c 0 *******
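The messages above are interesting: node 1 declares that it will evict node 2, while node 2 declares that it will evict node 1. However, given the function of this thread as described earlier, this simply means that the thread has done what it is supposed to do. Next, we look at how the cluster reconfigures itself and which node actually leaves the cluster.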
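To cross-check the warnings, each "heartbeat fatal" entry can be turned into a projected eviction time by adding the remaining seconds to the entry's timestamp. The sketch below does this for node 1's three warnings. The regular expression is an assumption fitted to the cleaned-up lines shown in this section, not an official description of the ocssd.log format, and `projected_evictions` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

# Cleaned-up copies of node 1's warnings from the excerpt (node name masked).
SAMPLE = [
    "[CSSD]2013-11-12 12:53:08.437 [3063929744] >WARNING: clssnmPollingThread: "
    "node *** (2) at 50% heartbeat fatal, eviction in 29.410 seconds",
    "[CSSD]2013-11-12 12:53:23.437 [3063929744] >WARNING: clssnmPollingThread: "
    "node *** (2) at 75% heartbeat fatal, eviction in 14.410 seconds",
    "[CSSD]2013-11-12 12:53:37.436 [3063929744] >WARNING: clssnmPollingThread: "
    "node *** (2) at 90% heartbeat fatal, eviction in 0.410 seconds",
]

# Assumed line layout: timestamp, node number in parentheses, percentage,
# and the remaining time before eviction.
PATTERN = re.compile(
    r"\[CSSD\]\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r".*?\((?P<node>\d+)\) at (?P<pct>\d+)% heartbeat fatal, "
    r"eviction in (?P<left>\d+\.\d+) seconds"
)


def projected_evictions(lines):
    """Return (node, percent, projected eviction time) for each warning."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a heartbeat warning
        logged_at = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        deadline = logged_at + timedelta(seconds=float(match["left"]))
        results.append((int(match["node"]), int(match["pct"]), deadline))
    return results


if __name__ == "__main__":
    for node, pct, deadline in projected_evictions(SAMPLE):
        print(f"node {node} at {pct}%: projected eviction at {deadline.time()}")
```

For these three lines the projected times agree to within about a millisecond (roughly 12:53:37.847), shortly before the "Eviction started" entry logged at 12:53:37.857 on node 1.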
Node 1:
[CSSD]2013-11