Analysis of a RAC Brain Split (Split-Brain) Incident

Latest recommended article published on 2022-11-18 10:16:42

cuizhi4718

1. Environment
DB version: 11.2.0.4, 64-bit
OS version: AIX 6.1
2. Symptoms
ASM instance log on node 1 (node1):
Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:08:48 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:08:48 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=10.4585 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DG)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DG) thread=2 ckpt=10.4585
NOTE: SMON did instance recovery for group DG domain 1
NOTE: SMON starting instance recovery for group VOTE domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=15.26 group=2 (VOTE)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (VOTE)
NOTE: SMON successfully validated lock domain 2
NOTE: advancing ckpt for group 2 (VOTE) thread=2 ckpt=15.26
NOTE: SMON did instance recovery for group VOTE domain 2
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:21:06 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:25:08 2015
Reconfiguration started (old inc 12, new inc 14)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:25:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:51:08 2015
Reconfiguration started (old inc 14, new inc 18)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:51:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 10:15:52 2015
Reconfiguration started (old inc 18, new inc 20)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 10:15:53 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:06 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more
details.
Mon Jan 12 12:20:26 2015
Reconfiguration started (old inc 20, new inc 22)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 12:20:26 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:26 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=11.4610 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
Mon Jan 12 12:32:50 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 13:44:50 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Private Interface 'en0' configured from GPnP for use as a private interconnect.
[name='en0', type=1, ip=169.254.189.151, mac=00-90-fa-52-21-be, net=169.254.0.0/16,
mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'en2' configured from GPnP for use as a public interface.
[name='en2', type=1, ip=192.168.0.12, mac=00-90-fa-52-22-3c, net=192.168.0.0/24,
mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/grid/asm/dbs/arch
Autotune of undo retention is turned on.

Note: the ASM log on the primary node, node1, shows reconfigurations at 09:08:48, 09:21:06, 09:25:08, 09:51:08, 10:15:52, and 12:20:26 on Mon Jan 12 2015. In other words, ASM cluster membership kept changing as an ASM instance repeatedly dropped out and restarted, which points to a split-brain condition. The log also shows that en0 is the private interface (IP 169.254.189.151) and en2 is the public interface (IP 192.168.0.12).
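
Listing the reconfigurations by hand is error-prone on a long alert log. The following is a minimal Python sketch, not part of the original analysis, that extracts each reconfiguration with its incarnation numbers and member list. The function name `parse_reconfigurations` and the regular expressions are illustrative assumptions based only on the line shapes visible in the log above; the sample text is an abridged copy of that log.

```python
import re
from datetime import datetime

# Abridged lines copied from the node1 ASM log shown above.
SAMPLE = """\
Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
List of instances:
1 (myinst: 1)
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)
List of instances:
1 2 (myinst: 1)
Mon Jan 12 09:25:08 2015
Reconfiguration started (old inc 12, new inc 14)
List of instances:
1 (myinst: 1)
"""

# Alert-log timestamp line, e.g. "Mon Jan 12 09:08:48 2015".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
RECONF_RE = re.compile(r"^Reconfiguration started \(old inc (\d+), new inc (\d+)\)")
MEMBERS_RE = re.compile(r"^([\d ]+)\(myinst: (\d+)\)")


def parse_reconfigurations(text):
    """Return a list of (timestamp, old_inc, new_inc, members) tuples."""
    events = []
    last_ts = None   # most recent timestamp line seen
    pending = None   # reconfiguration waiting for its member list
    for raw in text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            last_ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        m = RECONF_RE.match(line)
        if m:
            pending = (last_ts, int(m.group(1)), int(m.group(2)))
            continue
        m = MEMBERS_RE.match(line)
        if m and pending is not None:
            members = [int(x) for x in m.group(1).split()]
            events.append((*pending, members))
            pending = None
    return events


if __name__ == "__main__":
    for ts, old, new, members in parse_reconfigurations(SAMPLE):
        print(ts.strftime("%H:%M:%S"), f"inc {old}->{new}", "members:", members)
```

A member list that alternates between `[1]` and `[1, 2]` is the pattern described above: instance 2 keeps leaving and rejoining.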

3. Analysis
ASM instance log on node 2 (node2):
Mon Jan 12 09:09:29 2015
NOTE: client exited [4719298]
Mon Jan 12 09:09:29 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 09:09:29 2015
PMON (ospid: 3867214): terminating the instance due to error 471
Mon Jan 12 09:51:49 2015
Shutting down instance (abort)
License high water mark = 2
USER (ospid: 2425590): terminating the instance
Instance terminated by USER, pid = 2425590
Mon Jan 12 09:51:50 2015
Instance shutdown complete
Mon Jan 12 10:16:27 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jan 12 12:21:07 2015
NOTE: client exited [2098012]
Mon Jan 12 12:21:07 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 12:21:09 2015
PMON (ospid: 2818726): terminating the instance due to error 481
Instance terminated by PMON, pid = 2818726
Mon Jan 12 12:33:30 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Mon Jan 12 12:38:19 2015
NOTE: client jzhprd2:jzhprd registered, osid 3146138, mbr 0x1
Mon Jan 12 12:38:51 2015
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.13)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Mon Jan 12 13:01:34 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 13:01:34 2015
NOTE: client exited [4915336]
NOTE: force a map free for map id 2
Mon Jan 12 13:01:36 2015
PMON (ospid: 3212222): terminating the instance due to error 481
Instance terminated by PMON, pid = 3212222
Mon Jan 12 13:46:02 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
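Note: the highlighted entries (shown in red in the original post) show that the ASMB process on node jzh2 started failing at 09:09:29. An ASM failure like this calls for checking the CRS and CSS logs.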
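
To line these failures up against the CRS and CSS logs, it helps to pull out only the termination events with their timestamps. The sketch below is an illustration under assumptions: the helper name `find_terminations` and the patterns are hypothetical and cover only the message format seen in the node2 log above, from which the sample lines are copied.

```python
import re

# Abridged lines copied from the node2 ASM log shown above.
SAMPLE = """\
Mon Jan 12 09:09:29 2015
PMON (ospid: 3867214): terminating the instance due to error 471
Mon Jan 12 12:21:09 2015
PMON (ospid: 2818726): terminating the instance due to error 481
Instance terminated by PMON, pid = 2818726
Mon Jan 12 13:01:36 2015
PMON (ospid: 3212222): terminating the instance due to error 481
"""

TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
TERM_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)


def find_terminations(text):
    """Return (timestamp, process, ospid, error) for each forced termination."""
    found = []
    last_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            last_ts = line  # keep the timestamp as the log printed it
            continue
        m = TERM_RE.match(line)
        if m:
            found.append((last_ts, m.group(1), int(m.group(2)), int(m.group(3))))
    return found


if __name__ == "__main__":
    for ts, proc, ospid, err in find_terminations(SAMPLE):
        print(f"{ts}: {proc} (ospid {ospid}) terminated the instance, error {err}")
```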

CRSD log on node 2 (node2):

2015-01-12 09:09:30.742: [ CSSCLNT][1]clssscConnect: gipc request failed with 13 (1a)

The connection to CSS failed. (CSS, Cluster Synchronization Services, relies on two mechanisms: the network heartbeat and the disk heartbeat.)

2015-01-12 09:09:30.742: [ CSSCLNT][1]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_)) failed, rc 13

2015-01-12 09:09:30.745: [ CRSRTI][1] CSS is not ready. Received status 3
CSS not ready

2015-01-12 09:09:30.745: [ CRSMAIN][1] First attempt: init CSS context failed. Error = 3

[ clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzh2DBG_CRSD))

The first attempt failed.

2015-01-12 09:09:30.812: [ clsdmt][515]PID for the Process [4522624], connkey 1

2015-01-12 09:09:30.812: [ clsdmt][515]Creating PID [4522624] file for home /u01/app/oracle/grid/asm host jzh2 bin crs to /u01/app/oracle/grid/asm/crs/init/

2015-01-12 09:09:30.812: [ clsdmt][515]Writing PID [4522624] to the file [/u01/app/oracle/grid/asm/crs/init/jzh2.pid]

2015-01-12 09:09:31.863: [ CRSMAIN][1] CRS Daemon Starting --> CRS is starting (the crsd service itself is fine; this is verified below)

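Note: the entries above show that the CRS service itself is fine and that CSS became unavailable at 09:09:30. CSS depends on the disk heartbeat and the network heartbeat, so the problem must lie with one of the two. Next, check ocssd.log on jzh1 to see what node 1 was doing around 2015-01-12 09:09:30.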

CSSD log on node 1 (node1):
2015-01-12 09:09:00.934: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:00.934: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:16.758: [ CSSD][2577]clssnmSetupReadLease: status 1

2015-01-12 09:09:16.762: [ CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1

2015-01-12 09:09:16.762: [ CSSD][2577]clssgmDoneQEle: re-queueing req 110617b30 status 1

2015-01-12 09:09:16.763: [ CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (111b38850), operation status 1, client status 0

2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sent 3 status msgs to all nodes

2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 0, events 0x8, state 0x0

2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 1, events 0x8, state 0x0

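Note: the CSSD process on node1 is sending status messages to all nodes. Why? After RAC starts (the node1 ASM log shows a reconfiguration at 09:08:48), each node writes its own information to the OCR and the voting disk. The master then collects this information and sends it to all nodes, telling them who the master is and how many nodes there are; the node information is recorded on the voting disk, and then the vote takes place. At this point the cluster has two members, member 0 (jzh1) and member 1 (jzh2), which means the CRSD process is fine (verified). What else does this tell us? The other node can write its information to the voting disk, so the disk heartbeat is fine (verified below).

Reading on: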
2015-01-12 09:17:02.539: [ CSSD][4376]clssscSelect: cookie accept request 110991628

2015-01-12 09:17:02.539: [ CSSD][4376]clssnmeventhndlr: gipcAssociate endp 1d2198b in container 73 type of conn gipcha

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnSetNames: hostname jzh2 privname 10.10.0.20 con 1d2198b --> connecting to jzh2, whose private IP is 10.10.0.20 (recall that the ASM log on jzh1 showed 169.254.189.151 as the private address)

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSetNodeProperties: properties node 2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 --> properties of jzh2

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnComplete: node node2 softver 11.2.0.4.0 --> software version of node2

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: Incoming connect from node 2 (node2) ninf endp 0, probendp 0, endp 1d2198b

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSendConnAck: connected to node 2, jzh2, con (1d2198b), state 0

2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: node jzh2, 2, uniqueness 1421024974, msg uniqueness 1421024974, endp 1d2198b probendp 0 endp 1d2198b

2015-01-12 09:17:03.044: [ CSSD][4376]clssnmHandleJoin: node 2 JOINING, state 0->1 ninfendp 1d2198b

2015-01-12 09:17:03.354: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:17:03.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/jzh2), LATSvalid(0), nodeInfoDHB uniqueness(1420692326) --> reading the disk heartbeat

2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(jzh2) number(2), nodeInfoDHB(1420692326), readInfo(1421024974)

2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Saving DHB uniqueness for node jzh2, number 2 latestInfo(1421024974), readInfo(1421024974), nodeInfoDHB(1420692326) --> saving the disk heartbeat information for jzh2

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Initiating sync 315891915

2015-01-12 09:17:03.754: [ CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 1, from -1, changes 20

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> sets the disk heartbeat timeout (200 s, the 11g default)

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed --> the new local and remote disk timeouts take effect once the sync completes

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 315891915

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: Ack message type (11)

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE --> node 1 is alive

2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE --> node 2 is alive

Note: node 1 (jzh1) considers both jzh1 and jzh2 alive. This is still based on the disk heartbeat, which confirms once more that the disk heartbeat is fine.

Reading on:

2015-01-12 09:24:20.392: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds --> the CSSD process on node1 is polling jzh2: the missed heartbeat has reached 50% of the fatal threshold, and jzh2 will be removed in 14.647 s (remember this number). The disk heartbeat was fine above, so why is an error reported here?

2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353 --> misstime 15.353 s (remember this number) + 14.647 s = 30 s

2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1) --> the disk timeout is now 27 s; wasn't it 200 s?

2015-01-12 09:24:23.578: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771770, LATS 706581505, lastSeqNo 19771331, uniqueness 1421024974, timestamp 1421025905/706212683 --> the disk heartbeat of node 2 (jzh2) is still being detected, so the 200 s timeout is no longer needed. "DHB has rcfg" confirms once more that the disk heartbeat is fine, but there is "no network HB". Is the network heartbeat the problem?

2015-01-12 09:24:23.618: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025863/706581544

2015-01-12 09:24:24.082: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771771, LATS 706582008, lastSeqNo 19771770, uniqueness 1421024974, timestamp 1421025905/706213190

2015-01-12 09:24:24.119: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025864/706582045 --> the disk ping appears to report an error

2015-01-12 09:24:24.398: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes --> what is node1 about to tell everyone?

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5) --> node 2 (jzh2) is about to be terminated; misstime is 31.566 s. Recall that 15.353 + 14.647 = 30 s above: 30 s is Oracle's default maximum threshold for the network heartbeat

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> the voting disk is about to be updated and the vote begins

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) more than disk timeout of 27000 after the last NHB (1421025890, 706197520) --> confirms once more that the disk heartbeat is fine

2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 1

2015-01-12 09:24:39.805: [ CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, jzh2, from the cluster in incarnation 315891916, node birth incarnation 315891915, death incarnation 315891916, stateflags 0x234000 uniqueness value 1421024974 --> node1 is evicting node 2 (jzh2)

2015-01-12 09:24:39.806: [ default][4119]kgzf_gen_node_reid2: generated reid cid=41daa0e19d0a6f84ff29b9f37a2f1a38,icin=315891908,nmn=2,lnid=315891915,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100

2015-01-12 09:24:39.806: [ CSSD][4119]clssnmrFenceSage: Fenced node jzh2, number 2, with EXADATA, handle 0

2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: req to node 2, kill time 706597731 --> node1 asks node 2 to shut down (kill)

2015-01-12 09:24:39.806: [ CSSD][4119]clssnmsendmsg: not connected to node 2 --> node 2 cannot be reached

2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: Send to node 2 failed --> node 2 has to be shut down to keep the data consistent, but the shutdown request failed

2015-01-12 09:24:39.806: [ CSSD][4119]clssnmWaitOnEvictions: Start --> the eviction begins

2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh1, number 1, current state 3, proposed state 3, current unique 1420396557, proposed unique 1420396557, prevConuni 0, birth 315891909

2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh2, number 2, current state 5, proposed state 0, current unique 1421024974, proposed unique 1421024974, prevConuni 1421024974, birth 315891915

2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2, state 5

2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2 (jzh2) left cluster --> node 2 (jzh2) has left the cluster

2015-01-12 10:11:27.825: [ CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(2)

2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: Node jzh1, number 1, is EXADATA fence capable

2015-01-12 10:11:27.825: [ CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 33

2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> the local and remote disk timeouts are set to 200 s

2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: initleader 1 newleader 1 --> node1 is now the leader, that is, the master node

Note: based on the analysis above, the disk heartbeat is fine; the problem lies in the network heartbeat.
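
The timing arithmetic in the node1 CSSD log (15.353 s already missed plus 14.647 s until removal) can be checked mechanically. The following is a small Python sketch under assumptions: `implied_misscount` is a hypothetical helper, and the two regular expressions match only the `clssnmPollingThread` message shapes quoted above, with the author's annotations stripped.

```python
import re

# Matches "... at 50% heartbeat fatal, removal in 14.647 seconds".
POLL_RE = re.compile(
    r"clssnmPollingThread: node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, "
    r"removal in ([\d.]+) seconds"
)
# Matches "... misstime 15353" (milliseconds).
MISS_RE = re.compile(r"misstime (\d+)")

# The two log lines from node1, copied without the inline annotations.
LINE_WARN = (
    "2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: "
    "node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds"
)
LINE_MISS = (
    "2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: "
    "node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353"
)


def implied_misscount(warn_line, miss_line):
    """Seconds already missed plus seconds remaining until removal."""
    removal_s = float(POLL_RE.search(warn_line).group(4))
    missed_s = int(MISS_RE.search(miss_line).group(1)) / 1000.0
    return round(missed_s + removal_s, 3)


if __name__ == "__main__":
    print(implied_misscount(LINE_WARN, LINE_MISS))  # prints 30.0
```

The sum matches the 30 s network heartbeat threshold discussed above, which is why the eviction is attributed to missed network heartbeats rather than to the disk heartbeat.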
CSSD log on node 2 (node2):

2015-01-12 09:25:19.224: [ CSSD][1]clssgmSuspendAllGrocks: done

2015-01-12 09:25:19.224: [ CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State val 2, from 5, changes 13

2015-01-12 09:25:19.224: [ CSSD][1]clssgmUpdateEventValue: ConnectedNodes val 315891915, changes 5

2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(315891915)

2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(315891915)

2015-01-12 09:25:19.224: [ CSSD][1]clssgmStartNMMon: completed node cleanup

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSendSync: syncSeqNo(315891916), indicating EXADATA fence initialization complete

2015-01-12 09:25:19.224: [ CSSD][4119]List of nodes that have ACKed my sync: 2

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmWaitForAcks: done, syncseq(315891916), msg type(11)

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion:node2 product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, jzh1, misstime(30000) state(5) --> node 2 (jzh2) runs its own sync against node1; misstime has reached 30 s (the network heartbeat threshold), so it wants to terminate node1

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> waiting for the vote

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckSplit: Node 1, jzh1, is alive, DHB (1421025877, 706595081) more than disk timeout of 27000 after the last NHB (1421025847, 706565177) --> the disk heartbeat of node 1 (jzh1) is fine

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 2 --> the local node's number

2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1 --> node 1 (jzh1) survives

2015-01-12 09:25:19.224: [ CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, jzh2, is smaller than cohort of 1 nodes led by node 1, jzh1, based on map type 2 --> the local node, node 2 (jzh2), aborts; node 1 (jzh1) is the leader

2015-01-12 09:25:19.224: [ CSSD][4119]###################################

2015-01-12 09:25:19.224: [ CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread --> CSSD aborts from the clssnmRcfgMgrThread thread
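
The decision in the CSSNM00008 message, where two cohorts of one node each compete and the one led by node 1 survives, can be illustrated with a short Python sketch. This is a simplification and an assumption, not Oracle's implementation: it models only the commonly described 11.2 behavior in which the larger cohort survives and, on a tie, the cohort containing the lowest node number wins. The function name `surviving_cohort` is hypothetical.

```python
def surviving_cohort(cohorts):
    """Pick the surviving cohort from a list of sets of node numbers.

    Assumed rule: the largest cohort survives; if sizes are equal,
    the cohort that contains the lowest node number survives.
    """
    # Larger size ranks higher; for equal sizes, a smaller minimum
    # node number ranks higher, hence the negation.
    return max(cohorts, key=lambda c: (len(c), -min(c)))


if __name__ == "__main__":
    # The situation in the logs: jzh1 (node 1) and jzh2 (node 2)
    # can no longer reach each other over the interconnect.
    cohorts = [{1}, {2}]
    survivor = surviving_cohort(cohorts)
    for cohort in cohorts:
        action = "survives" if cohort == survivor else "aborts"
        print(sorted(cohort), action)
```

Under this rule node 2 always loses a two-node tie, which matches what the logs show: jzh2 is the node that aborts each time the network heartbeat is lost.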