Symptoms:
-- Check CRS status
#/u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[root@ntrac1 ~]# /u01/app/oracle/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  OFFLINE
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE                               STARTING
ora.cssdmonitor
      1        ONLINE  ONLINE       ntrac1
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       ntrac1
ora.gpnpd
      1        ONLINE  ONLINE       ntrac1
ora.mdnsd
      1        ONLINE  ONLINE       ntrac1
-- Check the Grid Infrastructure alert log
#tail -100f $GRID_HOME/log/ntrac1/alertntrac1.log
2014-08-06 12:29:59.627:
[/u01/app/oracle/grid/bin/cssdagent(32145)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/oracle/grid/log/ntrac1/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2014-08-06 12:29:59.628:
[cssd(32230)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/oracle/grid/log/ntrac1/cssd/ocssd.log
2014-08-06 12:29:59.628:
[cssd(32230)]CRS-1603:CSSD on node ntrac1 shutdown by user.
2014-08-06 12:30:04.791:
[ohasd(20111)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'ntrac1'.
2014-08-06 12:30:06.569:
[cssd(36385)]CRS-1713:CSSD daemon is started in clustered mode
2014-08-06 12:30:08.191:
[ohasd(20111)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2014-08-06 12:30:22.814:
[cssd(36385)]CRS-1707:Lease acquisition for node ntrac1 number 1 completed
2014-08-06 12:30:24.103:
[cssd(36385)]CRS-1605:CSSD voting file is online: /dev/mapper/CML_OCR02; details in /u01/app/oracle/grid/log/ntrac1/cssd/ocssd.log.
2014-08-06 12:30:24.111:
[cssd(36385)]CRS-1605:CSSD voting file is online: /dev/mapper/CML_OCR03; details in /u01/app/oracle/grid/log/ntrac1/cssd/ocssd.log.
2014-08-06 12:30:24.122:
[cssd(36385)]CRS-1605:CSSD voting file is online: /dev/mapper/CML_OCR01; details in /u01/app/oracle/grid/log/ntrac1/cssd/ocssd.log.
-- Check the ocssd log
#tail -100f $GRID_HOME/log/ntrac1/cssd/ocssd.log
2014-08-06 14:45:04.140: [ CSSD][483813120]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2014-08-06 14:45:04.623: [ CSSD][488544000]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-06 14:45:04.968: [ CSSD][502802176]clssnmvDHBValidateNcopy: node 2, ntrac2, has a disk HB, but no network HB, DHB has rcfg 301688209, wrtcnt, 3361203, LATS 5085774, lastSeqNo 3361200, uniqueness 1406193376, timestamp 1407307504/1113874084
2014-08-06 14:45:04.968: [ CSSD][502802176]clssnmvDHBValidateNcopy: node 3, ntrac3, has a disk HB, but no network HB, DHB has rcfg 301688209, wrtcnt, 3360733, LATS 5085774, lastSeqNo 3360730, uniqueness 1406193385, timestamp 1407307504/1113864924
2014-08-06 14:45:05.105: [ CSSD][498054912]clssnmvDHBValidateNcopy: node 2, ntrac2, has a disk HB, but no network HB, DHB has rcfg 301688209, wrtcnt, 3361205, LATS 5085904, lastSeqNo 3361202, uniqueness 1406193376, timestamp 1407307504/1113874544
2014-08-06 14:45:05.105: [ CSSD][498054912]clssnmvDHBValidateNcopy: node 3, ntrac3, has a disk HB, but no network HB, DHB has rcfg 301688209, wrtcnt, 3360735, LATS 5085904, lastSeqNo 3360732, uniqueness 1406193385, timestamp 1407307504/1113864974
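The repeated "has a disk HB, but no network HB" messages are the key clue: this node can see the other members' heartbeats on the voting disks but cannot reach them over the private interconnect. As a rough sketch (the function name nodes_without_network_hb is made up for illustration, and the pattern assumes the log line format shown above), the affected peers can be pulled out of ocssd.log like this:

```shell
#!/bin/sh
# Hypothetical helper, not part of Oracle Grid Infrastructure.
# Reads ocssd.log text on stdin and prints "<node number> <node name>"
# once for each peer that has a disk heartbeat but no network heartbeat.
nodes_without_network_hb() {
    grep 'has a disk HB, but no network HB' |
        sed -n 's/.*clssnmvDHBValidateNcopy: node \([0-9][0-9]*\), \([^,]*\),.*/\1 \2/p' |
        sort -u
}

# Example call (log path taken from the tail command above):
# nodes_without_network_hb < $GRID_HOME/log/ntrac1/cssd/ocssd.log
```

If every other cluster member shows up in the output, the local node's private network is the most likely suspect, rather than any single peer.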
Resolution:
-- Check network connectivity; problem found
Pinging the failed node's private IP address from the other nodes got no reply.
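The initial diagnosis was a network problem, possibly a faulty NIC. It turned out that the cable on one of the failed node's private-network NICs was not plugged in. After plugging it in and restarting CRS, the problem was gone.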
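The ping check can be scripted so that it is quick to repeat after the cable is fixed. This is only a sketch: check_private_ips and the PING_CMD override are hypothetical names, the default ping options assume Linux ping, and the addresses in the example call are placeholders, not the real private IPs of this cluster.

```shell
#!/bin/sh
# PING_CMD is a hypothetical override for the probe command;
# the default sends 2 probes with a 2-second wait (Linux ping options).
PING_CMD="${PING_CMD:-ping -c 2 -W 2}"

# Ping each private IP given as an argument and report the result.
# Returns non-zero if at least one address did not answer.
check_private_ips() {
    rc=0
    for ip in "$@"; do
        if $PING_CMD "$ip" >/dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip UNREACHABLE"
            rc=1
        fi
    done
    return $rc
}

# Example call (placeholder addresses):
# check_private_ips 192.168.10.1 192.168.10.2 192.168.10.3
```

Running it from each node in turn shows whether one node is cut off from all the others, which was the case here.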
Reference:
http://sqlsewer.blogspot.com/2013/07/oracle-crs-is-not-starting-has-disk-hb.html