本帖最后由 ftc007 于 2012-6-7 09:53 编辑
以上是这个过程的CRS日志:
2012-06-05 18:06:55.023 --开始关机
[cssd(10293)]CRS-1654:Clean up of CRSD resources finished successfully.
2012-06-05 18:06:55.024
[cssd(10293)]CRS-1655:CSSD on node oracle02 detected a problem and started to shutdown. --关闭本节点的crs
2012-06-05 18:06:55.035
[/opt/app/11.2.0/grid/bin/orarootagent.bin(10867)]CRS-5822:Agent '/opt/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117
{0:1:4} in /opt/app/11.2.0/grid/log/oracle02/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-06-05 18:06:55.034
[/opt/app/11.2.0/grid/bin/oraagent.bin(10864)]CRS-5822:Agent '/opt/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117
{0:2:6} in /opt/app/11.2.0/grid/log/oracle02/agent/crsd/oraagent_grid/oraagent_grid.log.
2012-06-05 18:06:55.117
[ohasd(9758)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle02'.
2012-06-05 18:06:55.195
[cssd(10293)]CRS-1660:The CSS daemon shutdown has completed
2012-06-05 18:06:55.512
[ohasd(9758)]CRS-2765:Resource 'ora.diskmon' has failed on server 'oracle02'.
2012-06-05 18:06:55.531
[ohasd(9758)]CRS-2765:Resource 'ora.evmd' has failed on server 'oracle02'.
2012-06-05 18:06:55.539
[ohasd(9758)]CRS-2765:Resource 'ora.ctssd' has failed on server 'oracle02'.
2012-06-05 18:06:56.154
[crsd(11278)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109
in /opt/app/11.2.0/grid/log/oracle02/crsd/crsd.log.
2012-06-05 18:06:56.528
[ohasd(9758)]CRS-2765:Resource 'ora.cssd' has failed on server 'oracle02'.
2012-06-05 18:06:56.546
[ohasd(9758)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle02'.
2012-06-05 18:06:56.558
[ohasd(9758)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'oracle02'.
2012-06-05 18:06:57.323
[ctssd(11289)]CRS-2402:The Cluster Time Synchronization Service aborted on host oracle02. Details at (:ctss_css_init1
in /opt/app/11.2.0/grid/log/oracle02/ctssd/octssd.log.
2012-06-05 18:06:58.279
[/opt/app/11.2.0/grid/bin/oraagent.bin(10199)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006
" in "/opt/app/11.2.0/grid/log/oracle02/agent/ohasd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:06:58.466
[ohasd(9758)]CRS-2765:Resource 'ora.asm' has failed on server 'oracle02'.
2012-06-05 18:07:06.870
[cssd(11333)]CRS-1713:CSSD daemon is started in clustered mode --准备启动crs,在启动之后,它也需要重新检测各节点的健康状况,下面5行,就是这个过程,但比较久,不正常
2012-06-05 18:07:21.817
[ohasd(9758)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'oracle02'.
2012-06-05 18:07:22.548
[cssd(11333)]CRS-1707:Lease acquisition for node oracle02 number 2 completed
2012-06-05 18:07:23.872
[cssd(11333)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL3; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
2012-06-05 18:07:23.888
[cssd(11333)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL2; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
2012-06-05 18:07:23.898
[cssd(11333)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL1; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
^[[A^[[A2012-06-05 18:09:26.767
[cssd(11333)]CRS-1625:Node oracle01, number 1, was manually shut down --节点2来到这里才意识到节点1崩溃,这个好像有点奇怪,其实主要是这里之前花的时间比较多,整个过程花了三分钟,后面都比较快
2012-06-05 18:09:30.670
[cssd(11333)]CRS-1601:CSSD Reconfiguration complete. Active nodes are oracle02 .
2012-06-05 18:10:30.793
[cssd(11333)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011
in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log
2012-06-05 18:10:34.327
[cssd(11623)]CRS-1713:CSSD daemon is started in clustered mode
2012-06-05 18:10:36.708
[ohasd(9758)]CRS-2765:Resource 'ora.diskmon' has failed on server 'oracle02'.
2012-06-05 18:10:50.103
[cssd(11623)]CRS-1707:Lease acquisition for node oracle02 number 2 completed
2012-06-05 18:10:51.524
[cssd(11623)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL3; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
2012-06-05 18:10:51.532
[cssd(11623)]CRS-1625:Node oracle01, number 1, was manually shut down
2012-06-05 18:10:51.552
[cssd(11623)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL2; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
2012-06-05 18:10:51.576
[cssd(11623)]CRS-1605:CSSD voting file is online: ORCL:CRSVOL1; details in /opt/app/11.2.0/grid/log/oracle02/cssd/ocssd.log.
2012-06-05 18:10:59.113
[cssd(11623)]CRS-1601:CSSD Reconfiguration complete. Active nodes are oracle02 .
2012-06-05 18:11:01.715
[ctssd(11714)]CRS-2407:The new Cluster Time Synchronization Service reference node is host oracle02.
2012-06-05 18:11:01.714
[ctssd(11714)]CRS-2401:The Cluster Time Synchronization Service started on host oracle02.
2012-06-05 18:11:02.428
[/opt/app/11.2.0/grid/bin/oraagent.bin(10199)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006
" in "/opt/app/11.2.0/grid/log/oracle02/agent/ohasd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:11:02.680
[/opt/app/11.2.0/grid/bin/oraagent.bin(10199)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006
" in "/opt/app/11.2.0/grid/log/oracle02/agent/ohasd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:11:22.096
[crsd(11870)]CRS-1012:The OCR service started on node oracle02.
2012-06-05 18:11:22.897
[evmd(11286)]CRS-1401:EVMD started on node oracle02.
2012-06-05 18:11:24.284
[crsd(11870)]CRS-1201:CRSD started on node oracle02.
2012-06-05 18:11:25.832
[/opt/app/11.2.0/grid/bin/oraagent.bin(11981)]CRS-5016:Process "/opt/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/opt/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010
" in "/opt/app/11.2.0/grid/log/oracle02/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:11:25.842
[/opt/app/11.2.0/grid/bin/oraagent.bin(11981)]CRS-5016:Process "/opt/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/opt/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010
" in "/opt/app/11.2.0/grid/log/oracle02/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:11:27.438
[/opt/app/11.2.0/grid/bin/oraagent.bin(11981)]CRS-5016:Process "/opt/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/opt/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/opt/app/11.2.0/grid/log/oracle02/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-06-05 18:11:27.575
[crsd(11870)]CRS-2772:Server 'oracle02' has been assigned to pool 'Free'.
从日志来分析,感觉好像是共享存储的切换比较慢造成的,以下是ocssd的日志信息:
2012-06-05 18:09:25.930: [ CSSD][1086961984]clssgmDiscEndpcl: gipcDestroy 0x1faa
2012-06-05 18:09:26.151: [ CSSD][1118615872]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2012-06-05 18:09:26.767: [ CSSD][1107577152]clssnmvDHBValidateNCopy: node 1, oracle01, has a disk HB, but no network HB, DHB has
rcfg 234372137, wrtcnt, 26905, LATS 4294594190, lastSeqNo 26900, uniqueness 1338889661, timestamp 1338890966/536754
2012-06-05 18:09:26.767: [ CSSD][1107577152]clssnmReadDskHeartbeat:manual shutdown of nodename oracle01, nodenum 1 epoch 13388909
66 msec 536754
2012-06-05 18:09:27.153: [ CSSD][1118615872]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2012-06-05 18:09:28.155: [ CSSD][1118615872]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2012-06-05 18:09:29.157: [ CSSD][1118615872]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2012-06-05 18:09:29.661: [ CSSD][1121769792]clssnmSendingThread: sending join msg to all nodes
2012-06-05 18:09:29.661: [ CSSD][1121769792]clssnmSendingThread: sent 4 join msgs to all nodes
2012-06-05 18:09:29.940: [ CSSD][1086961984]clssscSelect: cookie accept request 0x2aaaac024f00
2012-06-05 18:09:29.940: [ CSSD][1086961984]clssgmAllocProc: (0x1cc455c0) allocated
2012-06-05 18:09:29.941: [ CSSD][1086961984]clssgmClientConnectMsg: properties of cmProc 0x1cc455c0 - 1,2,3,4,5
2012-06-05 18:09:29.941: [ CSSD][1086961984]clssgmClientConnectMsg: Connect from con(0x2039) proc(0x1cc455c0) pid(11200) version
11:2:1:4, properties: 1,2,3,4,5
2012-06-05 18:09:29.941: [ CSSD][1086961984]clssgmClientConnectMsg: msg flags 0x0000
2012-06-05 18:09:29.942: [ CSSD][1086961984]clssscSelect: cookie accept request 0x1cc455c0
2012-06-05 18:09:29.942: [ CSSD][1086961984]clssscevtypSHRCON: getting client with cmproc 0x1cc455c0
2012-06-05 18:09:29.942: [ CSSD][1086961984]clssgmRegisterClient: proc(3/0x1cc455c0), client(1/0x1d146830)
2012-06-05 18:09:29.943: [ CSSD][1086961984]clssgmJoinGrock: global grock CRF- new client 0x1d146830 with con 0x2068, requested n
um -1, flags 0x4000600
2012-06-05 18:09:29.943: [ CSSD][1086961984]clssgmJoinGrock: ignoring grock join for client not requiring fencing until group inf
ormation has been received from the master; group name CRF-, member number -1, flags 0x4000600
2012-06-05 18:09:29.943: [ CSSD][1086961984]clssgmDiscEndpcl: gipcDestroy 0x2068
2012-06-05 18:09:29.943: [ CSSD][1086961984]clssgmDeadProc: proc 0x1cc455c0
2012-06-05 18:09:29.944: [ CSSD][1086961984]clssgmDestroyProc: cleaning up proc(0x1cc455c0) con(0x2039) skgpid ospid 11200 with
0 clients, refcount 0
2012-06-05 18:09:29.944: [ CSSD][1086961984]clssgmDiscEndpcl: gipcDestroy 0x2039
2012-06-05 18:09:30.159: [ CSSD][1118615872]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2012-06-05 18:09:30.662: [ CSSD][1123346752]clssnmRcfgMgrThread: Local Join
2012-06-05 18:09:30.662: [ CSSD][1123346752]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2012-06-05 18:09:30.662: [ CSSD][1123346752]clssnmLocalJoinEvent: set curtime (4294598080) for my node
2012-06-05 18:09:30.662: [ CSSD][1123346752]clssnmLocalJoinEvent: scanning 32 nodes
2012-06-05 18:09:30.662: [ CSSD][1123346752]clssnmLocalJoinEvent: Node oracle01, number 1, was shut down
2012-06-05 18:09:30.664: [ CSSD][1123346752]clssnmLocalJoinEvent: Starting initial cluster reconfig
2012-06-05 18:09:30.664: [ CSSD][1123346752]clssnmDoSyncUpdate: Initiating sync 0
2012-06-05 18:09:30.664: [ CSSD][1123346752]clssscCompareSwapEventValue: changed NMReconfigInProgress val 2, from -1, changes 1
2012-06-05 18:09:30.664: [ CSSD][1123346752]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 2
00000
2012-06-05 18:09:30.664: [ CSSD][1123346752]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will ta
ke effect when the sync is completed.
从以上描述来看,节点2等待了193秒后才找到共享磁盘,然后才成功把CRS服务起来,感觉可能是这个方向有异常,大家意见怎样?楼上的朋友怎么看?