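Applies to:
Oracle Server - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.
Purpose:
Rebootless fencing was introduced in 11.2.0.2 Grid Infrastructure. When an eviction occurs, it attempts to stop GI gracefully on the evicted node instead of rebooting the node. If rebootless fencing fails, the evicted node is rebooted. This document lists common causes of rebootless fencing failure.
Details:
1. A resource fails to stop.
If one or more resources cannot be stopped, rebootless fencing fails and the node is rebooted.
In this example, rebootless fencing fails on node 2 after a split brain, and node 2 is rebooted:
On the evicted node, <GI_HOME>/log/<node>/alert<node>.log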
..
2012-09-11 12:04:34.363
[cssd(18834)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.020 seconds
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-09-11 12:04:36.399
[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586
[crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588
[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed. ##>>user resource fails to be cleaned
2012-09-11 12:04:37.042
[ohasd(16821)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-09-11 12:04:37.052
[/ocw/grid/bin/scriptagent.bin(27696)]CRS-5822:Agent '/ocw/grid/bin/scriptagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:10} in /ocw/grid/log/racnode2/agent/crsd/scriptagent_oracle/scriptagent_oracle.log.
2012-09-11 12:04:37.062
[ohasd(16821)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode2'. ##>>node rebooted after this message, in some cases, this message won't be there
2012-09-11 12:10:47.356
[ohasd(16677)]CRS-2112:The OLR service started on node racnode2.
2012-09-11 12:10:47.521
[ohasd(16677)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-09-11 12:10:47.539
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140 ##>>reboot advisory shows both cssdagent and cssdmonitor took the action to reboot
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.594
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.605
[ohasd(16677)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
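The CRS-8011 reboot advisory lines in the alert log show which component requested the reboot. As a minimal sketch, not an Oracle-provided tool, they could be extracted with a short Python script. The regular expression and the function name reboot_advisories are illustrative assumptions derived only from the sample lines in this note; other versions may format the message differently.

```python
import re

# Assumed layout of a CRS-8011 line, taken from the sample alert log in this note.
# Whitespace is matched loosely because copied logs often lose or gain spaces.
ADVISORY = re.compile(
    r"CRS-8011:\s*reboot advisory message from host:\s*(?P<host>[^,\s]+),\s*"
    r"component:\s*(?P<component>[^,\s]+),\s*with time stamp:\s*(?P<stamp>\S+)"
)


def reboot_advisories(lines):
    """Return a (host, component, time stamp) tuple for each CRS-8011 line."""
    found = []
    for line in lines:
        match = ADVISORY.search(line)
        if match:
            found.append(
                (match.group("host"), match.group("component"), match.group("stamp"))
            )
    return found


# Sample input copied from the alert log excerpt in this note.
sample = [
    "[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, "
    "component: cssagent, with time stamp: L-2012-09-11-12:04:37.140",
    "[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: "
    "need to reboot, unexpected failure 8 received from CSS",
    "[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, "
    "component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139",
]

for host, component, stamp in reboot_advisories(sample):
    print(host, component, stamp)
```

To run it against a real alert log, pass an open file object instead of the sample list. Seeing both cssagent and cssmonit in the result corresponds to the case where both cssdagent and cssdmonitor took the reboot action.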
When a resource cannot be stopped, cssdagent, cssdmonitor, or both will attempt to reboot the node. Sample logs follow.
<GI_HOME>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:37.035: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnSyncComplete: posting omon
<GI_HOME>/agent/ohasd/oracssdagent_root/oracssdagent_root.log
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
Because a CRSD resource (a user resource) could not be stopped, crsd.log is a good starting point for further debugging.
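Before opening crsd.log it helps to know which resource to search for. The CRS-5833 message in the alert log names the resource whose cleanup failed. The following Python sketch is only an illustration: the helper names failed_cleanups and mentions are hypothetical, and it assumes that the first whitespace-separated token inside the quotes is the resource name (in the sample, 'zDRMON.sh.racnode2 1 1').

```python
import re

# Assumed layout of a CRS-5833 line, taken from the sample alert log in this note.
CLEAN_FAILED = re.compile(r"CRS-5833:\s*Cleaning resource '([^']+)' failed")


def failed_cleanups(alert_lines):
    """Return the distinct resource names reported by CRS-5833, in order of appearance."""
    names = []
    for line in alert_lines:
        match = CLEAN_FAILED.search(line)
        if match:
            # Assumption: the quoted text is "<resource name> <n> <n>".
            name = match.group(1).split()[0]
            if name not in names:
                names.append(name)
    return names


def mentions(log_lines, name):
    """Return the lines of another log (for example crsd.log) that mention the resource."""
    return [line for line in log_lines if name in line]


# Sample input copied from the alert log excerpt in this note.
alert = [
    "[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.",
    "[crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' "
    "failed as part of reboot-less node fencing",
    "[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.",
]

for name in failed_cleanups(alert):
    print(name)
```

Each returned name can then be passed to mentions together with the lines of crsd.log, read from wherever that file lives on the evicted node, to narrow the log down to entries about the resource that would not stop.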