11.2 RAC自动启动报错诊断

最新推荐文章于 2022-10-11 21:25:06 发布

ckawt40802

最新推荐文章于 2022-10-11 21:25:06 发布

阅读量587

点赞数

文章标签：数据库操作系统

一个客户的11.2 RAC for Linux X86-64环境停电后出现了故障，RAC环境无法自动启动。

登录数据库服务器后，发现CLUSTER进程存在，于是尝试手工启动数据库：

$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Sun May 29 20:17:16 2011

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DG_DATA/SPSP/spfileSPSP.ora'
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/SPSP/spfileSPSP.ora
ORA-15056: additional error message
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/spsp/spfilespsp.ora
ORA-15001: diskgroup "DG_DATA" does not exist or is not mounted
ORA-06512: at line 4

看来问题出在ASM磁盘组上，检查ASM磁盘组的告警日志：

NOTE: client SPSP2:SPSP registered, osid 14319, mbr 0x1
Sun May 29 16:58:31 2011
WARNING: Write Failed. group:1 disk:0 AU:12 offset:856064 size:4096
NOTE: unable to write any mirror side for diskgroup DG_ARCH
NOTE: cache initiating offline of disk 0 group DG_ARCH
NOTE: process 13597 initiating offline of disk 0.3915929597 (ARCH1) with mask 0x7e in group 1
Sun May 29 16:58:31 2011
WARNING: Disk ARCH1 in mode 0x7f is now being offlined
WARNING: Disk ARCH1 in mode 0x7f is now being taken offline
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96857fd, mode = 0x15
GMON updating disk modes for group 1 at 21 for pid 15, osid 13597
ERROR: too many offline disks in PST (grp 1)
Sun May 29 16:58:31 2011
NOTE: cache dismounting (not clean) group 1/0x9C08A705 (DG_ARCH)
WARNING: Disk ARCH1 in mode 0x7f offline aborted
WARNING: Offline for disk ARCH1 in mode 0x7f failed.
Sun May 29 16:58:31 2011
NOTE: halting all I/Os to diskgroup 1 (DG_ARCH)
NOTE: unable to offline disks after getting write error for diskgroup DG_ARCH
NOTE: disk 0 had IO error
NOTE: LGWR doing non-clean dismount of group 1 (DG_ARCH)
NOTE: LGWR sync ABA=237.2255 last written ABA 237.2256
Sun May 29 16:58:32 2011
ERROR: ORA-15130 in COD recovery for diskgroup 1/0x9c08a705 (DG_ARCH)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_rbal_13603.trc:
ORA-15130: diskgroup "DG_ARCH" is being dismounted
WARNING: Read Failed. group:1 disk:0 AU:1 offset:4096 size:4096
WARNING: Read Failed. group:1 disk:0 AU:1 offset:0 size:4096
ERROR: no PST quorum in group: required 1, found 0
ERROR: Could not heartbeat PST for grp 1. Force dismounting the disk group.
WARNING: Write Failed. group:2 disk:2 AU:1 offset:1044480 size:4096
WARNING: Write Failed. group:2 disk:1 AU:1 offset:1044480 size:4096
WARNING: Write Failed. group:2 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929598 (VD1) not responding to heart beat
WARNING: disk 1.3915929599 (VD2) not responding to heart beat
ERROR: too many offline disks in PST (grp 2)
WARNING: disk 2.3915929600 (VD3) not responding to heart beat
ERROR: too many offline disks in PST (grp 2)
Sun May 29 16:58:32 2011
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_b001_14488.trc:
ORA-15130: diskgroup "" is being dismounted
WARNING: Write Failed. group:3 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929601 (DATA1) not responding to heart beat
ERROR: too many offline disks in PST (grp 3)
WARNING: Write Failed. group:4 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929603 (FLSH1) not responding to heart beat
ERROR: too many offline disks in PST (grp 4)
WARNING: Write Failed. group:5 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929604 (REDO1) not responding to heart beat
ERROR: too many offline disks in PST (grp 5)
Sun May 29 16:58:32 2011
NOTE: process 14490 initiating offline of disk 0.3915929598 (VD1) with mask 0x7e in group 2
NOTE: process 14490 initiating offline of disk 1.3915929599 (VD2) with mask 0x7e in group 2
NOTE: process 14490 initiating offline of disk 2.3915929600 (VD3) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
.
.
.
freeing rdom 4
WARNING: dirty detached from domain 4
NOTE: cache dismounted group 4/0x9C28A708 (DG_FLSH)
SQL> alter diskgroup DG_FLSH dismount force /* ASM SERVER */
NOTE: cache deleting context for group DG_FLSH 4/0x9c28a708
GMON dismounting group 4 at 30 for pid 38, osid 14494
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DG_FLSH was dismounted
SUCCESS: alter diskgroup DG_FLSH dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DG_FLSH
NOTE: diskgroup resource ora.DG_FLSH.dg is offline
List of instances:
2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 2)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 5 invalid = TRUE
19 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 5
WARNING: dirty detached from domain 5
NOTE: cache dismounted group 5/0x9C38A709 (DG_REDO)
SQL> alter diskgroup DG_REDO dismount force /* ASM SERVER */
NOTE: cache deleting context for group DG_REDO 5/0x9c38a709
GMON dismounting group 5 at 31 for pid 39, osid 14496
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DG_REDO was dismounted
SUCCESS: alter diskgroup DG_REDO dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DG_REDO
NOTE: diskgroup resource ora.DG_REDO.dg is offline
Sun May 29 16:58:36 2011
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13637.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
NOTE: client exited [13627]
Sun May 29 16:58:40 2011
Shutting down instance (abort)
License high water mark = 15
USER (ospid: 13624): terminating the instance
Instance terminated by USER, pid = 13624
Sun May 29 16:58:41 2011
Instance shutdown complete

根据错误信息，ASM在尝试写磁盘组的时候出现了IO错误。

进一步检查CLUSTER的日志信息：

2011-05-29 00:00:19.569
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:19.570
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:19.678
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.584
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.584
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.684
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:24.818
[/oracle/product/11g/grid/bin/orarootagent.bin(13818)]CRS-5822:Agent '/oracle/product/11g/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/orarootagent_root/orarootagent_root.log.
2011-05-29 00:00:24.818
[/oracle/product/11g/grid/bin/oraagent.bin(13815)]CRS-5822:Agent '/oracle/product/11g/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:8} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/oraagent_grid/oraagent_grid.log.
2011-05-29 00:00:25.244
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:26.626
[crsd(14838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:27.278
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:28.620
[crsd(14855)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:29.306
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:30.664
[crsd(14868)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:31.333
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:32.692
[crsd(14881)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:33.367
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
.
.
.
2011-05-29 20:14:32.155
[crsd(14562)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 20:14:32.822
[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 20:14:34.195
[crsd(14575)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 20:14:34.849
[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 20:14:34.849
[ohasd(12719)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
2011-05-29 20:15:43.700
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 99960 milliseconds
2011-05-29 20:15:43.701
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 99960 milliseconds
2011-05-29 20:15:44.702
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 99100 milliseconds
2011-05-29 20:16:33.800
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 49870 milliseconds
2011-05-29 20:16:33.801
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 49870 milliseconds
2011-05-29 20:16:34.802
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 49000 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 19950 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 19810 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 19810 milliseconds
2011-05-29 20:17:23.900
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD1; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD2; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD3; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log
2011-05-29 20:17:23.952
[cssd(13251)]CRS-1652:Starting clean up of CRSD resources.
2011-05-29 20:17:24.133
[cssd(13251)]CRS-1653:The clean up of the CRSD resources failed.

根据错误信息，开始的时候，所有的磁盘组都可以加载到MOUNT状态，但是随后Oracle在尝试写VOT所在磁盘组时，出现了错误，导致OCR进程退出，而其他的磁盘组也因为错误陆续被OFFLINE。

而且由于无法获取VOT磁盘，很快节点被踢出CLUSTER环境，导致主机重启。

最后检查了cssd/ocssd.log信息：

2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(DAALL_DB), updateseq(5), status(0), sendresp(1)
2011-05-29 05:46:20.919: [ SKGFD][1076803904]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op ioerror error I/O Error)
2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmTestSetLastGrockUpdate: grock(GR+GCR1), updateseq(4) msgseq(5), lastupdt<0xb585af0>, ignoreseq(0)
2011-05-29 05:46:20.919: [ CSSD][1076803904](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 17 of ORCL:VD1
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDeadProc: proc 0xb23bd10
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDestroyProc: cleaning up proc(0xb23bd10) con(0x4d0b) skgpid 14228 ospid 14228 with 0 clients, refcount 0
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDiscEndpcl: gipcDestroy 0x4d0b
2011-05-29 05:46:20.920: [ CSSD][1076803904]clssnmvDiskAvailabilityChange: voting file ORCL:VD1 now offline
2011-05-29 05:46:20.920: [ SKGFD][1097120064]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab01aa5b0 for disk :ORCL:VD1:

2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmRemoveMember: grock GR+GCR1, member number 3 (0x2aaab03886d0) node number 1 state 0x0 grock type 2
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(1), incarn 5, mbrc 3, to member 1, events 0x280, state 0x0
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmRPCDone: rpc 0x2aaaaaec7060 (RPC#98) state 6, flags 0x100
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmDelMemCmpl: rpc 0x2aaaaaec7060, ret 0, client (nil) member 0x2aaab03886d0
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmFreeRPCIndex: freeing rpc 98
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(GR+GCR1), updateseq(5), status(0), sendresp(1)
2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmDeadProc: proc 0x2aaab03997f0
2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmFenceClient: fencing client (0x2aaab0399960), member 0 in group DGSPSP0, no share, death fence 1, SAGE fence 0
.
.
.
2011-05-29 20:44:54.778: [ CRSMAIN][1088713120] Checking the OCR device
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Sync-up with OCR
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Connecting to the CSS Daemon
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Getting local node number
2011-05-29 20:44:54.779: [ CRSMAIN][1105021248] Policy Engine is not initialized yet!
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Initializing OCR
[ CLWAL][1088713120]clsw_Initialize: OLR initlevel [70000]
2011-05-29 20:44:55.099: [ OCRASM][1088713120]proprasmo: Error in open/create file in dg [DG_CRS]
[ OCRASM][1088713120]SLOS : SLOS: cat=8, pn=kgfoOpenFile01, dep=15056, loc=kgfokge

2011-05-29 20:44:55.099: [ OCRASM][1088713120]ASM Error Stack :
2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: kgfoCheckMount returned [6]
2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: The ASM disk group DG_CRS is not found or not mounted
2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: Failed to open [+DG_CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: No OCR/OLR devices are usable
2011-05-29 20:44:55.127: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL
2011-05-29 20:44:55.127: [ GIPC][1088713120] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 687], original from [clsss.c : 5325]
2011-05-29 20:44:55.129: [ default][1088713120]clsvactversion:4: Retrieving Active Version from local storage.
2011-05-29 20:44:55.130: [ CSSCLNT][1088713120]clssgsgrppubdata: group (ocr_Oracle-cluster) not found

2011-05-29 20:44:55.130: [ OCRRAW][1088713120]proprio_repairconf: Failed to retrieve the group public data. CSS ret code [20]
2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprioo: Failed to auto repair the OCR configuration.
2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprinit: Could not open raw device
2011-05-29 20:44:55.133: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL
2011-05-29 20:44:55.134: [ OCRAPI][1088713120]a_init:16!: Backend init unsuccessful : [26]
2011-05-29 20:44:55.134: [ CRSOCR][1088713120] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

2011-05-29 20:44:55.134: [ CRSOCR][1088713120][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)

2011-05-29 20:44:55.134: [ CRSD][1088713120][PANIC] CRSD Exiting. OCR Failed
2011-05-29 20:44:55.134: [ CRSD][1088713120] Done.

基于错误信息的判断，问题和存储层直接相关，如果不是磁盘损坏，就是多路径设置异常，导致问题的原因是cluster在读写VOT和OCR的ASM是出现了错误，问题应该发生在组成这个ASM磁盘组的裸设备上。

最终客户通过重建OCR和VOT解决了这个错误。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/4227/viewspace-696728/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-696728/