Fault Report
Symptoms
The test database's disk reached 100% usage. After the log files were deleted to free space, the Oracle SQL developer application connected to instance node2 could still execute queries normally, but the database cluster status was abnormal.
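For the record, a full filesystem can be confirmed with df before deleting anything; the mount point below is an assumption for this environment.

```shell
# Sketch: report usage of the filesystem holding the Grid/database homes.
# /u01 is an assumed mount point for this environment; adjust as needed.
df -hP /u01 | awk 'NR==2 {print $5 " used on " $6}'
```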
Checked as the grid user:
[root@node1 bin]# su - grid
[grid@node1 ~]$ cd /u01/app/11.2.0/grid_1/bin
[grid@node1 bin]$ pwd
/u01/app/11.2.0/grid_1/bin
[grid@node1 bin]$ ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Checked as the root user:
[root@node1 bin]# ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
OHAS, CSS, and EVM are online; only CRS is abnormal.
Root Cause
Check the CRS daemon log:
$GRID_HOME/log/<nodename>/crsd/crsd.log
2016-08-19 22:21:56.292: [UiServer][3380524800] CS(0x7fca90016ef0)set Properties ( root,0x7fcaac15b640)
2016-08-19 22:21:56.303: [UiServer][3382626048]{1:35321:2977} Sending message to PE. ctx= 0x7fca94008b60, Client PID: 7985
2016-08-19 22:21:56.303: [ CRSPE][3384727296]{1:35321:2977} Processing PE command id=133976. Description: [Stat Resource : 0x7fca880c04a0]
2016-08-19 22:21:56.304: [UiServer][3382626048]{1:35321:2977} Done for ctx=0x7fca94008b60
2016-08-19 22:24:47.808: [ CRSPE][3384727296]{0:1:5} State change received from node2 for ora.asm node2 1
2016-08-19 22:24:47.840: [ CRSPE][3384727296]{0:1:5} Processing PE command id=8072. Description: [Resource State Change (ora.asm node2 1 ) : 0x7fca880c4a60]
2016-08-19 22:24:47.997: [ CRSPE][3384727296]{0:1:5} State information for [ora.asm node2 1] has been lost, all we know is the initial check timed out. Issuing check operations until we can operate on better data.
2016-08-19 22:24:48.722: [ CRSPE][3384727296]{0:1:5} State information for [ora.asm node2 1] is still bad. Issuing another check.
The state information of the ora.asm resource is abnormal.
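Instead of paging through crsd.log by hand, the relevant errors can be tallied with a quick filter. This is a rough sketch; the log path and error patterns are the ones from this incident:

```shell
# Sketch: count the space/OCR-related errors from this fault in crsd.log.
LOG=/u01/app/11.2.0/grid_1/log/node1/crsd/crsd.log
grep -hoE 'ORA-09925|No space left on device|PROC-26|asmhandle is NULL' "$LOG" \
  | sort | uniq -c | sort -rn
```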
Continue checking the log file:
[grid@node1 crsd]$ vi crsd.log
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.052: [ OCRASM][2829412128]proprasmo: kgfoCheckMount returned [7]
2016-08-20 04:55:07.052: [ OCRASM][2829412128]proprasmo: The ASM instance is down
2016-08-20 04:55:07.053: [ OCRRAW][2829412128]proprioo: Failed to open [+DATA]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2016-08-20 04:55:07.053: [ OCRRAW][2829412128]proprioo: No OCR/OLR devices are usable
2016-08-20 04:55:07.053: [ OCRASM][2829412128]proprasmcl: asmhandle is NULL
2016-08-20 04:55:07.054: [ GIPC][2829412128] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2016-08-20 04:55:07.057: [ default][2829412128]clsvactversion:4: Retrieving Active Version from local storage.
2016-08-20 04:55:07.062: [ OCRRAW][2829412128]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2016-08-20 04:55:07.066: [ OCRRAW][2829412128]proprinit: Could not open raw device
2016-08-20 04:55:07.066: [ OCRASM][2829412128]proprasmcl: asmhandle is NULL
2016-08-20 04:55:07.068: [ OCRAPI][2829412128]a_init:16!: Backend init unsuccessful : [26]
2016-08-20 04:55:07.068: [ CRSOCR][2829412128] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.069: [ CRSD][2829412128] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.069: [ CRSD][2829412128][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-08-20 04:55:07.069: [ CRSD][2829412128] Done.
With no free space left on the device, the audit trail file could not be created; as a result the ASM instance went down and CRS could not initialize the OCR.
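The immediate remedy for ORA-09925 is to free space on the filesystem holding the audit trail. Accumulated Grid/ASM *.aud files are a common culprit; a cleanup along these lines (the directory and the 7-day retention are assumptions; confirm audit_file_dest for your environment first) usually reclaims enough space:

```shell
# Sketch: remove Grid/ASM audit files older than 7 days, then re-check space.
# AUDIT_DIR is an assumption; verify with 'show parameter audit_file_dest'.
AUDIT_DIR=${AUDIT_DIR:-/u01/app/11.2.0/grid_1/rdbms/audit}
find "$AUDIT_DIR" -type f -name '*.aud' -mtime +7 -delete
df -hP "$AUDIT_DIR"
```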
Resolution
On both nodes, check the disk headers of the disks in the OCR disk group.
As root, change to the CRS_HOME/bin directory:
[root@node2 bin]# ./kfed read /dev/sdc1
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483648 ; 0x008: disk=0
kfbh.check: 1828899572 ; 0x00c: 0x6d02caf4
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKOCR ; 0x000: length=11
kfdhdb.driver.reserved[0]: 5391183 ; 0x008: 0x0052434f
kfdhdb.driver.reserved[1]: 0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
kfdhdb.compat: 186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum: 0 ; 0x024: 0x0000
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: OCR ; 0x028: length=3
kfdhdb.grpname: DATA ; 0x048: length=4
kfdhdb.fgname: OCR ; 0x068: length=3
kfdhdb.capname: ; 0x088: length=0
kfdhdb.crestmp.hi: 33036370 ; 0x0a8: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.crestmp.lo: 2945546240 ; 0x0ac: USEC=0x0 MSEC=0x5e SECS=0x39 MINS=0x2b
kfdhdb.mntstmp.hi: 33036370 ; 0x0b0: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.mntstmp.lo: 3117557760 ; 0x0b4: USEC=0x0 MSEC=0x8a SECS=0x1d MINS=0x2e
kfdhdb.secsize: 512 ; 0x0b8: 0x0200
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact: 113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize: 2047 ; 0x0c4: 0x000007ff
kfdhdb.pmcnt: 2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn: 1 ; 0x0cc: 0x00000001
kfdhdb.altlocn: 2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn: 2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]: 0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]: 0 ; 0x0da: 0x0000
kfdhdb.redomirrors[2]: 0 ; 0x0dc: 0x0000
kfdhdb.redomirrors[3]: 0 ; 0x0de: 0x0000
kfdhdb.dbcompat: 168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi: 33036370 ; 0x0e4: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.grpstmp.lo: 2945359872 ; 0x0e8: USEC=0x0 MSEC=0x3a8 SECS=0x38 MINS=0x2b
kfdhdb.vfstart: 352 ; 0x0ec: 0x00000160
kfdhdb.vfend: 384 ; 0x0f0: 0x00000180
kfdhdb.spfile: 58 ; 0x0f4: 0x0000003a
kfdhdb.spfflg: 1 ; 0x0f8: 0x00000001
...
kfdhdb.ub4spare[0]: 0 ; 0x0fc: 0x00000000
kfdhdb.acdb.ub2spare: 0 ; 0x1de: 0x0000
The disk header is normal (kfdhdb.hdrsts = KFDHDR_MEMBER).
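When several disks are involved, the header check can be scripted; the helper below just pulls kfdhdb.hdrsts out of kfed output (the device list is an assumption, and anything other than KFDHDR_MEMBER deserves a closer look):

```shell
# hdr_status: extract the ASM disk header status field from kfed output.
hdr_status() { awk '/kfdhdb\.hdrsts/ {print $NF}'; }

# Assumed device list; run as root from CRS_HOME/bin on each node.
for dev in /dev/sdc1; do
  st=$(./kfed read "$dev" | hdr_status)
  if [ "$st" = "KFDHDR_MEMBER" ]; then
    echo "$dev: header OK ($st)"
  else
    echo "$dev: header suspect ($st)"
  fi
done
```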
Restart CRS directly
As root, run:
# /u01/app/11.2.0/grid_1/bin/crsctl stop crs
If the command fails, add the -f option:
[root@node1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.evmd' on 'node1'
CRS-2673: Attempting to stop 'ora.asm' on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2677: Stop of 'ora.evmd' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2677: Stop of 'ora.asm' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'node1'
CRS-2677: Stop of 'ora.crf' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
After shutdown completes, start the stack again as root (./crsctl start crs), wait for the resources to come up, then re-check:
[root@node1 bin]# ./crsctl status res -t
---------------------------------------------------------------------------
NAME                     TARGET  STATE        SERVER       STATE_DETAILS
---------------------------------------------------------------------------
Local Resources
---------------------------------------------------------------------------
ora.DATA.dg              ONLINE  ONLINE       node1
ora.DATA1.dg             ONLINE  ONLINE       node1
ora.FRA.dg               ONLINE  ONLINE       node1
ora.LISTENER.lsnr        ONLINE  ONLINE       node1
ora.asm                  ONLINE  ONLINE       node1        Started
ora.gsd                  OFFLINE OFFLINE      node1
ora.net1.network         ONLINE  ONLINE       node1
ora.ons                  ONLINE  ONLINE       node1
---------------------------------------------------------------------------
Cluster Resources
---------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr  1  ONLINE  ONLINE   node1
ora.cvu                  1  ONLINE  ONLINE   node1
ora.node1.vip            1  ONLINE  ONLINE   node1
ora.node2.vip            1  ONLINE  OFFLINE
ora.oc4j                 1  ONLINE  ONLINE   node1
ora.scan1.vip            1  ONLINE  ONLINE   node1
ora.vmtest.db            1  ONLINE  OFFLINE
                         2  ONLINE  OFFLINE
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
After the database instances have started, rerun the command and confirm that each instance is in the OPEN state.
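A quick way to spot resources that have not come back after the restart is to filter the status output for entries whose target is ONLINE but whose state is still OFFLINE. This is a plain text match and assumes the default 'crsctl status res -t' column order:

```shell
# pending: print resources with TARGET=ONLINE but STATE=OFFLINE.
pending() { grep -E 'ONLINE[[:space:]]+OFFLINE'; }
./crsctl status res -t | pending
```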
Reference: My Oracle Support Doc ID 1095214.1
Source: ITPUB blog, http://blog.itpub.net/31142205/viewspace-2124849/