Fault Report
Symptoms
The test database's disk reached 100% usage. After the log files were deleted to free space, the Oracle SQL developer application connected to instance node2 could still execute queries normally, but the database cluster status was abnormal.
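For the record, a full filesystem can be confirmed with df before deleting anything; the mount point below is an assumption for this environment.

```shell
# Sketch: report usage of the filesystem holding the Grid/database homes.
# /u01 is an assumed mount point for this environment; adjust as needed.
df -hP /u01 | awk 'NR==2 {print $5 " used on " $6}'
```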
Checked as the grid user:
[root@node1 bin]# su - grid
[grid@node1 ~]$ cd /u01/app/11.2.0/grid_1/bin
[grid@node1 bin]$ pwd
/u01/app/11.2.0/grid_1/bin
[grid@node1 bin]$ ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Checked as the root user:
[root@node1 bin]# ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
OHAS, CSS, and EVM are online; only CRS is abnormal.
Root Cause
Check the CRS daemon log:
$GRID_HOME/log/<nodename>/crsd/crsd.log
2016-08-19 22:21:56.292: [UiServer][3380524800] CS(0x7fca90016ef0)set Properties ( root,0x7fcaac15b640)
2016-08-19 22:21:56.303: [UiServer][3382626048]{1:35321:2977} Sending message to PE. ctx= 0x7fca94008b60, Client PID: 7985
2016-08-19 22:21:56.303: [ CRSPE][3384727296]{1:35321:2977} Processing PE command id=133976. Description: [Stat Resource : 0x7fca880c04a0]
2016-08-19 22:21:56.304: [UiServer][3382626048]{1:35321:2977} Done for ctx=0x7fca94008b60
2016-08-19 22:24:47.808: [ CRSPE][3384727296]{0:1:5} State change received from node2 for ora.asm node2 1
2016-08-19 22:24:47.840: [ CRSPE][3384727296]{0:1:5} Processing PE command id=8072. Description: [Resource State Change (ora.asm node2 1 ) : 0x7fca880c4a60]
2016-08-19 22:24:47.997: [ CRSPE][3384727296]{0:1:5} State information for [ora.asm node2 1] has been lost, all we know is the initial check timed out. Issuing check operations until we can operate on better data.
2016-08-19 22:24:48.722: [ CRSPE][3384727296]{0:1:5} State information for [ora.asm node2 1] is still bad. Issuing another check.
The state information of the ora.asm resource is abnormal.
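Instead of paging through crsd.log by hand, the relevant errors can be tallied with a quick filter. This is a rough sketch; the log path and error patterns are the ones from this incident:

```shell
# Sketch: count the space/OCR-related errors from this fault in crsd.log.
LOG=/u01/app/11.2.0/grid_1/log/node1/crsd/crsd.log
grep -hoE 'ORA-09925|No space left on device|PROC-26|asmhandle is NULL' "$LOG" \
  | sort | uniq -c | sort -rn
```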
Continue checking the log file:
[grid@node1 crsd]$ vi crsd.log
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.052: [ OCRASM][2829412128]proprasmo: kgfoCheckMount returned [7]
2016-08-20 04:55:07.052: [ OCRASM][2829412128]proprasmo: The ASM instance is down
2016-08-20 04:55:07.053: [ OCRRAW][2829412128]proprioo: Failed to open [+DATA]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2016-08-20 04:55:07.053: [ OCRRAW][2829412128]proprioo: No OCR/OLR devices are usable
2016-08-20 04:55:07.053: [ OCRASM][2829412128]proprasmcl: asmhandle is NULL
2016-08-20 04:55:07.054: [ GIPC][2829412128] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2016-08-20 04:55:07.057: [ default][2829412128]clsvactversion:4: Retrieving Active Version from local storage.
2016-08-20 04:55:07.062: [ OCRRAW][2829412128]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2016-08-20 04:55:07.066: [ OCRRAW][2829412128]proprinit: Could not open raw device
2016-08-20 04:55:07.066: [ OCRASM][2829412128]proprasmcl: asmhandle is NULL
2016-08-20 04:55:07.068: [ OCRAPI][2829412128]a_init:16!: Backend init unsuccessful : [26]
2016-08-20 04:55:07.068: [ CRSOCR][2829412128] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.069: [ CRSD][2829412128] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 28: No space left on device
Additional information: 9925
2016-08-20 04:55:07.069: [ CRSD][2829412128][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-08-20 04:55:07.069: [ CRSD][2829412128] Done.
With no free space left on the device, the audit trail file could not be created; as a result the ASM instance went down and CRS could not initialize the OCR.
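The immediate remedy for ORA-09925 is to free space on the filesystem holding the audit trail. Accumulated Grid/ASM *.aud files are a common culprit; a cleanup along these lines (the directory and the 7-day retention are assumptions; confirm audit_file_dest for your environment first) usually reclaims enough space:

```shell
# Sketch: remove Grid/ASM audit files older than 7 days, then re-check space.
# AUDIT_DIR is an assumption; verify with 'show parameter audit_file_dest'.
AUDIT_DIR=${AUDIT_DIR:-/u01/app/11.2.0/grid_1/rdbms/audit}
find "$AUDIT_DIR" -type f -name '*.aud' -mtime +7 -delete
df -hP "$AUDIT_DIR"
```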
Resolution
On both nodes, check the disk headers of the disks in the OCR disk group.
As root, change to the CRS_HOME/bin directory:
[root@node2 bin]# ./kfed read /dev/sdc1
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483648 ; 0x008: disk=0
kfbh.check: 1828899572 ; 0x00c: 0x6d02caf4
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKOCR ; 0x000: length=11
kfdhdb.driver.reserved[0]: 5391183 ; 0x008: 0x0052434f
kfdhdb.driver.reserved[1]: 0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
kfdhdb.compat: 186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum: 0 ; 0x024: 0x0000
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: OCR ; 0x028: length=3
kfdhdb.grpname: DATA ; 0x048: length=4
kfdhdb.fgname: OCR ; 0x068: length=3
kfdhdb.capname: ; 0x088: length=0
kfdhdb.crestmp.hi: 33036370 ; 0x0a8: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.crestmp.lo: 2945546240 ; 0x0ac: USEC=0x0 MSEC=0x5e SECS=0x39 MINS=0x2b
kfdhdb.mntstmp.hi: 33036370 ; 0x0b0: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.mntstmp.lo: 3117557760 ; 0x0b4: USEC=0x0 MSEC=0x8a SECS=0x1d MINS=0x2e
kfdhdb.secsize: 512 ; 0x0b8: 0x0200
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact: 113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize: 2047 ; 0x0c4: 0x000007ff
kfdhdb.pmcnt: 2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn: 1 ; 0x0cc: 0x00000001
kfdhdb.altlocn: 2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn: 2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]: 0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]: 0 ; 0x0da: 0x0000
kfdhdb.redomirrors[2]: 0 ; 0x0dc: 0x0000
kfdhdb.redomirrors[3]: 0 ; 0x0de: 0x0000
kfdhdb.dbcompat: 168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi: 33036370 ; 0x0e4: HOUR=0x12 DAYS=0x2 MNTH=0x6 YEAR=0x7e0
kfdhdb.grpstmp.lo: 2945359872 ; 0x0e8: USEC=0x0 MSEC=0x3a8 SECS=0x38 MINS=0x2b
kfdhdb.vfstart: 352 ; 0x0ec: 0x00000160
kfdhdb.vfend: 384 ; 0x0f0: 0x00000180
kfdhdb.spfile: 58 ; 0x0f4: 0x0000003a
kfdhdb.spfflg: 1 ; 0x0f8: 0x00000001
...
kfdhdb.ub4spare[0]: 0 ; 0x0fc: 0x00000000
kfdhdb.acdb.ub2spare: 0 ; 0x1de: 0x0000
The disk header is normal (kfdhdb.hdrsts = KFDHDR_MEMBER).
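When several disks are involved, the header check can be scripted; the helper below just pulls kfdhdb.hdrsts out of kfed output (the device list is an assumption, and anything other than KFDHDR_MEMBER deserves a closer look):

```shell
# hdr_status: extract the ASM disk header status field from kfed output.
hdr_status() { awk '/kfdhdb\.hdrsts/ {print $NF}'; }

# Assumed device list; run as root from CRS_HOME/bin on each node.
for dev in /dev/sdc1; do
  st=$(./kfed read "$dev" | hdr_status)
  if [ "$st" = "KFDHDR_MEMBER" ]; then
    echo "$dev: header OK ($st)"
  else
    echo "$dev: header suspect ($st)"
  fi
done
```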
Restart CRS directly
As root, run:
# /u01/app/11.2.0/grid_1/bin/crsctl stop crs
If the command fails, add the -f option:
[root@node1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.evmd' on 'node1'
CRS-2673: Attempting to stop 'ora.asm' on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2677: Stop of 'ora.evmd' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2677: Stop of 'ora.asm' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'node1'
CRS-2677: Stop of 'ora.crf' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
After shutdown completes, start the stack again as root (./crsctl start crs), wait for the resources to come up, then re-check:
[root@node1 bin]# ./crsctl status res -t
---------------------------------------------------------------------------
NAME                     TARGET  STATE        SERVER       STATE_DETAILS
---------------------------------------------------------------------------
Local Resources
---------------------------------------------------------------------------
ora.DATA.dg              ONLINE  ONLINE       node1
ora.DATA1.dg             ONLINE  ONLINE       node1
ora.FRA.dg               ONLINE  ONLINE       node1
ora.LISTENER.lsnr        ONLINE  ONLINE       node1
ora.asm                  ONLINE  ONLINE       node1        Started
ora.gsd                  OFFLINE OFFLINE      node1
ora.net1.network         ONLINE  ONLINE       node1
ora.ons                  ONLINE  ONLINE       node1
---------------------------------------------------------------------------
Cluster Resources
---------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr  1  ONLINE  ONLINE   node1
ora.cvu                  1  ONLINE  ONLINE   node1
ora.node1.vip            1  ONLINE  ONLINE   node1
ora.node2.vip            1  ONLINE  OFFLINE
ora.oc4j                 1  ONLINE  ONLINE   node1
ora.scan1.vip            1  ONLINE  ONLINE   node1
ora.vmtest.db            1  ONLINE  OFFLINE
                         2  ONLINE  OFFLINE
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
After the database instances have started, rerun the command and confirm that each instance is in the OPEN state.
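A quick way to spot resources that have not come back after the restart is to filter the status output for entries whose target is ONLINE but whose state is still OFFLINE. This is a plain text match and assumes the default 'crsctl status res -t' column order:

```shell
# pending: print resources with TARGET=ONLINE but STATE=OFFLINE.
pending() { grep -E 'ONLINE[[:space:]]+OFFLINE'; }
./crsctl status res -t | pending
```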
Reference: My Oracle Support Doc ID 1095214.1
Source: ITPUB blog, http://blog.itpub.net/31142205/viewspace-2124849/