Oracle _asm_hbeatiowait: RAC Node Outage Caused by a Forced Dismount of the CRS Disk Group


Problem Description

1. Environment
    Oracle RAC 11.2.0.4.0
    Red Hat Linux 6.9

2. Alert

During a routine inspection, the following command failed:
    $ crsctl stat res -t
    CRS-0184: Cannot communicate with the CRS daemon.

3. Check the CRS status
    $ crsctl check cluster
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4533: Event Manager is online

4. Start the CRS service
    [root@node1 ~]# /app/grid/bin/crsctl start crs
    CRS-4640: Oracle High Availability Services is already active
    CRS-4000: Command Start failed, or completed with errors.

The start failed.

5. Check the CRS log
    [ OCRASM][33715952]proprasmo: The ASM disk group crs is not found or not mounted
    [ OCRRAW][33715952]proprioo: Failed to open [+crs]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
    [ OCRRAW][33715952]proprioo: No OCR/OLR devices are usable
    [ OCRASM][33715952]proprasmcl: asmhandle is NULL
    [ GIPC][33715952] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
    [ default][33715952]clsvactversion:4: Retrieving Active Version from local storage.
    [ OCRRAW][33715952]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
    [ OCRRAW][33715952]proprinit: Could not open raw device
    [ OCRASM][33715952]proprasmcl: asmhandle is NULL
    [ OCRAPI][33715952]a_init:16!: Backend init unsuccessful : [26]
    [ CRSOCR][33715952] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

The log points to a problem with the disk group: CRS cannot open the OCR location +crs.
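In a long CRS log, the disk group named in the OCRASM message can be extracted mechanically instead of by eye. The following is a minimal sketch, assuming the log text arrives on stdin and the message wording matches the excerpt in step 5; the function name ocr_missing_diskgroups is hypothetical and not part of any Oracle tool.

```shell
# Print each ASM disk group that the CRS log reports as
# "not found or not mounted", one name per line, de-duplicated.
# Reads log text on stdin. The function name is illustrative only.
ocr_missing_diskgroups() {
  sed -n 's/.*The ASM disk group \([^ ]*\) is not found or not mounted.*/\1/p' |
    sort -u
}
```

It would be fed with the CRS daemon log of the affected node, for example by redirecting that file into the function.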

6. Check the disk group status
    SQL> set linesize 200
    SQL> select GROUP_NUMBER,NAME,TYPE,ALLOCATION_UNIT_SIZE,STATE from v$asm_diskgroup;

    GROUP_NUMBER NAME   TYPE   ALLOCATION_UNIT_SIZE STATE
    ------------ ------ ------ -------------------- -----------
               0 CRS                              0 DISMOUNTED
               2 DATA1  EXTERN              4194304 MOUNTED

The CRS disk group is not mounted.
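The same check can be scripted for routine inspections. This is a rough sketch, assuming the output of the v$asm_diskgroup query from step 6 is piped in on stdin, with NAME as the second column and STATE as the last; dismounted_diskgroups is a hypothetical helper name.

```shell
# List disk groups whose STATE column is DISMOUNTED.
# Reads the sqlplus output of the v$asm_diskgroup query on stdin and assumes
# NAME is the second column and STATE is the last column on each data row.
dismounted_diskgroups() {
  awk '$NF == "DISMOUNTED" { print $2 }'
}
```

Matching on the last field rather than a fixed column number matters here because a dismounted group has an empty TYPE, which shifts the later columns to the left.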

7. Check the ASM log
    SQL> show parameter dump

The alert log reports the following:
    WARNING: Waited 19 secs for write IO to PST disk 0 in group 1.
    WARNING: Waited 19 secs for write IO to PST disk 0 in group 1.
    WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
    WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
    WARNING: Waited 20 secs for write IO to PST disk 0 in group 4.
    WARNING: Waited 20 secs for write IO to PST disk 0 in group 4.
    Fri Jul 07 02:15:03 2017
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 4.
    WARNING: Waited 15 secs for write IO to PST disk 0 in group 4.
    SQL> alter diskgroup CRS dismount force /* ASM SERVER:375140205 */
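To see how long the PST heartbeat writes were stalled per disk group, the warnings can be summarized with a small filter. This is a sketch only, assuming the alert log has one message per line in the wording shown in step 7; summarize_pst_waits is a hypothetical name.

```shell
# Summarize "Waited N secs for write IO to PST disk D in group G." warnings.
# Reads alert-log text on stdin and prints one line per disk group:
#   group=<G> warnings=<count> max_wait_secs=<max>
summarize_pst_waits() {
  awk '
    /WARNING: Waited [0-9]+ secs for write IO to PST disk [0-9]+ in group [0-9]+/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "Waited") secs = $(i + 1) + 0
        if ($i == "group")  grp  = $(i + 1) + 0   # "+ 0" drops the trailing "."
      }
      count[grp]++
      if (secs > max[grp]) max[grp] = secs
    }
    END {
      for (g in count)
        printf "group=%d warnings=%d max_wait_secs=%d\n", g, count[g], max[g]
    }
  ' | sort -t= -k2,2n   # awk array order is unspecified, so sort by group number
}
```

Comparing the reported maximum with the current _asm_hbeatiowait value shows whether the waits reached the point at which ASM gives up on the disk.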

8. Mount the disk group
    sqlplus / as sysasm
    SQL> alter diskgroup crs mount;

9. Start the CRS daemon
    [root@node1 ~]# /app/grid/bin/crsctl start res ora.crsd -init
    CRS-2672: Attempting to start 'ora.crsd' on 'node1'
    CRS-2676: Start of 'ora.crsd' on 'node1' succeeded

Root Cause Analysis

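The cluster logs show that an I/O problem on the storage disks (either a brief fibre link interruption or I/O latency) brought the CRS stack down. Oddly, although CRS went offline, the ASM instance and the database instance kept running and remained usable. A search of Oracle Support turned up note 1581864.1, which relates ASM access timeouts on the CRS voting disk to the hidden parameter _asm_hbeatiowait, and relates that parameter in turn to the polling_interval setting of the operating system's multipath configuration. The specific cause here is that the operating system's disk access timeout is much longer than the ASM timeout for the voting disk: ASM stops waiting while the operating system is still waiting on the I/O, so Oracle RAC judges the voting disk in ASM inaccessible and forces it offline. The fix is to determine the operating system's polling_interval and the current value of _asm_hbeatiowait, then raise _asm_hbeatiowait above polling_interval.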
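The rule in this analysis reduces to one comparison: the ASM heartbeat wait must be larger than the OS-side value. The sketch below illustrates that comparison. The function name check_hbeat_margin is hypothetical, both arguments are assumed to be whole seconds, and which OS-side value to compare against (polling_interval or the per-device timeout) depends on the setup.

```shell
# Compare the ASM heartbeat wait with an OS-side timeout (both in seconds).
# Prints OK and returns 0 when the ASM value is strictly larger;
# prints RISK and returns 1 otherwise. Illustrative helper only.
check_hbeat_margin() {
  asm_wait=$1
  os_timeout=$2
  if [ "$asm_wait" -gt "$os_timeout" ]; then
    echo "OK: _asm_hbeatiowait=${asm_wait}s > OS timeout=${os_timeout}s"
  else
    echo "RISK: _asm_hbeatiowait=${asm_wait}s <= OS timeout=${os_timeout}s"
    return 1
  fi
}
```

With the values from this incident, check_hbeat_margin 15 30 reports RISK, and check_hbeat_margin 45 30 reports OK.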

Solution

Check the current value of _asm_hbeatiowait in the RAC ASM instance (the default is 15 seconds):
    SQL> SELECT ksppinm, ksppstvl, ksppdesc
         FROM x$ksppi x, x$ksppcv y
         WHERE x.indx = y.indx AND ksppinm = '_asm_hbeatiowait';

    KSPPINM           KSPPSTVL  KSPPDESC
    _asm_hbeatiowait  15        number of secs to wait for PST Async Hbeat IO return

Check the operating system's disk access timeout (the RHEL 6.8 default is 30 seconds):
    [root@rac1 ~]# cat /sys/block/sdb/device/timeout
    30
    [root@rac1 ~]# cat /etc/redhat-release
    Red Hat Enterprise Linux Server release 6.8 (Santiago)
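The check above reads a single device. On a host with many LUNs, it may be easier to list the timeout of every sd device at once. This is a minimal sketch, assuming the devices appear as sd* under /sys/block; list_disk_timeouts and its optional sysfs-root argument are hypothetical.

```shell
# Print "<device> <timeout-in-seconds>" for every sd* block device.
# The optional first argument overrides the sysfs root (default: /sys),
# which makes the function usable against a copied or mocked tree.
list_disk_timeouts() {
  sys_root=${1:-/sys}
  for f in "$sys_root"/block/sd*/device/timeout; do
    [ -r "$f" ] || continue            # skip when the glob matches nothing
    dev=${f#"$sys_root"/block/}        # strip the leading path
    dev=${dev%%/*}                     # keep only the device name, e.g. sdb
    printf '%s %s\n' "$dev" "$(cat "$f")"
  done
}
```

Any device whose timeout is larger than the current _asm_hbeatiowait value is a candidate for the problem described in this article.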

Set _asm_hbeatiowait to 45 seconds (this is a static parameter, so the cluster must be restarted):
    SQL> alter system set "_asm_hbeatiowait"=45 scope=spfile sid='*';
    System altered.

Restart the cluster and the server

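Neither root nor grid could restart the CRS service successfully, so we decided to reboot the server. As root, we first stopped the cluster services with crsctl stop crs -f and then ran reboot. This went smoothly and the server restarted.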

Original article on modb.pro (墨天轮): https://www.modb.pro/db/33361
