[ORACLE] RAC disk timeout causes database restarts: WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.

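A two-node Oracle RAC 11.2.0.4 database deployed in the cloud restarted frequently. The ASM alert log showed warnings about delayed PST heartbeats. The root cause was that the disk group did not respond to the ASM instance within 15 seconds and was dismounted. The fix was to raise the hidden parameter _asm_hbeatiowait to 120 seconds and then restart CRS and the database. The incident shows that shared cloud disks can be unstable and that virtualization can undermine high availability. Deploying Oracle RAC in a virtualized environment is best avoided.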

Project scenario:

A two-node Oracle RAC 11.2.0.4 database deployed on cloud resources restarts at irregular intervals.


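Problem description:

The site reported that both database nodes kept restarting. The CRS logs showed no major errors, but the ASM alert log contained the following entries.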

Fri Sep 09 10:32:50 2022
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
Fri Sep 09 10:33:13 2022
NOTE: client exited [2319]
Fri Sep 09 10:33:13 2022
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Fri Sep 09 10:33:13 2022
PMON (ospid: 2262): terminating the instance due to error 481
Fri Sep 09 10:33:14 2022
ORA-1092 : opitsk aborting process
Fri Sep 09 10:33:14 2022
License high water mark = 19
Instance terminated by PMON, pid = 2262
USER (ospid: 8682): terminating the instance
Instance terminated by USER, pid = 8682
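
To see how often each disk group hit the heartbeat warning, the alert log can be scanned for the message shown above. The following is a minimal sketch, not an Oracle tool: the names `PST_WARNING` and `count_pst_warnings` are hypothetical, and the sample text is copied from the log excerpt in this post.

```python
import re
from collections import Counter

# Matches ASM alert-log lines such as:
#   WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
PST_WARNING = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)\."
)


def count_pst_warnings(lines):
    """Count PST write-IO warnings per (group number, disk number)."""
    counts = Counter()
    for line in lines:
        match = PST_WARNING.search(line)
        if match:
            _secs, disk, group = (int(value) for value in match.groups())
            counts[(group, disk)] += 1
    return counts


# Sample taken from the alert log excerpt in this post.
sample_log = """\
Fri Sep 09 10:32:50 2022
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
Fri Sep 09 10:33:13 2022
NOTE: client exited [2319]
"""

if __name__ == "__main__":
    for (group, disk), hits in sorted(count_pst_warnings(sample_log.splitlines()).items()):
        print(f"group {group}, PST disk {disk}: {hits} warning(s)")
```

In practice the function would be fed the lines of the real ASM alert log file instead of the inline sample.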

Cause analysis:

Oracle's official support site documents this issue:
ASM diskgroup dismount with “Waited 15 secs for write IO to PST” (Doc ID 1581684.1)
Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,causes the affected disks to go offline.By default, it is 15 seconds.
Diskgroup will get dismounted if ASM cannot issue the PST heart beat to majority of the PST copies in a diskgroup with respect to redundancy.
i.e. Normal redundancy diskgroup will get dismounted if it failed to update two of the copies.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,but the heart beat delays do not dismount external redundancy diskgroup directly.
The ASM disk could go into unresponsiveness, normally in the following scenarios:

  • Some of the paths of the physical paths of the multipath device are offline or lost
  • During path ‘failover’ in a multipath set up
  • Server load, or any sort of storage/multipath/OS maintenance
The Doc ID 10109915.8 briefs about Bug 10109915 (this fix introduces this underscore parameter). And the issue is with no OS/Storage tunable timeout mechanism in a case of a Hung NFS Server/Filer. And then _asm_hbeatiowait helps in setting the time out.

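The description above can be summarized as follows:

  1. The ASM instance periodically checks the disks of every disk group to confirm that they still respond.
  2. The check only takes disks offline in normal and high redundancy disk groups; in an external redundancy disk group, heartbeat delays do not directly dismount the disk group.
  3. The default timeout is 15 seconds: if the disk group still has not responded to the ASM instance after 15 seconds, it is dismounted.
  4. This deployment used a shared cloud disk, and only a single disk, so a stall on that disk takes the ASM disk group down.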
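
The majority rule quoted from the Oracle note can be illustrated with a toy model. This is only a sketch of the arithmetic, not how ASM is implemented: the function name `loses_pst_majority` is hypothetical, and the figure of three PST copies for a normal redundancy disk group is an assumption made here to match the note's "failed to update two of the copies" example.

```python
def loses_pst_majority(total_copies: int, failed_copies: int) -> bool:
    """Return True if the heartbeat can no longer reach a majority of PST copies.

    Toy model of the rule quoted above: the disk group is dismounted when ASM
    cannot issue the PST heartbeat to a majority of the PST copies.
    """
    if total_copies < 1 or not 0 <= failed_copies <= total_copies:
        raise ValueError("failed_copies must be between 0 and total_copies")
    reachable = total_copies - failed_copies
    majority = total_copies // 2 + 1
    return reachable < majority


if __name__ == "__main__":
    # Assumption: 3 PST copies in a normal redundancy disk group.
    print(loses_pst_majority(3, 1))  # one copy delayed: majority still reachable
    print(loses_pst_majority(3, 2))  # two copies delayed: majority lost
```

With three copies, losing one still leaves two reachable, while losing two leaves only one, which is the case the note describes as leading to a dismount.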

Solution:

Following Oracle's recommendation, raise _asm_hbeatiowait to 120 seconds.

# Check the current value of _asm_hbeatiowait
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hbeat%' order by ksppinm;
hidden parameter         value
_asm_hbeatiowait         15
_asm_hbeatwaitquantum    2

# Set _asm_hbeatiowait to 120 seconds (in the ASM instance)
SQL> alter system set "_asm_hbeatiowait"=120 scope=spfile;

# Restart CRS and the database so that the spfile change takes effect

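After the change, the system was monitored and no further errors appeared.
Recommendation:
Installing Oracle RAC in a virtualized environment is not recommended.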
