Analysis and handling of a RAC outage caused by ORA-00494

Latest recommended article published on 2022-09-18 17:38:43

coujiongyong0208


Tags: database

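1. Problem analysis
At 8:20 a.m. on December 1, the customer reported that the database could not be reached and that connecting with sqlplus failed as well. I logged in with a remote tool; the database appeared to be up, and I checked the cluster as follows: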
[oracle@myrac1 ~]$ crs_stat -t

The command did not respond for a long time.
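
A hang like this can be detected without tying up the terminal by putting a time limit on the command. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the node; the helper name `run_with_limit` and the 30-second limit are illustrative and were not part of the original diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit and report a hang.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit expires.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG: '$*' gave no response within ${limit}s"
    fi
    return "$status"
}

# Example on a cluster node (not executed here):
#   run_with_limit 30 crs_stat -t
```

The same wrapper could be put around the `sqlplus` check further down, so that a hung query is reported instead of blocking the session.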

[oracle@myrac1 cssd]$ ps -ef | grep pmon
oracle    3308 19190 0 11:41 pts/2    00:00:00 grep pmon
oracle   18254     1 0 09:35 ?        00:00:00 asm_pmon_+ASM1
oracle   18474     1 0 09:36 ?        00:00:00 ora_pmon_testdb1

However, after logging in to the database, it turned out to be completely unusable:

[oracle@myrac1 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 10.2.0.5.0 - Production on Fri Mar 7 10:25:15 2014

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP,
Data Mining Scoring Engine and Real Application Testing options

SQL> select status from v$instance;

The query never returned. In this state the database is effectively unusable.

Checking the alert log showed the following.
Starting at 6:24, the system began reporting errors:
Attempt to get Control File Enqueue by LGWR pid=17130 (mode=X, type=0, timeout=900) is being blocked by inst=2, pid=11498
Please check inst 2's alert log for more information on the blocker including a possible ORA-00494 and related incident logs
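
To find which instance and process hold the enqueue, the blocker can be pulled out of the alert log mechanically. This is a rough sketch: the helper name `find_cf_blockers` is made up for illustration, and the pattern is modelled only on the message quoted above, so other Oracle versions may word it differently.

```shell
#!/bin/sh
# Hypothetical helper: extract the blocking instance and pid from alert-log text read on stdin.
# The pattern is based solely on the message quoted in this article.
find_cf_blockers() {
    sed -n 's/.*is being blocked by inst=\([0-9][0-9]*\), pid=\([0-9][0-9]*\).*/blocker_inst=\1 blocker_pid=\2/p'
}

# Example with the message from this incident, written as a single line:
echo 'Attempt to get Control File Enqueue by LGWR pid=17130 (mode=X, type=0, timeout=900) is being blocked by inst=2, pid=11498' | find_cf_blockers
# prints: blocker_inst=2 blocker_pid=11498
```

On a real node the function would read the alert log itself, for example `find_cf_blockers < alert_testdb1.log`, where the file name is an assumption based on the instance name shown earlier.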

The database had hit ORA-00494:

[oracle@myrac1 cssd]$ oerr ora 494
00494, 00000, "enqueue%s held for too long%s by '%s'"
// *Cause: The specified process did not release the enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the incident information.

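Judging from these errors, the second node of the cluster held a lock on the control file and blocked other processes for more than 900 seconds. The first node could not obtain the resource and proactively killed its background processes. On the second node there were no log entries at all for this period. After the fault was reported, the second node still answered ping but could not be logged in to; when a monitor and keyboard were attached in the machine room, nothing was displayed, which means the operating system had already hung.
On the first node, messages then appeared waiting for instances to leave: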
Waiting for instances to leave:

Waiting for instances to leave:

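The conclusion, then, is that the second node was already holding the CF lock before its operating system hung; the first node could not acquire the lock, and this brought the database down.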

In the end we had no choice but to reboot the second node and force the database on the first node to shut down:

SQL> shutdown abort;

2. Solution

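    To prevent a problem on a single node from bringing down the whole database cluster, Oracle's official solution is as follows.
    When the CF enqueue is held for more than 900 seconds, ORA-00494 is raised. 10.2.0.4 introduced a new handling mechanism for ORA-00494: when the error occurs, the blocking process is killed whether or not it is a background process. While the CF lock stays held, LGWR keeps waiting because it needs that lock, and more and more active sessions wait on log file sync. Once the CF enqueue has been held for more than 900 s and ORA-00494 is raised, CKPT is killed first and LGWR is then killed at 11:57:35; the instance also crashes at 11:57:35. Setting the parameter _kill_controlfile_enqueue_blocker=false prevents any process from being killed (and nothing is done about a CF enqueue held for more than 900 s).
    Setting _kill_enqueue_blocker=1 in init.ora prevents background processes from being killed, but non-background processes are still killed. The root cause still needs to be investigated: why was the CF enqueue held for more than 900 s?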
SQL> select ksppinm,ksppstvl,ksppstdf from x$ksppi a,x$ksppcv b where a.indx = b.indx and ksppinm='_kill_controlfile_enqueue_blocker';

KSPPINM                                   KSPPSTVL
----------------------------------------- ------
_kill_controlfile_enqueue_blocker         TRUE

It is recommended to change this value to false.
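
One way the change might be made is sketched below. It assumes the instances use an spfile (the article itself only mentions init.ora), and the helper name `alter_hidden_param_sql` is hypothetical. Hidden parameters should normally be changed only on Oracle Support's advice.

```shell
#!/bin/sh
# Hypothetical helper: print the ALTER SYSTEM statement for a hidden (underscore) parameter.
# Hidden parameter names must be double-quoted. SCOPE=SPFILE means the change
# takes effect after a restart, and SID='*' targets every RAC instance.
alter_hidden_param_sql() {
    printf "ALTER SYSTEM SET \"%s\"=%s SCOPE=SPFILE SID='*';\n" "$1" "$2"
}

alter_hidden_param_sql _kill_controlfile_enqueue_blocker FALSE

# To apply it, the output could be piped to SQL*Plus as SYSDBA (not executed here):
#   alter_hidden_param_sql _kill_controlfile_enqueue_blocker FALSE | sqlplus -S "/ as sysdba"
```

Printing the statement first, instead of running it directly, leaves room to review it before it reaches a production cluster.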

Source: ITPUB blog, link: http://blog.itpub.net/29371470/viewspace-1102951/. If you repost this article, please credit the source; otherwise legal liability will be pursued.