天萃荷净
Oracle研究中心案例分析:运维DBA反映Oracle数据库出现异常并报错database crash with ora-00494,分析原因为CF锁异常导致。
本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: database crash with ora-00494
database crash with ora-00494
###### alert 如下 ######
Thu Mar 15 22:59:38 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
Thu Mar 15 22:59:40 2012
System State dumped to trace file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc
Thu Mar 15 22:59:41 2012
Thread 1 cannot allocate new log, sequence 167905
Checkpoint not complete
Current log# 5 seq# 167904 mem# 0: /data02/oradata/cn3fsa/redo5a.rdo
Current log# 5 seq# 167904 mem# 1: /data03/oradata/cn3fsa/redo5b.rdo
Thu Mar 15 22:59:47 2012
Thread 1 advanced to log sequence 167905 (LGWR switch)
Current log# 7 seq# 167905 mem# 0: /data04/oradata/cn3fsa/redo7a.rdo
Current log# 7 seq# 167905 mem# 1: /data01/oradata/cn3fsa/redo7b.rdo
Thu Mar 15 22:59:48 2012
Killing enqueue blocker (pid=524594) on resource CF-00000000-00000000
by killing session 550.1
Thu Mar 15 22:59:50 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_ckpt_524594.trc:
ORA-00028: your session has been killed
Thu Mar 15 22:59:50 2012
CKPT: terminating instance due to error 28
Instance terminated by CKPT, pid = 524594
Fri Mar 16 08:50:50 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 500
sessions = 555
从上面错误,我们可以判断是arch1无法获得CF锁,而阻塞者 524594 很可能是后台核心进行,以至于被kill后导致实例crash了。
这里需要说明一下的是该机制是这样的:
当进行需要获取CF enqueue时,最大等待时间为900s,超过900s时,将强制kill阻塞进程。通过如下隐含参数控制这个timeout.
SQL> SHOW parameter controlfile_enq
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_controlfile_enqueue_dump BOOLEAN FALSE
_controlfile_enqueue_holding_time INTEGER 120
_controlfile_enqueue_timeout INTEGER 900
_kill_controlfile_enqueue_blocker BOOLEAN TRUE
下面我们来看看进程 524594 是什么进程?
通过trace我们可以发现,如下:
Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)
我们可以发现,原来是ckpt进程被干掉了,导致数据库crash了,那么我们再进一步分析,是什么进程kill ckpt进程了呢?从上面alert log来看,error是写到arch1 trace里面的,所以这里我想应该是跟lgwr进程有关系。
从网友提供的trace来看,其中lgwr trace都是如下信息:
*** 2012-03-15 22:53:53.488
Warning: log write time 600ms, size 7893KB
*** 2012-03-15 22:54:00.669
Warning: log write time 540ms, size 7362KB
*** 2012-03-15 22:54:18.592
Warning: log write time 640ms, size 4040Khttp://www.oracleplus.netB
*** 2012-03-15 22:54:19.182
Warning: log write time 590ms, size 7412KB
*** 2012-03-15 22:54:30.110
Warning: log write time 910ms, size 538KB
*** 2012-03-15 22:54:36.899
Warning: log write time 550ms, size 7595KB
*** 2012-03-15 22:54:48.679
Warning: log write time 720ms, size 10229KB
*** 2012-03-15 22:54:55.736
Warning: log write time 510ms, size 7491KB
*** 2012-03-15 22:55:07.782
Warning: log write time 830ms, size 4178KB
说明lgwr在不停写,且在等待。