【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启 总结分析过程
时间:2016-12-22 10:58 来源:Oracle研究中心 作者:网络 点击:
次
天萃荷净
Oracle研究中心案例分析:运维DBA反映Oracle RAC环境数据库的一个实例节点被莫名的重启,报错ORA-29740等。详细介绍分析过程。
本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: Rac 实例被莫名重启案例分析
某客户的一套rac,其中node1 被莫名的重启了,如下:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
License high water mark = 5209
Wed Aug 14 08:26:03 2013
Termination issued to instance processes. Waiting for the processes to exit
Wed Aug 14 08:26:09 2013
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 15573208
Wed Aug 14 08:26:10 2013
Instance terminated by USER, pid = 15335602
Wed Aug 14 08:27:17 2013
Starting ORACLE instance (normal)
同时在该时间点,rac的另外一个节点进行了reconfig,信息如下:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
从上面这部分信息看不出什么东西,我们仔细来看下node1 的alert log信息:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected. Receiver ospid 14856422 ++++根据ospid我们发现,14856422 正是lms3.
Receiver is waiting for a latch dumping latch state for receiver -16444
Wed Aug 14 08:23:50 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:
Wed Aug 14 08:23:54 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:51 2013
Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:
ORA-29740: evicted by member 1, group incarnation 6
Wed Aug 14 08:25:54 2013 +++++lmon 进程终止,抛出ora-29740错误
LMON: terminating instance due to error 29740
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:
ORA-29740: evicted by member , group incarnation
……..
……
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
同时,我们查看node 2节点的alert log也发现出现了node 1节点lms3进程超时的信息,如下:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822 +++++ crm2 节点的lms3进程
Receiver: inst 1 binc 57382989 ospid 14856422 +++++ crm1 节点的lms3进程
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
通过node 2节点的trace 文件 /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc,可以确认,454822 进程为crm2节点的lms3进程. 如下:
Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name: AIX
Node name: ngcrmdb2
Release: 3
Version: 5
Machine: 00C8A6E24C00
Instance name: ngcrmdb2
Redo thread mounted by this instance: 0
Oracle process number: 10
Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)
......
......
*** 2013-08-12 10:34:58.266
kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)
*** 2013-08-14 08:23:20.471
*** 2013-08-14 08:23:20.474
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
SKGXPCTX: 0x1104103b0 ctx
WAIT HISTORY
Time(msec) Wait Type Return Code
---------- --------- ------------
14 NORMAL POSTED
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
30 NORMAL TIMEDOUT
wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 8
timed wait receives 87
admno 0xbc1dc98 admport:
SSKGXPT 0x104111c0 flags socket no 11 IP 192.168.5.2 UDP 33191
context timestamp 0x57cd2bd2
buffers queued on port 1104cff40
sconno accono ertt state seq# sent async sync rtrans acks
0x7dd68acd 0x12d3831a 1024 3 42918 292492860 292492982 0 2695 286577042
slot 0 rqh=1110ca6c8
seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
slot 1 rqh=1110ca2e0
seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
slot 2 rqh=1110c9ef8
……
我们可以看出是因为实例之间lms3进程通信交互出现超时导致crm1节点被驱逐重启。根据crm1节点的alert log信息,我们可以整理一个如下的时间关系:
Aug 14 08:23:20 –lms3 Time out(lmon check发现)
Aug 14 08:25:54 –抛出ORA-29740: evicted by member 1, group incarnation 6错误,lmon 进程终止
Aug 14 08:25:54 –完成block recover(Doing block recovery for file 204 block 35113),由smon完成.
Aug 14 08:26:00 –crm1实例 abort关闭.
通过node 1节点的告警日志,我们可以发现,在Aug 14 08:23:20 时刻,lms3进程出现time out。通常来讲,lms进程出现超时,可能是由于系统压力过大,即load过高导致。 下面通过分析lgwr,lmd以及lmon 进程的trace来证实这一点。
首先我们来分析一下lgwr trace:
*** 2013-08-13 02:29:49.603
Warning: log write time 1500ms, size 278KB
*** 2013-08-13 03:24:00.496
Warning: log write time 1440ms, size 13397KB