shedlock 重启系统报错问题_【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启 总结分析过程...

本文详细分析了一起Oracle RAC环境中,由于节点间通信超时导致的实例异常重启案例。在分析过程中,发现了LMS3进程的IPC超时,引发了ORA-29740错误,最终导致实例被驱逐重启。通过对lgwr、lmd和lmon进程的trace文件进行深入研究,揭示了由于大量SQL执行导致的系统负载过高是触发问题的根本原因。
摘要由CSDN通过智能技术生成

【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启 总结分析过程

时间:2016-12-22 10:58   来源:Oracle研究中心   作者:网络   点击:

天萃荷净

Oracle研究中心案例分析:运维DBA反映Oracle RAC环境数据库的一个实例节点被莫名的重启,报错ORA-29740等。详细介绍分析过程。

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: Rac 实例被莫名重启案例分析

某客户的一套rac,其中node1 被莫名的重启了,如下:

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Doing block recovery for file 204 block 35113

Wed Aug 14 08:26:00 2013

Shutting down instance (abort)

License high water mark = 5209

Wed Aug 14 08:26:03 2013

Termination issued to instance processes. Waiting for the processes to exit

Wed Aug 14 08:26:09 2013

Instance termination failed to kill one or more processes

Instance terminated by LMON, pid = 15573208

Wed Aug 14 08:26:10 2013

Instance terminated by USER, pid = 15335602

Wed Aug 14 08:27:17 2013

Starting ORACLE instance (normal)

同时在该时间点,rac的另外一个节点进行了reconfig,信息如下:

Wed Aug 14 08:23:20 2013

IPC Send timeout detected.Sender: ospid 454822

Receiver: inst 1 binc 57382989 ospid 14856422

Wed Aug 14 08:23:29 2013

IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10

Wed Aug 14 08:23:29 2013

Communications reconfiguration: instance_number 1

Wed Aug 14 08:23:30 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:19 2013

Evicting instance 1 from cluster

Wed Aug 14 08:26:09 2013

Reconfiguration started (old inc 4, new inc 8)

List of nodes:

1

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Wed Aug 14 08:26:10 2013

LMS 5: 23 GCS shadows cancelled, 1 closed

Wed Aug 14 08:26:10 2013

从上面这部分信息看不出什么东西,我们仔细来看下node1 的alert log信息:

Wed Aug 14 08:23:20 2013

IPC Send timeout detected. Receiver ospid 14856422     ++++根据ospid我们发现,14856422 正是lms3.

Receiver is waiting for a latch dumping latch state for receiver -16444

Wed Aug 14 08:23:50 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:

Wed Aug 14 08:23:54 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:51 2013

Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:

ORA-29740: evicted by member 1, group incarnation 6

Wed Aug 14 08:25:54 2013                              +++++lmon 进程终止,抛出ora-29740错误

LMON: terminating instance due to error 29740

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:

ORA-29740: evicted by member , group incarnation

……..

……

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Doing block recovery for file 204 block 35113

Wed Aug 14 08:26:00 2013

Shutting down instance (abort)

同时,我们查看node 2节点的alert log也发现出现了node 1节点lms3进程超时的信息,如下:

Wed Aug 14 08:23:20 2013

IPC Send timeout detected.Sender: ospid 454822     +++++ crm2 节点的lms3进程

Receiver: inst 1 binc 57382989 ospid 14856422      +++++ crm1 节点的lms3进程

Wed Aug 14 08:23:29 2013

IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10

Wed Aug 14 08:23:29 2013

Communications reconfiguration: instance_number 1

Wed Aug 14 08:23:30 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:19 2013

Evicting instance 1 from cluster

Wed Aug 14 08:26:09 2013

Reconfiguration started (old inc 4, new inc 8)

List of nodes:

1

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Wed Aug 14 08:26:10 2013

LMS 5: 23 GCS shadows cancelled, 1 closed

Wed Aug 14 08:26:10 2013

通过node 2节点的trace 文件 /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc,可以确认,454822 进程为crm2节点的lms3进程. 如下:

Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine

and Real Application Testing options

ORACLE_HOME = /oracle/app/product/10.2.0/db_1

System name: AIX

Node name: ngcrmdb2

Release: 3

Version: 5

Machine: 00C8A6E24C00

Instance name: ngcrmdb2

Redo thread mounted by this instance: 0

Oracle process number: 10

Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)

......

......

*** 2013-08-12 10:34:58.266

kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)

*** 2013-08-14 08:23:20.471

*** 2013-08-14 08:23:20.474

IPC Send timeout detected.Sender: ospid 454822

Receiver: inst 1 binc 57382989 ospid 14856422

SKGXPCTX: 0x1104103b0 ctx

WAIT HISTORY

Time(msec)  Wait Type  Return Code

----------  ---------  ------------

14   NORMAL   POSTED

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

30   NORMAL   TIMEDOUT

wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2

user cpu time since last wait 0 sec 0 ticks

system cpu time since last wait 0 sec 0 ticks

locked 1

blocked 8

timed wait receives 87

admno 0xbc1dc98 admport:

SSKGXPT 0x104111c0 flags   socket no 11  IP 192.168.5.2  UDP 33191

context timestamp 0x57cd2bd2

buffers queued on port 1104cff40

sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks

0x7dd68acd 0x12d3831a   1024      3  42918 292492860 292492982      0   2695 286577042

slot 0 rqh=1110ca6c8

seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1

slot 1 rqh=1110ca2e0

seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1

slot 2 rqh=1110c9ef8

……

我们可以看出是因为实例之间lms3进程通信交互出现超时导致crm1节点被驱逐重启。根据crm1节点的alert log信息,我们可以整理一个如下的时间关系:

Aug 14 08:23:20 –lms3 Time out(lmon check发现)

Aug 14 08:25:54 –抛出ORA-29740: evicted by member 1, group incarnation 6错误,lmon 进程终止

Aug 14 08:25:54 –完成block recover(Doing block recovery for file 204 block 35113),由smon完成.

Aug 14 08:26:00 –crm1实例 abort关闭.

通过node 1节点的告警日志,我们可以发现,在Aug 14 08:23:20 时刻,lms3进程出现time out。通常来讲,lms进程出现超时,可能是由于系统压力过大,即load过高导致。 下面通过分析lgwr,lmd以及lmon 进程的trace来证实这一点。

首先我们来分析一下lgwr trace:

*** 2013-08-13 02:29:49.603

Warning: log write time 1500ms, size 278KB

*** 2013-08-13 03:24:00.496

Warning: log write time 1440ms, size 13397KB

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值