shedlock 重启系统报错问题_【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启总结分析过程...

最新推荐文章于 2021-04-15 09:20:40 发布

weixin_39989159

最新推荐文章于 2021-04-15 09:20:40 发布

阅读量495

点赞数

文章标签： shedlock 重启系统报错问题

本文链接：https://blog.csdn.net/weixin_39989159/article/details/111775750

版权

本文详细分析了一起Oracle RAC环境中，由于节点间通信超时导致的实例异常重启案例。在分析过程中，发现了LMS3进程的IPC超时，引发了ORA-29740错误，最终导致实例被驱逐重启。通过对lgwr、lmd和lmon进程的trace文件进行深入研究，揭示了由于大量SQL执行导致的系统负载过高是触发问题的根本原因。

摘要由CSDN通过智能技术生成

【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启总结分析过程

时间:2016-12-22 10:58 来源:Oracle研究中心作者:网络点击:

次

天萃荷净

Oracle研究中心案例分析：运维DBA反映Oracle RAC环境数据库的一个实例节点被莫名的重启，报错ORA-29740等。详细介绍分析过程。

本站文章除注明转载外，均为本站原创：转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: Rac 实例被莫名重启案例分析

某客户的一套rac，其中node1 被莫名的重启了，如下：

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Doing block recovery for file 204 block 35113

Wed Aug 14 08:26:00 2013

Shutting down instance (abort)

License high water mark = 5209

Wed Aug 14 08:26:03 2013

Termination issued to instance processes. Waiting for the processes to exit

Wed Aug 14 08:26:09 2013

Instance termination failed to kill one or more processes

Instance terminated by LMON, pid = 15573208

Wed Aug 14 08:26:10 2013

Instance terminated by USER, pid = 15335602

Wed Aug 14 08:27:17 2013

Starting ORACLE instance (normal)

同时在该时间点，rac的另外一个节点进行了reconfig，信息如下：

Wed Aug 14 08:23:20 2013

IPC Send timeout detected.Sender: ospid 454822

Receiver: inst 1 binc 57382989 ospid 14856422

Wed Aug 14 08:23:29 2013

IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10

Wed Aug 14 08:23:29 2013

Communications reconfiguration: instance_number 1

Wed Aug 14 08:23:30 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:19 2013

Evicting instance 1 from cluster

Wed Aug 14 08:26:09 2013

Reconfiguration started (old inc 4, new inc 8)

List of nodes:

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Wed Aug 14 08:26:10 2013

LMS 5: 23 GCS shadows cancelled, 1 closed

Wed Aug 14 08:26:10 2013

从上面这部分信息看不出什么东西，我们仔细来看下node1 的alert log信息：

Wed Aug 14 08:23:20 2013

IPC Send timeout detected. Receiver ospid 14856422 ++++根据ospid我们发现，14856422 正是lms3.

Receiver is waiting for a latch dumping latch state for receiver -16444

Wed Aug 14 08:23:50 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:

Wed Aug 14 08:23:54 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:51 2013

Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:

ORA-29740: evicted by member 1, group incarnation 6

Wed Aug 14 08:25:54 2013 +++++lmon 进程终止，抛出ora-29740错误

LMON: terminating instance due to error 29740

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:

ORA-29740: evicted by member , group incarnation

……..

……

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:

ORA-29740: evicted by member , group incarnation

Wed Aug 14 08:25:54 2013

Doing block recovery for file 204 block 35113

Wed Aug 14 08:26:00 2013

Shutting down instance (abort)

同时，我们查看node 2节点的alert log也发现出现了node 1节点lms3进程超时的信息，如下：

Wed Aug 14 08:23:20 2013

IPC Send timeout detected.Sender: ospid 454822 +++++ crm2 节点的lms3进程

Receiver: inst 1 binc 57382989 ospid 14856422 +++++ crm1 节点的lms3进程

Wed Aug 14 08:23:29 2013

IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10

Wed Aug 14 08:23:29 2013

Communications reconfiguration: instance_number 1

Wed Aug 14 08:23:30 2013

Trace dumping is performing id=[cdmp_20130814082329]

Wed Aug 14 08:25:19 2013

Evicting instance 1 from cluster

Wed Aug 14 08:26:09 2013

Reconfiguration started (old inc 4, new inc 8)

List of nodes:

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Wed Aug 14 08:26:10 2013

LMS 5: 23 GCS shadows cancelled, 1 closed

Wed Aug 14 08:26:10 2013

通过node 2节点的trace 文件 /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc，可以确认，454822 进程为crm2节点的lms3进程. 如下：

Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine

and Real Application Testing options

ORACLE_HOME = /oracle/app/product/10.2.0/db_1

System name: AIX

Node name: ngcrmdb2

Release: 3

Version: 5

Machine: 00C8A6E24C00

Instance name: ngcrmdb2

Redo thread mounted by this instance: 0

Oracle process number: 10

Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)

......

*** 2013-08-12 10:34:58.266

kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)

*** 2013-08-14 08:23:20.471

*** 2013-08-14 08:23:20.474

IPC Send timeout detected.Sender: ospid 454822

Receiver: inst 1 binc 57382989 ospid 14856422

SKGXPCTX: 0x1104103b0 ctx

WAIT HISTORY

Time(msec) Wait Type Return Code

---------- --------- ------------

14 NORMAL POSTED

30 NORMAL TIMEDOUT

wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2

user cpu time since last wait 0 sec 0 ticks

system cpu time since last wait 0 sec 0 ticks

locked 1

blocked 8

timed wait receives 87

admno 0xbc1dc98 admport:

SSKGXPT 0x104111c0 flags socket no 11 IP 192.168.5.2 UDP 33191

context timestamp 0x57cd2bd2

buffers queued on port 1104cff40

sconno accono ertt state seq# sent async sync rtrans acks

0x7dd68acd 0x12d3831a 1024 3 42918 292492860 292492982 0 2695 286577042

slot 0 rqh=1110ca6c8

seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1

slot 1 rqh=1110ca2e0

seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1

slot 2 rqh=1110c9ef8

……

我们可以看出是因为实例之间lms3进程通信交互出现超时导致crm1节点被驱逐重启。根据crm1节点的alert log信息，我们可以整理一个如下的时间关系:

Aug 14 08:23:20 –lms3 Time out(lmon check发现)

Aug 14 08:25:54 –抛出ORA-29740: evicted by member 1, group incarnation 6错误，lmon 进程终止

Aug 14 08:25:54 –完成block recover(Doing block recovery for file 204 block 35113)，由smon完成.

Aug 14 08:26:00 –crm1实例 abort关闭.

通过node 1节点的告警日志，我们可以发现，在Aug 14 08:23:20 时刻，lms3进程出现time out。通常来讲，lms进程出现超时，可能是由于系统压力过大，即load过高导致。下面通过分析lgwr,lmd以及lmon 进程的trace来证实这一点。

首先我们来分析一下lgwr trace：

*** 2013-08-13 02:29:49.603

Warning: log write time 1500ms, size 278KB

*** 2013-08-13 03:24:00.496

Warning: log write time 1440ms, size 13397KB

最低0.47元/天解锁文章

weixin_39989159

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
shedlock 重启系统报错问题_【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启总结分析过程...

【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启总结分析过程时间:2016-12-22 10:58来源:Oracle研究中心作者:网络点击:次天萃荷净Oracle研究中心案例分析：运维DBA反映Oracle RAC环境数据库的一个实例节点被莫名的重启，报错ORA-29740等。详细介绍分析过程。本站文章除注明转载外，均为本站原创：转载自love wife ...
复制链接

扫一扫

shedlock 重启系统报错问题_【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启 总结分析过程...

“相关推荐”对你有帮助么？

shedlock 重启系统报错问题_【案例】Oracle报错ORA-29740 RAC实例节点被莫名重启总结分析过程...