HP平台，9i RAC instance 2被驱逐故障诊断

最新推荐文章于 2021-04-13 19:42:48 发布

cndoc40428

最新推荐文章于 2021-04-13 19:42:48 发布

阅读量189

点赞数

文章标签：数据库

前几日上午，在毫不知情的情况下，突然手机收到了系统的告警短信，报一营业系统无法连接，汗~，无法连接数据库，这是多么严重的告警，何况是营业厅的核心系统，直接影响用户体验。遂立即登录系统查看，发现双节点的RAC环境中节点2上的instance宕掉了，还好是双节点，可能性能会有所影响但不至于营业厅无法办理业务（这时候，RAC的好处显现的可谓淋淋尽致了）。

检查进程：

■■db2:[/oracle]$ps -ef | grep ora_
oracle 914114 3548012   0 19:00:23 pts/3 0:00 grep ora_
■■db2:[/oracle]$ps -ef | grep lsnr
oracle 22257 22179 0 11:06:48 pts/1     0:00 grep lsnr
oracle 15157     1 0 07:09:39 ?         0:00 /orarac/app/oracle/product/9.2.0/bin/tnslsnr LISTENER -inherit

系统中已经没有任何instance’s background processes了，不过监听进程还在，监听为：LISTENER。

还等啥呢，赶紧把库起来吧

SQL> startup
ORACLE instance started.

Total System Global Area 3.2212E+10 bytes
Fixed Size                   782472 bytes
Variable Size            4815060992 bytes
Database Buffers         2.7380E+10 bytes
Redo Buffers               16003072 bytes
Database mounted.
Database opened.

还好，数据库正常open了。

表象的东西解决后，就该更为深层次的分析是什么原因导致的instance down了。

节点1的alert日志：

Tue Oct 18 10:56:34 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster

在节点1的系统时间为：

10:56:34 集群发生脑裂，clusterware split-brain
11:03:00 启动P001进程发生超时
11:06:34 节点2上的instance 2被逐出集群了

节点2的alert日志：

Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
…………………..
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lmon_15081.trc:
ORA-29740: evicted by member 1, group incarnation 19
Tue Oct 18 11:03:45 2011
System state dump is made for local instance

在节点2的系统时间为：

10:51:57 检测到IPC send timeout
10:51:59 Communications reconfiguration:instance 0,instance 0表示实例关闭
10:53:45 集群发生脑裂，clusterware split-brain
11:02:14 启动P001进程发生超时
11:03:45 发生error 481，instance down，并出现ORA-29740错误，我手机也是在这个时间点接到告警短信的

据eygle的说法，主要是在资源紧张、网络拥堵等情况下，就有可能发生IPC超时的问题，而cluster随后就会将问题节点驱逐，引发新一轮的Reconfiguration。

在来看看产生的trc文件，■■db2_lmon_15081.trc

*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10 Senders 10
GES IPC: Buffers Receive 1000 Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size Regular 440 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-10-18 07:09:01.649
     Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec
List of nodes:
0 1
Global Resource Directory frozen
node 0
release 9 2 0 7
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
Total master weight = 6244032
Dead inst
Join inst 0 1
Exist inst
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-18 07:09:11.515
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 1518892 GCS resources
10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
     Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information
Member 1, thread 2, state 4, flags 0x00a1
  RR seq 19, propstate 3, pending propstate 0
rcfg rsn 3, rcfg time 21546700, mem ct 2
master 1, master rcfg time 21546700

Member information:
    Member 0, incarn 18, version 1840361946
      thrd 1, prev thrd 65535, status 0x0007, err 0x0002
      valct 0
    Member 1, incarn 18, version 816352
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2

Group name: ■■DB
Member id: 1
Cached SKGXN event: 0
Group State:
State: 18 1
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0 master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0 agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19

从trc文件中可以得到如下信息：

*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.

在节点2的系统时间为：

10:51:59 Initiating reconfig,reason 3
Reconfiguration started,reason 3，这里的reason 3是重要信息

*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)

在节点2的系统时间为：

10:53:45 Waiting for split-brain resolution，此时cluster正等待解决脑裂
11:03:45 Evicted by 1，此时节点2上的instance被驱逐

查询metalink ID 219316.1

For troubleshooting ORA-29740 errors,the ‘reason’ will be very important.In the above example,the first section indicates the reason for the initiated reconfiguration.The reason are as follows:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

再看看引起 reason 3的几种可能性：

a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.

本case的reason标识就是3，所以应该Communications Failure。我想最可能的原因就是心跳出了问题（在当日凌晨2点，给主机换硬件时，曾因网络问题导致心跳异常），其他的原因，如：资源紧张，网络堵塞…我想基本上应该可以排除的，这边的硬件环境算是比较“高档”了，并且，从平时的监控来看，节点2的负载还是可以的。

上面的关于重要信息的时间点写的比较乱，简单整理一个在日志中记录重要事件的时间点：

时间

节点1

节点2