HP平台,9i RAC instance 2被驱逐故障诊断

前几日上午,在毫不知情的情况下,突然手机收到了系统的告警短信,报一营业系统无法连接,汗~,无法连接数据库,这是多么严重的告警,何况是营业厅的核心系统,直接影响用户体验。遂立即登录系统查看,发现双节点的RAC环境中节点2上的instance宕掉了,还好是双节点,可能性能会有所影响但不至于营业厅无法办理业务(这时候,RAC的好处显现的可谓淋淋尽致了)。

检查进程:

■■db2:[/oracle]$ps -ef | grep ora_
  oracle  914114 3548012   0 19:00:23  pts/3  0:00 grep ora_ 
■■db2:[/oracle]$ps -ef | grep lsnr
  oracle 22257 22179  0 11:06:48 pts/1     0:00 grep lsnr
  oracle 15157     1  0 07:09:39 ?         0:00 /orarac/app/oracle/product/9.2.0/bin/tnslsnr LISTENER -inherit 

系统中已经没有任何instance’s background processes了,不过监听进程还在,监听为:LISTENER

还等啥呢,赶紧把库起来吧

SQL> startup
ORACLE instance started.

Total System Global Area 3.2212E+10 bytes
Fixed Size                   782472 bytes
Variable Size            4815060992 bytes
Database Buffers         2.7380E+10 bytes
Redo Buffers               16003072 bytes
Database mounted.
Database opened.

还好,数据库正常open了。

表象的东西解决后,就该更为深层次的分析是什么原因导致的instance down了。

节点1alert日志:

Tue Oct 18 10:56:34 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster

在节点1的系统时间为:

10:56:34  集群发生脑裂,clusterware split-brain
11:03:00 
启动P001进程发生超时
11:06:34 
节点2上的instance 2被逐出集群了

节点2alert日志:

Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving  log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/
■■db/bdump/■■db2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
…………………..
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/
■■db/bdump/■■db2_lmon_15081.trc:
ORA-29740: evicted by member 1, group incarnation 19
Tue Oct 18 11:03:45 2011
System state dump is made for local instance

在节点2的系统时间为:

10:51:57  检测到IPC send timeout
10:51:59  Communications reconfiguration:instance 0,instance 0
表示实例关闭
10:53:45 
集群发生脑裂,clusterware split-brain
11:02:14 
启动P001进程发生超时
11:03:45 
发生error 481instance down,并出现ORA-29740错误,我手机也是在这个时间点接到告警短信的

eygle的说法,主要是在资源紧张、网络拥堵等情况下,就有可能发生IPC超时的问题,而cluster随后就会将问题节点驱逐,引发新一轮的Reconfiguration

在来看看产生的trc文件,■■db2_lmon_15081.trc

*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10  Senders 10
GES IPC: Buffers  Receive 1000  Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size  Regular 440  Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-10-18 07:09:01.649
     Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec
List of nodes:
 0 1
 Global Resource Directory frozen
node 0
release 9 2 0 7
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
 Total master weight = 6244032
 Dead  inst
 Join  inst 0 1
 Exist inst
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
*** 2011-10-18 07:09:11.515
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS cache requests
 0 write requests issued in 1518892 GCS resources
 10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
     Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information
  Member 1, thread 2, state 4, flags 0x00a1
  RR seq 19, propstate 3, pending propstate 0
  rcfg rsn 3, rcfg time 21546700, mem ct 2
  master 1, master rcfg time 21546700
 
  Member information:
    Member 0, incarn 18, version 1840361946
      thrd 1, prev thrd 65535, status 0x0007, err 0x0002
      valct 0
    Member 1, incarn 18, version 816352
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2
 
Group name:
■■DB
Member id: 1
Cached SKGXN event: 0
Group State:
  State: 18 1
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2  Rcvcnt 0
  Substate Proposal: false
Inc Proposal:
  incarn 0  memcnt 0  master 0
  proposal false  matched false
  map:
Master Inc State:
  incarn 0  memcnt 0  agrees 0  flag 0x1
  wmap:
  nmap:
  ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19

trc文件中可以得到如下信息:

*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.

在节点2的系统时间为:

10:51:59  Initiating reconfig,reason 3
              Reconfiguration started,reason 3
,这里的reason 3是重要信息

*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)

在节点2的系统时间为:

10:53:45  Waiting for split-brain resolution,此时cluster正等待解决脑裂
11:03:45  Evicted by 1
,此时节点2上的instance被驱逐

查询metalink ID 219316.1

For troubleshooting ORA-29740 errors,the ‘reason’ will be very important.In the above example,the first section indicates the reason for the initiated reconfiguration.The reason are as follows:


Reason 0 = No reconfiguration 
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

再看看引起 reason 3的几种可能性:

a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of  another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.

casereason标识就是3,所以应该Communications Failure。我想最可能的原因就是心跳出了问题(在当日凌晨2点,给主机换硬件时,曾因网络问题导致心跳异常),其他的原因,如:资源紧张,网络堵塞我想基本上应该可以排除的,这边的硬件环境算是比较“高档”了,并且,从平时的监控来看,节点2的负载还是可以的。

上面的关于重要信息的时间点写的比较乱,简单整理一个在日志中记录重要事件的时间点:

时间

节点1

节点2

10:51:57

 

IPC Send timeout detected

10:51:59

 

Communications reconfiguration:instance 0

10:53:45

 

Waiting for clusterware split-brain resolution

10:56:34

Waiting for clusterware split-brain resolution

 

11:02:14

 

Timed out trying to start process P001.

11:03:00

Timed out trying to start process P001.

 

11:03:45

 

USER: terminating instance due to error 481
ORA-29740: evicted by member 1, group incarnation 19

11:06:34

Evicting instance 2 from cluster

 

不知为何原因,当日凌成2点重启主机后,节点2上的NTP服务异常,直到节点2上的instance 异常,处理故障时发现,节点2比节点1慢了3分钟,具体点,应该是3分钟11秒左右。当时我逐步以每次30秒调整到与节点2的主机系统时间基本保持了一致,后来主机工程师调整了NTP后,该主机上的NTP服务也恢复正常了,此时,两个节点的系统时间均与时间服务器保持同步了。

所以上面的时间点不算很准确,因为节点2的系统时间如果加上314秒的话,在日志中的某些重要事件信息的时间点,两个节点基本就吻合了。如下:

时间

节点1

节点2

10:51:57

 

IPC Send timeout detected

10:51:59

 

Communications reconfiguration:instance 0

10:53:45

 

Waiting for clusterware split-brain resolution

10:56:34

Waiting for clusterware split-brain resolution

 

11:02:14

 

Timed out trying to start process P001.

11:03:00

Timed out trying to start process P001.

 

11:03:45

 

USER: terminating instance due to error 481
ORA-29740: evicted by member 1, group incarnation 19

11:06:34

Evicting instance 2 from cluster

 

在上面这个表格中,发生split-brain时,其实两个节点是同时发生的,准确时间是:10:56:34;在节点的alert日志中发生instance 2被驱逐事件的时间点,其实节点2上报ORA-29740错误的时间其实也是同时发生的。

这样以来就很清楚的知道了,节点2

10:51:57      (这个时间点其实应该再加上311秒,即10:55:08)发生IPC Send imeout

10:56:34      集群发生脑裂(Split-brain

11:06:34      节点2上的实例 instance 2被驱逐

 

至于,心跳超时的原因,个人感觉最好的解释就是,因两个节点相差311秒(即:节点2 节点1 3分钟11秒),导致心跳异常,从而引起脑裂以至instance 2被驱逐。

如果大家有谁遇到过相同问题的,欢迎与我沟通、交流,谢谢!

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/25834554/viewspace-709574/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/25834554/viewspace-709574/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值