Oracle research center case study: an operations DBA reported that an Oracle RAC node had restarted; analysis identified split-brain as the cause, confirmed by working through the logs.
Unless marked as a reprint, articles on this site are original. Reprinted from love wife & love life — Roger's Oracle technical blog
Permalink: 9i RAC split-brain causes node restart
Below are some traces from a case where an Oracle 9i RAC node restarted, with a rough analysis:
First, let's look at node 1's alert log:
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 19)
List of nodes:0
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 20)
We can see that at 11:06:34 the cluster began evicting instance 2.
Next, let's look at node 2's alert log:
Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/siebdb/bdump/siebdb2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
Tue Oct 18 11:03:45 2011
Tue Oct 18 11:07:56 2011
Reconfiguration started (old inc 0, new inc 21)
List of nodes:0 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS remote-cache requests
Tue Oct 18 11:08:19 2011
0 write requests issued in 1511603 GCS resources
436 PIs marked suspect, 0 flush PI msgs
Tue Oct 18 11:08:22 2011
Reconfiguration complete
From node 2's alert log we can build the following timeline:
10:51:57 --> IPC send timeout
10:51:59 --> communications failure; the cluster begins a reconfiguration ("instance 0" names node 0, the peer instance with which communication was lost)
10:53:45 --> split-brain detected; waiting for clusterware resolution
11:02:14 --> timed out trying to start process P001 (from this we can guess that ospid 22239 above was probably the P001 process)
11:03:45 --> the instance is terminated due to error 481 (ORA-00481: LMON process terminated with error)
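The timeline above can be reconstructed mechanically from the alert log, since 9i writes each timestamp on its own line ahead of the events it covers. A minimal sketch (the parsing logic is mine, not any Oracle tool; the sample text is abbreviated from the log above):

```python
from datetime import datetime

# Abbreviated fragment of node 2's alert log, as shown above
alert_log = """Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481"""

TS_FMT = "%a %b %d %H:%M:%S %Y"   # e.g. "Tue Oct 18 10:51:57 2011"

timeline = []
current_ts = None
for line in alert_log.splitlines():
    try:
        # Timestamp lines update the "current" time...
        current_ts = datetime.strptime(line.strip(), TS_FMT)
    except ValueError:
        # ...and every other line is an event stamped with that time
        if current_ts is not None:
            timeline.append((current_ts, line.strip()))

for ts, event in timeline:
    print(ts.strftime("%H:%M:%S"), event)
```

Running this over the full alert log yields exactly the timeline listed above.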
From the analysis above we can conclude that split-brain caused instance 2 to crash. So what triggered the split-brain?
Finally, let's look at the trace of the LMON process mentioned in the alert log:
*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10 Senders 10
GES IPC: Buffers Receive 1000 Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size Regular 440 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2011-10-18 07:09:01.649
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
Name Service normal
Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec ====> synchronization timeout of 660 s
List of nodes:
0 1 ====> cluster node numbers
Global Resource Directory frozen
node 0
release 9 2 0 7 ====> Oracle version
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
Total master weight = 6244032
Dead inst
Join inst 0 1
Exist inst
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-18 07:09:11.515
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 1518892 GCS resources
10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470 ====> this timestamp is important
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3 ====> communications failure
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311 ====> split-brain detected
kjxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information ====> IMR is triggered
Member 1, thread 2, state 4, flags 0x00a1
RR seq 19, propstate 3, pending propstate 0
rcfg rsn 3, rcfg time 21546700, mem ct 2
master 1, master rcfg time 21546700
Member information:
Member 0, incarn 18, version 1840361946
thrd 1, prev thrd 65535, status 0x0007, err 0x0002
valct 0
Member 1, incarn 18, version 816352
thrd 2, prev thrd 65535, status 0x0107, err 0x0000
valct 2
Group name: SIEBDB
Member id: 1
Cached SKGXN event: 0
Group State:
State: 18 1
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0 master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0 agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19
Looking at the timestamps as a whole, the entire process from triggering the split-brain to completing its resolution took roughly 15 minutes,
approximately 10:51 - 11:07.
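That figure can be checked with simple datetime arithmetic, using the first symptom on node 2 (the IPC send timeout at 10:51:57) and the final reconfiguration starting at 11:07:56:

```python
from datetime import datetime

fmt = "%H:%M:%S"
start = datetime.strptime("10:51:57", fmt)   # IPC Send timeout detected on node 2
end = datetime.strptime("11:07:56", fmt)     # final "Reconfiguration started" on node 2

delta = end - start
print(delta)   # -> 0:15:59, just under 16 minutes
```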
The reconfiguration reason codes are interpreted as follows:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
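The reason codes above can be folded into a small lookup helper for annotating trace lines; the function name and regular expression here are mine, purely illustrative:

```python
import re

# Reconfiguration reason codes as listed above
RECONFIG_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications failure",
    4: "Reconfiguration after suspend",
}

def explain_reconfig(trace_line):
    """Return the reason text for a '... reconfig, reason N' trace line, else None."""
    m = re.search(r"reason (\d+)", trace_line)
    return RECONFIG_REASONS.get(int(m.group(1))) if m else None

print(explain_reconfig("kjxgrrcfgchk: Initiating reconfig, reason 3"))
# -> Communications failure
```

Applied to this case, both `kjxgrrcfgchk` and `kjxgmrcfg` report reason 3, consistent with a communications failure between the nodes.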
If the 660 s above were really the heartbeat timeout threshold, this failure should not have occurred.
According to a colleague, the clocks on the nodes of this RAC cluster differed by 3 minutes, and in the early hours of that day the hardware vendor had been replacing batteries.
Given that, we can fairly safely conclude that the split-brain was caused by the unsynchronized clocks. We know that when a third-party clusterware such as an HA product is present, I/O fencing is handled by it, so I still don't know what the exact threshold is here.
In other words, how many seconds of divergence does it take before a split-brain occurs? Could it be the 660 s mentioned earlier?
Finally, a supplementary note: LMON mainly checks two kinds of heartbeat:
1. Network heartbeat - in plain terms, the cluster heartbeat (presumably implemented via a ping mechanism)
2. Control-file heartbeat - the CKPT process updates the control file every 3 seconds, and this is used for the liveness check.
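As a rough illustration of the control-file heartbeat idea, and of why clock skew matters to it, here is a toy model; the miss threshold and the comparison logic are my assumptions, not Oracle internals:

```python
# Illustrative constants, not Oracle's actual values
HEARTBEAT_INTERVAL = 3   # CKPT stamps the control file every 3 s
MISSED_BEATS = 5         # tolerated missed beats before declaring a node dead (assumed)

def is_alive(last_beat_ts, now):
    """A node looks alive if its last control-file heartbeat stamp is recent enough."""
    return (now - last_beat_ts) <= HEARTBEAT_INTERVAL * MISSED_BEATS

now = 1000.0
print(is_alive(now - 10, now))    # fresh heartbeat -> True
# A peer whose clock runs 3 minutes behind writes stamps that look 180 s stale:
print(is_alive(now - 180, now))   # -> False: skew alone can mimic a dead node
```

This is exactly the failure mode suspected in this case: a 3-minute clock difference can make a perfectly healthy peer's heartbeat stamps look stale, which is indistinguishable from a lost node.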