[Case Study] Oracle RAC Split-Brain Causing a Node Restart: Root Cause Analysis

天萃荷净

Oracle Research Center case analysis: an operations DBA reported that an Oracle RAC node had restarted. The root cause turned out to be split-brain; the following walks through the logs to explain how it happened.

Unless marked as a reprint, articles on this site are original. Reprinted from love wife & love life — Roger's Oracle technical blog.

Permalink: 9i RAC split-brain causing node restart

Below is a rough analysis of some traces from a node-restart case on an Oracle 9i RAC:

First, let's look at node 1's alert log:

Tue Oct 18 11:03:00 2011

Timed out trying to start process P001.

Tue Oct 18 11:06:34 2011

Evicting instance 2 from cluster

Tue Oct 18 11:06:38 2011

Reconfiguration started (old inc 18, new inc 19)

List of nodes:0

Tue Oct 18 11:06:38 2011

Reconfiguration started (old inc 18, new inc 20)

We can see that at 11:06:34 the cluster begins evicting instance 2.

Next, let's look at node 2's alert log:

Tue Oct 18 10:51:57 2011

IPC Send timeout detected. Sender ospid 22239

Tue Oct 18 10:51:59 2011

Communications reconfiguration: instance 0

Tue Oct 18 10:52:03 2011

ARC0: Completed archiving  log 6 thread 2 sequence 87953

Tue Oct 18 10:53:45 2011

Waiting for clusterware split-brain resolution

Tue Oct 18 11:02:14 2011

Timed out trying to start process P001.

Tue Oct 18 11:03:45 2011

USER: terminating instance due to error 481

Tue Oct 18 11:03:45 2011

Errors in file /orarac/app/oracle/admin/siebdb/bdump/siebdb2_lgwr_15116.trc:

ORA-00481: LMON process terminated with error

Tue Oct 18 11:03:45 2011

Tue Oct 18 11:07:56 2011

Reconfiguration started (old inc 0, new inc 21)

List of nodes:0 1

Global Resource Directory frozen

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Resources and enqueues cleaned out

Resources remastered 0

0 GCS shadows traversed, 0 cancelled, 0 closed

0 GCS resources traversed, 0 cancelled

set master node info

Submitted all remote-enqueue requests

Update rdomain variables

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

0 GCS shadows traversed, 0 replayed, 0 unopened

Submitted all GCS remote-cache requests

Tue Oct 18 11:08:19 2011

0 write requests issued in 1511603 GCS resources

436 PIs marked suspect, 0 flush PI msgs

Tue Oct 18 11:08:22 2011

Reconfiguration complete

From node 2's alert log we can derive the following timeline:

10:51:57 --> IPC send timeout

10:51:59 --> communication failure; the cluster starts a communications reconfiguration (the "instance 0" in the log is the member number of the peer instance it lost contact with, i.e. instance 1)

10:53:45 --> split-brain detected; waiting for clusterware resolution

11:02:14 --> timed out trying to start process P001 (from this, one might infer that ospid 22239 above was the P001 process)

11:03:45 --> instance terminated, followed by the ORA-00481 error

From the analysis above we can conclude that the split-brain caused instance 2 to be evicted and crash. So what caused the split-brain in the first place?

Finally, let's look at the trace of the LMON process mentioned in the alert log:

*** SESSION ID:(3.1) 2011-10-18 07:08:57.162

GES IPC: Receivers 10  Senders 10

GES IPC: Buffers  Receive 1000  Send (i:6830 b:6830) Reserve 1000

GES IPC: Msg Size  Regular 440  Batch 2048

Batch msg size = 2048

Batching factor: enqueue replay 47, ack 52

Batching factor: cache replay 29 size per lock 64

kjxggin: receive buffer size = 32768

kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)

*** 2011-10-18 07:09:01.648

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 0 0.

*** 2011-10-18 07:09:01.649

Name Service frozen

kjxgmcs: Setting state to 0 1.

kjfcpiora: publish my weight 3122016

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 18 2.

Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 18 3.

Name Service recovery started

Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 18 4.

Multicasted all local name entries for publish

Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 18 5.

Name Service normal

Name Service recovery done

*** 2011-10-18 07:09:02.652

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 18 6.

*** 2011-10-18 07:09:02.652

*** 2011-10-18 07:09:02.653

Reconfiguration started (old inc 0, new inc 18)

Synchronization timeout interval: 660 sec  ====> synchronization timeout is 660s

List of nodes:

0 1                                       ====> cluster node numbers

Global Resource Directory frozen

node 0

release 9 2 0 7                            ====> oracle version

node 1

release 9 2 0 7

res_master_weight for node 0 is 3122016

res_master_weight for node 1 is 3122016

Total master weight = 6244032

Dead  inst

Join  inst 0 1

Exist inst

Active Sendback Threshold = 50 %

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Resources and enqueues cleaned out

Resources remastered 0

0 GCS shadows traversed, 0 cancelled, 0 closed

0 GCS resources traversed, 0 cancelled

set master node info

Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 0 = 0

Update rdomain variables

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

*** 2011-10-18 07:09:11.515

0 GCS shadows traversed, 0 replayed, 0 unopened

Submitted all GCS cache requests

0 write requests issued in 1518892 GCS resources

10 PIs marked suspect, 0 flush PI msgs

*** 2011-10-18 07:09:27.724

Reconfiguration complete

*** 2011-10-18 07:09:35.232

kjxgrtmc2: Member 1 thread 2 mounted

*** 2011-10-18 10:51:59.470                   ====> this timestamp is important

kjxgrcomerr: Communications reconfig: instance 0 (18,18)

Submitting asynchronized dump request [2]

kjxgrrcfgchk: Initiating reconfig, reason 3   ====> communications failure

*** 2011-10-18 10:52:04.499

kjxgmrcfg: Reconfiguration started, reason 3

kjxgmcs: Setting state to 18 0.

*** 2011-10-18 10:52:04.504

Name Service frozen

kjxgmcs: Setting state to 18 1.

*** 2011-10-18 10:52:04.514

Obtained RR update lock for sequence 18, RR seq 18

*** 2011-10-18 10:53:45.311                   ====> split-brain occurs

kjxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19

*** 2011-10-18 11:03:45.315

Voting results, upd 1, seq 19, bitmap: 0

*** 2011-10-18 11:03:45.315

kjxgrdtrt: Evicted by 1, seq (19, 19)

IMR state information                         ====> IMR triggered

Member 1, thread 2, state 4, flags 0x00a1

RR seq 19, propstate 3, pending propstate 0

rcfg rsn 3, rcfg time 21546700, mem ct 2

master 1, master rcfg time 21546700

Member information:

Member 0, incarn 18, version 1840361946

thrd 1, prev thrd 65535, status 0x0007, err 0x0002

valct 0

Member 1, incarn 18, version 816352

thrd 2, prev thrd 65535, status 0x0107, err 0x0000

valct 2

Group name: SIEBDB

Member id: 1

Cached SKGXN event: 0

Group State:

State: 18 1

Commited Map: 0 1

New Map: 0 1

SKGXN Map: 0 1

Master node: 0

Memcnt 2  Rcvcnt 0

Substate Proposal: false

Inc Proposal:

incarn 0  memcnt 0  master 0

proposal false  matched false

map:

Master Inc State:

incarn 0  memcnt 0  agrees 0  flag 0x1

wmap:

nmap:

ubmap:

Submitting asynchronized dump request [1]

error 29740 detected in background process

ORA-29740: evicted by member 1, group incarnation 19

Looking at the overall timeline, the interval from the onset of the split-brain condition to the completion of its resolution was roughly 15 minutes, approximately from 10:51 to 11:07.
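To double-check these durations, here is a minimal Python sketch that simply recomputes the intervals between the key timestamps taken from node 2's alert log and the LMON trace above (the event labels are my own shorthand):

```python
from datetime import datetime

# Key timestamps for 2011-10-18, taken from the node 2 alert log / LMON trace above.
events = [
    ("IPC send timeout",                   "10:51:57"),
    ("Communications reconfiguration",     "10:51:59"),
    ("Waiting for split-brain resolution", "10:53:45"),
    ("Instance evicted / terminated",      "11:03:45"),
    ("Reconfiguration complete",           "11:08:22"),
]

parsed = [(name, datetime.strptime(t, "%H:%M:%S")) for name, t in events]
start = parsed[0][1]

for name, t in parsed:
    print(f"{name:<35} +{(t - start).total_seconds():>6.0f}s")

# Waiting for split-brain resolution -> eviction (10:53:45 -> 11:03:45) is exactly 600s,
# and the whole episode (10:51:57 -> 11:08:22) lasts about 985s, i.e. roughly 16 minutes.
```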

The reconfiguration reason codes have the following meanings (a small helper for scanning traces follows the list):

Reason 0 = No reconfiguration

Reason 1 = The Node Monitor generated the reconfiguration.

Reason 2 = An instance death was detected.

Reason 3 = Communications Failure

Reason 4 = Reconfiguration after suspend
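For convenience, here is a small Python sketch that scans an LMON trace for reconfiguration lines like the ones above and annotates each reason code with its meaning (the trace file name in the usage comment is hypothetical):

```python
import re

# Reason codes listed above.
REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications failure",
    4: "Reconfiguration after suspend",
}

# Matches lines such as:
#   kjxgmrcfg: Reconfiguration started, reason 1
#   kjxgrrcfgchk: Initiating reconfig, reason 3
PATTERN = re.compile(r"reconfig(?:uration started)?, reason\s+(\d+)", re.IGNORECASE)

def annotate_trace(path):
    """Print every reconfiguration line in the trace with the reason spelled out."""
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                code = int(m.group(1))
                print(f"{line.rstrip()}  <== {REASONS.get(code, 'unknown reason')}")

# Usage (hypothetical trace file name):
# annotate_trace("/orarac/app/oracle/admin/siebdb/bdump/siebdb2_lmon_<pid>.trc")
```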

If the 660s mentioned earlier were really the heartbeat timeout threshold, this failure should not have occurred.

According to a colleague, the clocks of the RAC cluster nodes differed by about 3 minutes, and in the early hours of that day the hardware vendor had been replacing a battery.

Seen in that light, we can basically conclude that the split-brain was caused by the node clocks being out of sync. We also know that when a third-party cluster product (an HA suite) is in place, I/O fencing is handled by that software, so I still do not know what the exact threshold is here.

In other words, how many seconds out of sync does it take before a split-brain occurs? Could it really be the 660s mentioned earlier?

Finally, a supplementary note: LMON mainly checks the following two kinds of heartbeat (a conceptual sketch follows the list):

1. Network heartbeat: put simply, the cluster heartbeat (presumably implemented via a ping mechanism).

2. Controlfile heartbeat: the CKPT process updates the controlfile every 3 seconds, and this is used to judge whether an instance is still alive.
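To illustrate why a 3-minute clock difference matters for this kind of check, here is a minimal conceptual sketch. It is not Oracle's actual implementation: the 3-second update interval comes from the text above, while the timeout value is just an assumed placeholder. The point is that a liveness check comparing a timestamp written under the peer's clock against the local clock sees an artificial lag equal to the skew:

```python
import time

TIMEOUT = 60           # seconds: assumed placeholder for the eviction threshold
CLOCK_SKEW = 180       # seconds: the reported 3-minute offset between the two nodes

def peer_heartbeat(peer_clock_now):
    """The peer stamps its heartbeat (written every ~3s) with *its own* clock."""
    return peer_clock_now

def peer_looks_dead(last_heartbeat, local_clock_now):
    """The local node compares the peer's stamp against the *local* clock."""
    return (local_clock_now - last_heartbeat) > TIMEOUT

local_now = time.time()
peer_now = local_now - CLOCK_SKEW     # the peer's clock runs 3 minutes behind
stamp = peer_heartbeat(peer_now)      # a heartbeat written just now, under the skewed clock

print(peer_looks_dead(stamp, local_now))
# -> True: the fresh heartbeat already appears 180s stale, so any skew larger than
#    the timeout makes a healthy peer look dead and can trigger a false eviction.
```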

