oracle宕了,一次Oracle rac宕机分析

最新推荐文章于 2021-04-04 06:20:21 发布

时常抠脚的隔壁老

最新推荐文章于 2021-04-04 06:20:21 发布

阅读量360

点赞数

文章标签： oracle宕了

某客户rac数据库2号节点实例自动宕节点，以下为分析报告一、现象回顾：2号节点发生故障时，alert日志显示如下：Thread 2 advanced to log sequence 77740 (LGWR switch)

Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2G

Current log# 24 seq# 77740 mem# 1: /dev/rcrm4_rd2_92_2G

Mon Mar 28 09:45:23 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:46:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:47:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:47:53 BEIST 2011

Thread 2 advanced to log sequence 77741 (LGWR switch)

Current log# 25 seq# 77741 mem# 0: /dev/rcrm5_rd2_101_2G

Current log# 25 seq# 77741 mem# 1: /dev/rcrm5_rd2_102_2G

Mon Mar 28 09:48:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:49:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:50:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:51:52 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:52:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:53:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:54:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:55:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:56:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:57:53 BEIST 2011

PMON failed to acquire latch, see PMON dump

Mon Mar 28 09:58:44 BEIST 2011

Received an instance abort message from instance 1 (reason 0x0)

Please check instance 1 alert and LMON trace files for detail.

LMD0: terminating instance due to error 481

Mon Mar 28 09:58:45 BEIST 2011

System state dump is made for local instance

System State dumped to trace file /opt/oracle/admin/crmdb/bdump/crmdb2_diag_3732132.trc

Mon Mar 28 09:58:45 BEIST 2011

Shutting down instance (abort)

License high water mark = 2290从alert日志来看，二号节点主要表现为PMON进程不能获得latch(PMON failed to acquire latch)，超时等待10分钟，之后接受到一号节点将二号节点宕机指令(Received an instance abort message from instance 1 (reason 0x0))，LMD0进程将实例abort(LMD0: terminating instance due to error 481)，并在宕机之前系统自动做了systemstate dump(System state dump is made for local instance)，宕机之前用户在2节点的会话数为2290个(License high water mark = 2290)。采用ass.awk工具分析crmdb2_diag_3732132.trc可以得到如下结果：Starting Systemstate 1

.........................................

Ass.Awk Version 1.0.9 - Processing crmdb2_diag_3732132.trc

System State 1

~~~~~~~~~~~~~~~~

1:

2: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]

3: last wait for 'DIAG idle wait'

4: waiting for 'rdbms ipc message'

5: last wait for 'ges lmd/lmses to freeze in rcfg - mrcvr'

6: last wait for 'ges remote message'

7: waiting for 'gc object scan'

8: last wait for 'gc object scan'

9: last wait for 'gc object scan'

10: last wait for 'gc object scan'

11: waiting for 'gc object scan'

12: last wait for 'gc object scan'

13: last wait for 'gc object scan'

14: last wait for 'gc object scan'

15: waiting for 'latch: object queue header operation'[Latch 700000d1fc80b20]

16: waiting for 'latch: cache buffers chains'[Latch 700000cdf26c988]

17: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]

18: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]

19: last wait for 'gc object scan'

20: last wait for 'gc object scan'

21: last wait for 'gc object scan'

22: waiting for 'latch: cache buffers chains'[Latch 700000cea8a6a80]

23: waiting for 'rdbms ipc message'

24: waiting for 'gcs drm freeze in enter server mode'

25: waiting for 'gcs drm freeze in enter server mode'

26: waiting for 'gcs drm freeze in enter server mode'

27: waiting for 'gcs drm freeze in enter server mode'

28: waiting for 'gcs drm freeze in enter server mode'

29: waiting for 'gcs drm freeze in enter server mode'

30: waiting for 'gcs drm freeze in enter server mode'

31: waiting for 'gcs drm freeze in enter server mode'

32: waiting for 'rdbms ipc message'

33: waiting for 'rdbms ipc message'

34: last wait for 'smon timer'

35: waiting for 'rdbms ipc message'

36: waiting for 'rdbms ipc message'

37: waiting for 'rdbms ipc message'

38: waiting for 'rdbms ipc message'

39:

40:

41: waiting for 'rdbms ipc message'

Blockers

~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource

then it will be given in square brackets. Below is a summary of the

waited upon resources, together with the holder of that resource.

Notes:

~~~~~

o A process id of '???' implies that the holder was not found in the

systemstate.

Resource Holder State

Latch 700000d1fe74fd0 ??? Blocker

Latch 700000d1fc80b20 ??? Blocker

Latch 700000cdf26c988 ??? Blocker

Latch 700000cea8a6a80 ??? Blocker

Object Names

~~~~~~~~~~~~

Latch 700000d1fe74fd0 Child object queue header operation

Latch 700000d1fc80b20 holding (efd=4) 700000d1fc80b20 Child obj

Latch 700000cdf26c988 Child cache buffers chains

Latch 700000cea8a6a80 Child cache buffers chains

762178 Lines Processed.进一步通过分析crmdb2_diag_3732132.trc文件(分析过程详见附录)，进程之间有如下关系：(1)当故障发生时，orapid=7占用着latch: object queue header operation，addr=700000d1fc80b20，orapid=8,10,11,12,13,14,15,19,21处于等待状态，且会话均为dead状态。(2)当故障发生时，orapid=9,20占用着latch: object queue header operation，addr=700000d1fe74fd0，orapid=2，17,18处于等待状态，其中2进程为alive，17,18进程均为dead状态。(3)当故障发生时orapid=16等待latch: cache buffers chains，addr=700000cdf26c988，进程为dead状态(4)当故障发生时orapid=12等待latch: cache buffers chains，addr=700000cdf26c988，进程为dead状态。(5)值得注意的是故障发生时dbwr进程均处于dead状态，等待事件如下：Dumping Session Wait History

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488295 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488293 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488291 sec

=1, =0, =0

for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec故障发生时各日志文件显示如下：(1)1号节点日志信息：crmdb1_lmon_4579776.trc显示，当1号节点LMON进程检测LMD进程到存在异常时(超时900秒)，即向2号节点发起请求(requesting memberkill of instances w/o ftdones)，2号节点收到之后由LMD完成宕库操作。*** 2011-03-28 09:58:43.010

* kjfcdrmrcfg: waited 901 secs (6501391,6502292,900) for lmd to receive all ftdones, requesting memberkill of instances w/o ftdones:

kjfsprn: sync status inst 0 tmout 900 (sec)

kjfsprn: sync propose inc 4 level 691378

kjfsprn: dmap ver 4 (step 0)

kjfsprn: sync inc 4 level 691378

kjfsprn: sync bitmap 0 1

DEFER MSG QUEUE ON LMD0 IS EMPTY

SEQUENCES:

0:0.0 1:210167409.0

alert_crmdb1.log显示：* kjfc_kill_no_sync: send abort msg to node 1 (noftdn)

Mon Mar 28 09:58:45 BEIST 2011(2)2号节点信息：crmdb2_lmd0_2154764.trc显示：2号节点lmd进程收到指令，准备宕节点，宕节点之前(5秒钟)使用diag进程进行systemstate dump。*** 2011-03-28 09:58:44.845

Received a special admin message (type 2) from node 0

Message body: flag 0x5 data 0x0 0x0 0x0

Abort the instance

ksuitm: waiting up to [5] seconds before killing DIAG(3732132)二、故障分析LMS(GLOBAL CACHE SERVICE PROCESS)进程主要用来管理集群内数据块的访问，并在不同实例的BUFFER CACHE中传输块镜像。LMS进程跨集群管理数据库的请求，并保证在所有实例的BUFFER CACHE中一个数据块的镜像只能出现一次。LMS进程靠着在实例中传递消息来协调数据块的访问，当一个实例请求数据块时，该实例的LMD进程发出一个数据块资源的请求，该请求只向MASTER数据块的实例的LMD进程，MASTER实例的LMD进程同时正在使用的实例的LMD进程释放该资源，这时拥有该资源的实例的LMS进程会创建一个数据块镜像的一致性读，然后把该数据块传递到请求该资源的实例的BUFFER CACHE中。LMS进程保证了在每一时刻只能允许一个实例去更新数据块，并负责保持该数据块的镜像纪录(包含更新数据块的状态FLAG)。在故障发生时，由进程状态关系图，可以看到LMS进程处于latch free(latch: object queue header operation和latch: cache buffers chains)且状态为dead状态(LMS0除外)。也就意味着1,2号节点之间block传输出现了问题。由于PMON进程和LMS进程相互争用latch(latch: object queue header operation),进而PMON获取不了latch，出现超时等待10分钟，也就无法恢复有问题的进程。LMD(GLOBAL ENQUEUE SERVICE DAEMON)进程主要管理对全局队列和资源的访问，并更新相应队列的状态，处理来自于其他实例的资源请求。当2号节点LMS进程出现问题时，无法更新相应队列状态，由日志可以看出2号节点处于空闲事件等待状态(ges remote message)LMON(GLOBAL ENQUEUE SERVICE MONITOR)进程主要监测群集内的全局队列和全局资源，管理实例和处理异常并相应的群集队列进行恢复操作。当1号节点LMON进程检测到2号节点出现异常时，发送killmember指令，由2号节点lmd进程完成abort操作。我们同时注意到故障发生时，ckpt，dbwr进程均处于dead状态，为保证一致性处于pending i/o状态，dbwr等待事件为gcs drm freeze in enter server mode，该等待事件在Oracle有如下解释(metalink doc：4755405.8)：Under heavy load, DRM sync can take very long as LMON is waiting for LMS

processes to empty their send queues. This wait is unnecessary as

correctness is already guaranteed by the sync channel flush mechanism.

This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).也就是在在高负载到情况下，尤其是需要节点之间大规模传递一致性块时，DRM sync功能LMON会等待LMS进程清空发送队列，当然在LMS进程清空发送队列时可能会引起latch free等待，进而引起GES问题，而Oracle GES(GES包括LMON和LMD进程)检测到有问题时，将会abort实例(metalink：ID 9920699.8)。Oracle还指出DRM功能可能会超过5分钟。针对这一现象，Oracle定位为bug：6960699三、解决办法：屏蔽DRM功能，设置如下隐含参数_gc_affinity_time=0

_gc_undo_affinity=FALSE附录：经过awk格式化之后的文件可以看出latch 700000d1fc80b20成为首号怀疑对象，进而在跟踪文件crmdb2_diag_3732132.trc寻找相关信息。PROCESS 7:

----------------------------------------

SO: 700000d64f4f4c0, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=7, calls cur/top: 700000d69842360/700000d6983c508, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 24

last post sent-location: ksasnd

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=20

holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000009)

waiters [orapid (seconds since: put on list, posted, alive check)]:

15 (2, 1301277525, 2)

waiter count=1

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3830616

OSD pid info: Unix process pid: 3830616, image: oracle@crmdb02 (LMS0)从红色标注可以看出process=7(LMS0)进程holding child latch#=7，addr=700000d1fc80b20的object queue header operation锁，从而导致了process=15进程等待。查看PROCESS=15状态：PROCESS 15:

----------------------------------------

SO: 700000d8cf24760, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=15, calls cur/top: 700000d698433f8/700000d6983db38, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=0

Location from where call was made: kclobj_2:

waiting for 700000d1fc80b20 Child object queue header operation level=5 child#=77

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000009)

waiters [orapid (seconds since: put on list, posted, alive check)]:

15 (2, 1301277525, 2)

waiter count=1

gotten 688460698 times wait, failed first 3268358 sleeps 1757249

gotten 19572 times nowait, failed: 109

possible holder pid = 7 ospid=3830616

on wait list for 700000d1fc80b20

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3260452 (DEAD)

OSD pid info: Unix process pid: 3260452, image: oracle@crmdb02 (LMS8)从红色标注可以看出process=15(LMS8)进程正在等待child latch#=7，addr=700000d1fc80b20，且进程已处于DEAD状态。用操作系统命令查看，发现只有一个会话等待此child latch

[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "waiting for 700000d1fc80b20" |wc -l

1进一步查看共有9个会话holding此child latch。[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "holding (efd=4) 700000d1fc80b20" |wc -l

9查看PROCESS=8状态。PROCESS 8:

----------------------------------------

SO: 700000d8cf23f70, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=8, calls cur/top: 700000d69842620/700000d6983c7c8, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=20

holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000009)

waiters [orapid (seconds since: put on list, posted, alive check)]:

15 (2, 1301277525, 2)

waiter count=1

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3503016 (DEAD)

OSD pid info: Unix process pid: 3503016, image: oracle@crmdb02 (LMS1)可以看出process=8(LMS1)状态等同于process=15。查看PROCESS=9，PROCESS=20状态PROCESS 9:

----------------------------------------

SO: 700000d60f24a78, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=9, calls cur/top: 700000d698428e0/700000d6983caa0, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=20

holding (efd=4) 700000d1fe74fd0 Child object queue header operation level=5 child#=91

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000002)

waiters [orapid (seconds since: put on list, posted, alive check)]:

18 (2, 1301277525, 2)

17 (2, 1301277525, 2)

waiter count=2

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 2982786 (DEAD)

OSD pid info: Unix process pid: 2982786, image: oracle@crmdb02 (LMS2)可以看到process=9(LMS2), process=20(LMSd),会话holding了addr=700000d1fe74fd0 child#91的child latch，进而导致了process=17(LMSa)和process=18(LMSb)等待。查看PROCESS=17状态PROCESS 17:

----------------------------------------

SO: 700000d5cf23598, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=17, calls cur/top: 700000d67b9d3f8/700000d6983e0d0, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=0

Location from where call was made: kclobj_2:

waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000002)

waiters [orapid (seconds since: put on list, posted, alive check)]:

18 (2, 1301277525, 2)

17 (2, 1301277525, 2)

waiter count=2

gotten 687558442 times wait, failed first 3419261 sleeps 1918111

gotten 19762 times nowait, failed: 151

possible holder pid = 9 ospid=2982786

on wait list for 700000d1fe74fd0

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 295048 (DEAD)

OSD pid info: Unix process pid: 295048, image: oracle@crmdb02 (LMSa)查看PROCESS=18状态PROCESS 18:

----------------------------------------

SO: 700000d58f42690, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=18, calls cur/top: 700000d69843f10/700000d6983e390, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=0

Location from where call was made: kclobj_2:

waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000002)

waiters [orapid (seconds since: put on list, posted, alive check)]:

18 (2, 1301277525, 2)

17 (2, 1301277525, 2)

waiter count=2

gotten 687558442 times wait, failed first 3419261 sleeps 1918111

gotten 19762 times nowait, failed: 151

possible holder pid = 9 ospid=2982786

on wait list for 700000d1fe74fd0

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3428506

OSD pid info: Unix process pid: 3428506, image: oracle@crmdb02 (LMSb)可以看到process=17,18均等待addr=700000d1fe74fd0 child latch。PROCESS=10(LMS3)，PROCESS=11(LMS4)，PROCESS=12(LMS5)，PORCESS=13(LMS6)，PROCESS=14(LMS7),PROCESS=15(LMS8)，PROCESS=19(LMSc)，PROCESS=21(LMSe)状态均为类似状态，即进程holding child latch#=7，addr=700000d1fc80b20的object queue header operation锁，从而导致了process=15进程等待。查看PROCESS=16状态PROCESS 16:

----------------------------------------

SO: 700000d60f25268, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=16, calls cur/top: 700000d69843990/700000d6983ddf8, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 98

last post sent-location: kjbrcvdscn: scn acked

last process posted by me: 700000d58f43670 1 6

(latch info) wait_event=0 bits=0

Location from where call was made: kclgrantlk_2:

waiting for 700000cdf26c988 Child cache buffers chains level=1 child#=162223

Location from where latch is held: kcbchg: kslbegin: bufs not pinned:

Context saved from call: 562086705

state=busy(exclusive) (val=0x200000000000086f) holder orapid = 2159

waiters [orapid (seconds since: put on list, posted, alive check)]:

16 (140, 1301277525, 2)

waiter count=1

gotten 4188007 times wait, failed first 98161 sleeps 554

gotten 461256 times nowait, failed: 124

possible holder pid = 2159 ospid=3392358

on wait list for 700000cdf26c988

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3560010 (DEAD)查看process=22状态PROCESS 22:

----------------------------------------

SO: 700000d8cf24f50, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=22, calls cur/top: 700000d67b9dc38/700000d6983eea8, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 94

last post received-location: KJCS Post snd proxy to flush msg

last process to post me: 700000d58f43670 1 6

last post sent: 0 0 135

last post sent-location: kclcget

last process posted by me: 700000d60f53498 3 1

(latch info) wait_event=0 bits=0

Location from where call was made: kclexpandlock:

waiting for 700000cea8a6a80 Child cache buffers chains level=1 child#=61810

Location from where latch is held: kcbzgb: scan from tail. nowait:

Context saved from call: 0

state=busy(exclusive) (val=0x2000000000000456) holder orapid = 1110

waiters [orapid (seconds since: put on list, posted, alive check)]:

22 (279, 1301277525, 2)

waiter count=1

gotten 5121038 times wait, failed first 15184 sleeps 753

gotten 586404 times nowait, failed: 1427022

possible holder pid = 1110 ospid=2978810

on wait list for 700000cea8a6a80

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 3109298 (DEAD)

OSD pid info: Unix process pid: 3109298, image: oracle@crmdb02 (LMSf)最后检查process=2状态PROCESS 2:

----------------------------------------

SO: 700000d60f24288, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=2, calls cur/top: 700000d6983b730/700000d6983b730, flag: (e) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 504403214633274144 133 2

last post received-location: kslges

last process to post me: 700000d5cf85938 137 1

last post sent: 0 0 121

last post sent-location: kcbzww

last process posted by me: 700000d58f7f6e0 198 1

(latch info) wait_event=0 bits=6

Location from where call was made: kcbw_unlink_q:

waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91

Location from where latch is held: kclobj_2:

Context saved from call: 0

state=busy(shared) (val=0x4000000000000002)

waiters [orapid (seconds since: put on list, posted, alive check)]:

18 (2, 1301277525, 2)

17 (2, 1301277525, 2)

waiter count=2

gotten 687558442 times wait, failed first 3419261 sleeps 1918111

gotten 19762 times nowait, failed: 151

possible holder pid = 9 ospid=2982786

holding (efd=8) 700000cf3a6e170 Child cache buffers lru chain level=2 child#=91

Location from where latch is held: kcbzfw: buffer header

Context saved from call: 0

state=busy, wlstate=free

waiters [orapid (seconds since: put on list, posted, alive check)]:

63 (50, 1301277525, 2)

waiter count=1

holding (efd=8) 700000ce2a79ef0 Child cache buffers chains level=1 child#=81019

Location from where latch is held: kcbso1: set no access:

Context saved from call: 0

state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2

Process Group: DEFAULT, pseudo proc: 700000d6919aea0

O/S info: user: oracle, term: UNKNOWN, ospid: 1274566

OSD pid info: Unix process pid: 1274566, image: oracle@crmdb02 (PMON)

时常抠脚的隔壁老

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
oracle宕了,一次Oracle rac宕机分析

某客户rac数据库2号节点实例自动宕节点，以下为分析报告一、现象回顾：2号节点发生故障时，alert日志显示如下：Thread 2 advanced to log sequence 77740 (LGWR switch)Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2GCurrent log# 24 seq# 77740 mem# 1: ...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。