红色警报 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启

      ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启。该问题是BUG 10194190 18740837

导致的,修复该问题需要打patch 25142535。有一客户的solaris oracle 11.2.0.4 rac所有节点的DB已经打上补丁,但是

主机运行248天后,跑在上边的Oracle 11.2.0.4依然因集群心跳丢失而重启。根据ORACLE MOS官方回复,修复该BUG

的一个补丁包 patch 18740837是需要同时在GI软件上应用。

      下边是相关客户案例的问题分析及解决处理总结。

       1、问题节点DB告警日志报错提示

    2、问题节点ASM告警日志报错提示

      3、 +ASM2_lmhb_4609_i78497.trc文件内容

      4、OCSSD日志提示心跳超时

      5、+ASM2_lmhb_4609_i78497.trc文件局部信息

===[ Session State Object ]===

  ----------------------------------------

  SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

   proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0

  (session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258

            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

            flags2: (0x409) -/-/INC

            DID: , short-term DID:

            txn branch: 0x0

            edition#: 0            oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS

  ksuxds FALSE at location: 0

  service name: SYS$BACKGROUND

  Current Wait Stack:

   0: waiting for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476951 seq_num=11521 snap_id=1

      wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec

      wait times: max=0.100000 sec, heur=2 min 46 sec

      wait counts: calls=1 os=1

      in_wait=1 iflags=0x5a8

  Wait State:

    fixed_waits=0 flags=0x22 boundary=0x0/-1

  Session Wait History:

      elapsed time of 0.000015 sec since current wait

   0: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476950 seq_num=11520 snap_id=1

      wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000138 sec of elapsed time

   1: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476949 seq_num=11519 snap_id=1

      wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000015 sec of elapsed time

   2: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476948 seq_num=11518 snap_id=1

      wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000133 sec of elapsed time

   3: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476947 seq_num=11517 snap_id=1

      wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

   4: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476946 seq_num=11516 snap_id=1

      wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000119 sec of elapsed time

   5: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476945 seq_num=11515 snap_id=1

      wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000012 sec of elapsed time

   6: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476944 seq_num=11514 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000152 sec of elapsed time

   7: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476943 seq_num=11513 snap_id=1

      wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000016 sec of elapsed time

   8: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476942 seq_num=11512 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000272 sec of elapsed time

   9: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476941 seq_num=11511 snap_id=1

      wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

  Sampled Session History of session 145 serial 1

  ---------------------------------------------------

  The sampled session history is constructed by sampling

  the target session every 1 second. The sampling process

  captures at each sample if the session is in a non-idle wait,

  an idle wait, or not in a wait. If the session is in a

  non-idle wait then one interval is shown for all the samples

  the session was in the same non-idle wait. If the

  session is in an idle wait or not in a wait for

  consecutive samples then one interval is shown for all

  the consecutive samples. Though we display these consecutive

  samples  in a single interval the session may NOT be continuously

  idle or not in a wait (the sampling process does not know).

  The history is displayed in reverse chronological order.

  sample interval: 1 sec, max history 120 sec

  ---------------------------------------------------

    [121 samples,                                          21:08:20 - 21:10:20]

      idle wait at each sample

  temporary object counter: 0

    ----------------------------------------

    Virtual Thread:

    kgskvt: 3fc434ce8, sess: 3ffdb38d8 sid: 145 ser: 1

    vc: 0, proc: 400b0c258, id: 145

    consumer group cur:  (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:

    vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1

    inwait: 0, short wait event: 0 posted_run: 0

    location where insched last set: kgskthrrun

    location where insched last cleared: kgskthrrun1

    location where inwait last set: NULL

    location where inwait last cleared: NULL

    is_assigned: 0, in_sched: 0 (0)

    qcls: 0, qlink: FALSE

    vt_active: 0 (pending: 0)

    vt_pq_active: 0, dop: 0

    used quanta (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu start time: 0

    idle time: 0, active time: 0 (cg: 0)

    cpu yields:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu wait time (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io wait time:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    ASL queued time outs: 0, time: 0 (cur 0, cg 0)

    PQQ queued time outs: 0, time: 0 (cur 0, cg 0)

    Queue timeout violation: 0

    calls aborted: 0, num est exec limit hit: 0

    undo current: 0k max: 0k

    I/O credits acquired:small=0 large=0

    I/O credits waiting for:small=0 large=0

    KTU Session Commit Cache Dump for IDLs:

    KTU Session Commit Cache Dump for Non-IDLs:

    ----------------------------------------

    KKS-UOL used : 0 locks(used=0, free=0)

    KGX Atomic Operation Log 3eae04a58

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04aa8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04af8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04b48

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    ----------------------------------------

    KGL-UOL SO Cache(total=0, free=0)

    KGX Atomic Operation Log 3eae047a0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae047f8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04850

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae048a8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04900

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04958

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae049b0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    ----------------------------------------

    SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

     proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0

    (dummy) nxc=0, nlb=0  

===[ Callstack ]===

*** 2019-05-21 21:10:20.275

Process diagnostic dump for oracle@sm0ora10 (LMON), OS id=4604,

pid: 9, proc_ser: 1, sid: 145, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

  0.000000 secs at [ 21:10:19 ]

    NOTE: scheduling delay has not been sampled for 0.505539 secs  0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg

  0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg

  0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg

*** 2019-05-21 21:10:21.374

loadavg : 20.48 16.00 13.13

swap info: free_mem = 194124.91M rsv = 258509.04M

           alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O     grid  4604  4596   1  79 20        ? 168184            Sep 15 ?         781:33 asm_lmon_+ASM2

Short stack dump:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

*** 2019-05-21 21:10:22.421

==============================

LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

----- End of Abridged Call Stack Trace -----

    6、主机启动时间

root@0ora10# hostname

ora10

root@ora10# uname -a

SunOS ora10 5.11 11.3 sun4v sparc sun4v

root@# uptime

 11:28pm  up 248 day(s) , 15:48,  8 users,  load average: 6.21, 7.27, 8.09

    6、根据ORA-15046和ORA-29770查询和TRC 堆栈调用信息提示:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380 ORACLE MOS官网,发现完全匹配上: (Doc ID 2159643.1)

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770 

    7、后续查看问题系统补丁安装情况,发现rac所有节点的DB已经应用上 Doc ID 2159643.1 提到的补丁 25142535

oracle@ora10$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora10$

oracle@ora11$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora11$

      但是GRID GI没有应用补丁  25142535

grid@ora10$ opatch lsinventory|grep  25142535

grid@ora10$ 

grid@ora11$ opatch lsinventory|grep  25142535

grid@ora11$ 

      8、就此问题咨询MOS,ORACLE MOS给出的官方回复也没明确提示补丁 25142535确实是否需要在GI上应用

      9、补丁 25142535是补丁 18740837和  10194190的合集,其中 18740837小补丁的readme中有提示RAC环境需要在GI应用

    10、问题分析结论:

    由目前的实际问题情况及8和9的信息汇总,可以确定 补丁 25142535需要在oracle 11.2.0.4 rac for solaris的所有节点的

GI集群软件上应用。可以肯定的是:当前的 oracle 11.2.0.4 rac for solaris asm and db crash and restart确实是rac在

solaris上运行248天导致的,在集群所有节点DB软件上打补丁25142535不能解决,官方也没有明确说在GI软件上同时应

用补丁25142535后到底能否解决:

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For >  248  Days by LMHB with ORA-29770 

    

    如果有类似问题的客户看到博文,如果有通过GI打补丁25142535解决此问题,或者通过其他途径解决此问题欢迎:

评论留言,感谢!

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/29357786/viewspace-2645319/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/29357786/viewspace-2645319/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值