SystemState分析案例(一)

这是1个09年的故障,数据库无法登陆,杀了发现有问题的进程,以及所有LOCAL=NO的还是不行,最后直接重启了库
今天想总结下systemstate,把以前的systemstate拿出来重新分析整理了下,发现了root cause

通过分析trc,发现大量的进程正在等待latch:library cache,一般这是严重的解析问题造成。但是从数据库以前的awr看,没有严重的解析问题

获得systemstate,随意找了个正在等待latch:library cache的进程看,其在请求7000001302e0170 这个Child library cache,可能的阻塞者为Process 51

      waiting for 7000001302e0170 Child library cache level=5 child#=9
        Location from where latch is held: kghfrunp: clatch: wait:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           93 (942, 1258509046, 2)
       ......
           8 (735, 1258509046, 2)
           waiter count=28
          gotten 444254233 times wait, failed first 66211 sleeps 14231
          gotten 6228862 times nowait, failed: 40880
        possible holder pid = 51 spid=1331410
      on wait list for 7000001302e0170

那么是谁持有 7000001302e0170 这个 latch?搜索 [ hold.*7000001302e0170 ],发现Process 51持有7000001302e0170,并且,其在请求7000000100e5020 这个 1# child shared pool latch,可能的阻塞者为Process 92

PROCESS 51:
  ----------------------------------------
  SO: 70000013550d080, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=51, calls cur/top: 0/7000001000e1260, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=20
        Location from where call was made: kghfrunp: alloc: session dur:
      waiting for 7000000100e5020 Child shared pool level=7 child#=1
        Location from where latch is held: kgh: add extent to reserved list:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           27 (915, 1258509034, 2)
           ......
           629 (468, 1258509034, 2)
           waiter count=259
          gotten 1006956901 times wait, failed first 239663 sleeps 97559
          gotten 0 times nowait, failed: 0
        possible holder pid = 92 spid=929882
      on wait list for 7000000100e5020
      holding    (efd=6) 7000001302e0170 Child library cache level=5 child#=9
        Location from where latch is held: kghfrunp: clatch: wait:

搜索 [ hold.*7000000100e5020 ],没有发现谁持有该latch,那么看看可能的阻塞者Process 92

PROCESS 92:
  ----------------------------------------
  SO: 70000013850e348, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=92, calls cur/top: 700000118a884b8/700000118a884b8, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 187
              last post received-location: kglpndl: post after freeing latch
              last process to post me: 7000001384f80e8 1 14
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=80
      holding    (efd=9) 700000010021a20 Parent+children shared pool level=7
        Location from where latch is held: kghfrunp: alloc: clatch nowait:
        Context saved from call: 0
        state=busy, wlstate=free
    Process Group: DEFAULT, pseudo proc: 7000001385b4978
    O/S info: user: oracle, term: UNKNOWN, ospid: 929882
    OSD pid info: Unix process pid: 929882, image: oracleXXXXX@oracle
    Short stack dump: unable to dump stack due to error 72
    ----------------------------------------
    SO: 7000001386a6978, type: 4, owner: 70000013850e348, flag: INIT/-/-/0x00
    (session) sid: 371 trans: 0, creator: 70000013850e348, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0001-005C-00006C20, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 70000007be42f48, psql: 700000136139f00, user: 196/XXXXX5
    O/S info: user: , term: , ospid: 1234, machine: was0
              program:
    last wait for 'latch: shared pool' blocking sess=0x0 seq=284 wait_time=283476 seconds since wait started=922
                address=7000000100e5020, number=d5, tries=1d
    Dumping Session Wait History
     for 'latch: shared pool' count=1 wait_time=283476
                address=7000000100e5020, number=d5, tries=1d
     for 'latch: shared pool' count=1 wait_time=292978
                address=7000000100e5020, number=d5, tries=1c
     for 'latch: shared pool' count=1 wait_time=292977
                address=7000000100e5020, number=d5, tries=1b
     for 'latch: shared pool' count=1 wait_time=292978
                address=7000000100e5020, number=d5, tries=1a
     for 'latch: shared pool' count=1 wait_time=292978
                address=7000000100e5020, number=d5, tries=19
     for 'latch: shared pool' count=1 wait_time=292977
                address=7000000100e5020, number=d5, tries=18
     for 'latch: shared pool' count=1 wait_time=292978
                address=7000000100e5020, number=d5, tries=17
     for 'latch: shared pool' count=1 wait_time=292979
                address=7000000100e5020, number=d5, tries=16
     for 'latch: shared pool' count=1 wait_time=293120
                address=7000000100e5020, number=d5, tries=15
     for 'latch: shared pool' count=1 wait_time=294266
                address=7000000100e5020, number=d5, tries=14
    temporary object counter: 0

其持有Parent shared pool latch ,并且其请求过7000000100e5020这个1# child shared pool latch

当时我的判断为该进程自死锁,除非BUG,latch上应该是不会死锁的。不过杀了所有连接进程数据库也没有恢复,最后强制重启

今天问了下老熊,他说进程可以在请求父latch的同时,再去请求2个子latch
并且这里是Dumping Session Wait History,并且也没有waiting for,正在等待的信息,也就是说进程现在可能已经获得了该child latch。

接着我重新看了下trc,发现2# child shared pool latch 也有个大量进程在等待获得
并且,在这2个资源的wait list上,也没有发现有Process 92
那么就符合老熊说的可能性,进程同时请求了父latch和1#,2#子latch,并长时间占用不释放,从而阻塞了其他进程
  holding    (efd=9) 700000010021a20 Parent+children shared pool level=7
这个也应该是表明进程已经获得了父+子 shared pool latch

同时,我注意到,该进程在dump short stack报错
Short stack dump: unable to dump stack due to error 72
ORA-72的意思为进程已经不存在,无法dump。就是说这个时候进程已经异常。

既然进程已经出现异常,甚至死亡,那么这个时候PMON该出来干事情了啊,那么PMON这个时候在干什么?

PROCESS 2:
  ----------------------------------------
  SO: 7000001384f80e8, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=2, calls cur/top: 70000013583ce90/70000013583ce90, flag: (e) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 15
              last post received-location: ksupsc
              last process to post me: 7000001384f90a8 1 6
              last post sent: 0 0 187
              last post sent-location: kglpndl: post after freeing latch
              last process posted by me: 70000013850e348 205 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 7000001385b4978
    O/S info: user: oracle, term: UNKNOWN, ospid: 397448
    OSD pid info: Unix process pid: 397448, image: oracle@oracle (PMON)
    Short stack dump:
ksdxfstk+002c
    ----------------------------------------
    SO: 7000001388e24b0, type: 11, owner: 7000001384f80e8, flag: INIT/-/-/0x00
    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 7000001384f80e8,
                       event: 1, last message event: 13,
                       last message waited event: 13, messages read: 1
                       channel: (700000138919d38) scumnt mount lock
                                scope: 1, event: 22, last mesage event: 13,
                                publishers/subscribers: 0/13,
                                messages published: 1
    ----------------------------------------
    SO: 7000001387d49c0, type: 4, owner: 7000001384f80e8, flag: INIT/-/-/0x00
    (session) sid: 849 trans: 0, creator: 7000001384f80e8, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0001-0002-00000004, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    last wait for 'cursor: mutex X' blocking sess=0x0 seq=30347 wait_time=1 seconds since wait started=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccb92
    Dumping Session Wait History
     for 'cursor: mutex X' count=1 wait_time=1
                idn=ed27757c, value=22300000000, where|sleeps=5bccb95
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccb94
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccb9d
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccb9c
     for 'cursor: mutex X' count=1 wait_time=1
                idn=ed27757c, value=22300000000, where|sleeps=5bccb9b
     for 'cursor: mutex X' count=1 wait_time=1
                idn=ed27757c, value=22300000000, where|sleeps=5bccba4
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccba3
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccbac
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccbab
     for 'cursor: mutex X' count=1 wait_time=0
                idn=ed27757c, value=22300000000, where|sleeps=5bccbaa
    temporary object counter: 0

PMON曾经在cursor: mutex X上等待。但是这里没有waiting for的信息
但是从Short stack上看,PMON当前也应该正在请求mutex。
为了确认这一点,我搜索 [ mutex.*ed27757c ],确定PMON的确正在以EXCL模式请求该mutex,阻塞者为SID 547

      KGX Atomic Operation Log 700000136bed8d8
       Mutex 700000049974b98(547, 0) idn ed27757c oper GET_EXCL
       Cursor Parent uid 849 efd 8 whr 12 slp 23941
       per=kksAllocCursorStat [KKSSTALOC1] pt1=70000011cae22d0 pt2=70000011cae2338 pt3=0
       pt4=0 u41=0 stt=0

并且Process 111以EXCL模式持有该mutex

      KGX Atomic Operation Log 70000005eeb5f30
       Mutex 700000049974b98(547, 0) idn ed27757c oper EXCL
       Cursor Parent uid 547 efd 6 whr 2 slp 0
       per=PARENT_ALLOC_CHILD pt1=0 pt2=70000011c24bd90 pt3=0
       pt4=0 u41=0 stt=0

那么进程111在干啥子了?其在请求7000000100e5020,这个1# shared pool child latch

PROCESS 111:
  ----------------------------------------
  SO: 70000013551bcc0, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=111, calls cur/top: 700000118c44568/700000118c44568, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 187
              last post received-location: kglpndl: post after freeing latch
              last process to post me: 7000001384f80e8 1 14
              last post sent: 0 0 9
              last post sent-location: ksqrcl
              last process posted by me: 700000138510aa8 13 0
    (latch info) wait_event=0 bits=0
        Location from where call was made: kghalo:
      waiting for 7000000100e5020 Child shared pool level=7 child#=1
        Location from where latch is held: kgh: add extent to reserved list:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           27 (923, 1258509042, 1)
           ......
           629 (476, 1258509042, 1)
           waiter count=259
          gotten 1006956901 times wait, failed first 239663 sleeps 97559
          gotten 0 times nowait, failed: 0
        possible holder pid = 92 spid=929882
      on wait list for 7000000100e5020
    Process Group: DEFAULT, pseudo proc: 7000001385b4978
    O/S info: user: oracle, term: UNKNOWN, ospid: 680104
    OSD pid info: Unix process pid: 680104, image: oracleXXXXX@oracle
   
到这里现象很明显了:
  Process 92持有来父SP和1,2号子SP,但是持有后其出现了异常,所以其一直持有这些latch不释放。
  这时需要PMON来对这个进程进行清理,但是这个时候,PMON却在请求一个mutex,而该mutex又被Process 111持有
  Process 111又在请求1# child SP latch。该latch又被Process 92持有了
  这时就出现了一个latch + mutex的死锁

没有找到准确的BUG,但是有一些类似情况指出,可能有这样的BUG:
  PMON以X的模式请求一个mutex,而该mutex被一个死亡的进程持有,并且该进程正在等待PMON去清理

这样的情况下,用kill -9 杀掉111进程,那么将导致进程死亡,等待PMON去释放其持有的资源,而PMON却在等待111持有的mutex,还是不能解决问题
可能可以通过直接kill,让程序自然终止,自己释放持有的资源,但是这要有情况测试才知道

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/8242091/viewspace-736142/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/8242091/viewspace-736142/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值