oracle WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK问题分析

文章详细分析了一个OracleRAC数据库实例因WAITEDTOOLONGFORAROWCACHEENQUEUELOCK错误导致的HANG问题,指出在故障发生前一小时已存在cursor:pinSwaitonXLatch:rowcacheobjects等待,sharedpool从4800M收缩至2900M,且近2GB内存被KGH:NOACCESS占用。ALERT日志显示LCK0进程在等待与kghfrunp相关的latch,该功能涉及sharedpool内存释放。建议包括优化SGA内存参数、数据库组件设置以及关闭DRM相关参数以解决问题,并强调加强监控的重要性。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

  • 服务概述

财务系统出现业务卡顿,数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析:

故障发生时,未有主机监控数据,从可以获取的数据库AWR中,可以发现在故障发生前一小时,已经出现有较多cursor: pin S wait on X Latch: row cache objects等待;而从AWR报告中也可以看到问题时段的shared pool 从4800M收缩为了2900M,查看SGA组件变动使用情况,可以发现将近2个GB是被KGH: NO ACCESS所使用,即处于shrik/grow的中间状态。

此时ALERT日志中开始出现有>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误,查看此时的System State dumped TRACE文件,核心进程在等待latch: row cache objects资源,根据故障时ALERT日志中systemstate dump prod2_ora_3191.trc信息来看,以典型的LCK0进程为例,此进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects;kghfrunp(kghfrunp相关的功能是需要从shared pool中释放空间相关)有关的latch。综合考虑,从现在得到的信息来看,很可能发生问题时shared pool的使用出现了问题。

  • RAC数据库实例HANG分析

4.1 数据库实例日志的分析

4.1.1故障节点数据库实例Alert日志

节点1数据库实例的Alert日志

Fri May 31 12:11:03 2019

Incremental checkpoint up to RBA [0x16b38.55e5.0], current log tail at RBA [0x16b38.7949.0]

Fri May 31 12:31:15 2019

Incremental checkpoint up to RBA [0x16b38.d76f.0], current log tail at RBA [0x16b38.d7a1.0]

Fri May 31 12:38:24 2019

GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 12:49:04 2019

Beginning log switch checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639

Thread 2 advanced to log sequence 92985

  Current log# 8 seq# 92985 mem# 0: +CWDATA/prod/onlinelog/group_8.525.896335299

  Current log# 8 seq# 92985 mem# 1: +CWDATA/prod/onlinelog/group_8.628.896335301

Fri May 31 12:49:09 2019

LNS: Standby redo logfile selected for thread 2 sequence 92985 for destination LOG_ARCHIVE_DEST_2

Fri May 31 12:49:09 2019

Completed checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639

Fri May 31 12:51:21 2019

Incremental checkpoint up to RBA [0x16b39.19.0], current log tail at RBA [0x16b39.19.0]

Fri May 31 12:54:11 2019

GES: Potential blocker (pid=4099) on resource CI-00000046-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:02:58 2019

GES: Potential blocker (pid=3991) on resource DR-00000000-00000000;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:11:23 2019

Incremental checkpoint up to RBA [0x16b39.e1.0], current log tail at RBA [0x16b39.e1.0]

Fri May 31 13:14:13 2019

GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:25:34 2019

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=287

System State dumped to trace file /oracle/product/10.2.0/db/rdbms/log/prod2_ora_3191.trc

Fri May 31 13:31:25 2019

Incremental checkpoint up to RBA [0x16b39.1ad.0], current log tail at RBA [0x16b39.1ad.0]

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

对故障时间段的alert日志进行分析,可以发现当时节点1的ALERT日志中出现大量如下告警信息:

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=***

4.2 对dump信息的分析

节点2的ALERT日志中提到了大量System State dumped文件,对此类TRACE文件进行了分析。如下:

使用发生问题后产生的systemstate dump prod2_ora_3191.trc

[oracle@bys1 ~]$ awk -f ass109.awk prod2_ora_3191.trc



Starting Systemstate 1

.............................................

Ass.Awk Version 1.0.9 - Processing prod2_ora_3191.trc



System State 1

~~~~~~~~~~~~~~~~

1:                                      

2:  waiting for 'pmon timer'            wait

3:  waiting for 'DIAG idle wait'        wait

4:  waiting for 'rdbms ipc message'     wait

5:  waiting for 'rdbms ipc message'     wait

6:  waiting for 'ges remote message'    wait

7:  waiting for 'gcs remote message'    wait

8:  waiting for 'gcs remote message'    wait

9:  waiting for 'gcs remote message'    wait

10: waiting for 'gcs remote message'    wait

11: waiting for 'gcs remote message'    wait

12: waiting for 'gcs remote message'    wait

13: waiting for 'gcs remote message'    wait

14: waiting for 'gcs remote message'    wait

15: waiting for 'gcs remote message'    wait

16: waiting for 'gcs remote message'    wait

17: waiting for 'gcs remote message'    wait

18: waiting for 'gcs remote message'    wait

19: waiting for 'gcs remote message'    wait

20: waiting for 'gcs remote message'    wait

21: waiting for 'gcs remote message'    wait

22: waiting for 'gcs remote message'    wait

23: waiting for 'gcs remote message'    wait

24: waiting for 'gcs remote message'    wait

25: waiting for 'gcs remote message'    wait

26: waiting for 'gcs remote message'    wait

27: waiting for 'rdbms ipc message'     wait

28: waiting for 'rdbms ipc message'     wait

29: waiting for 'rdbms ipc message'     wait

30: waiting for 'rdbms ipc message'     wait

31: waiting for 'rdbms ipc message'     wait

32: waiting for 'rdbms ipc message'     wait

33: waiting for 'rdbms ipc message'     wait

34: waiting for 'rdbms ipc message'     wait

35: waiting for 'rdbms ipc message'     wait

36: waiting for 'rdbms ipc message'     wait

37: waiting for 'rdbms ipc message'     wait

38: waiting for 'rdbms ipc message'     wait

39: waiting for 'rdbms ipc message'     wait

40: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] wait

41: waiting for 'cursor: pin S wait on X' wait

42: waiting for 'latch: shared pool'   [Latch 33ba89b70] wait

43: waiting for 'rdbms ipc reply'       wait

44: waiting for 'rdbms ipc message'     wait

45: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] wait

Blockers

~~~~~~~~



        Above is a list of all the processes. If they are waiting for a resource

        then it will be given in square brackets. Below is a summary of the

        waited upon resources, together with the holder of that resource.

        Notes:

        ~~~~~

         o A process id of '???' implies that the holder was not found in the

           systemstate.



                    Resource Holder State

             Latch 33a5d5bd0    ??? Blocker

             Latch 33ba89b70    ??? Blocker



Object Names

~~~~~~~~~~~~

Latch 33a5d5bd0 Child row cache objects       

Latch 33ba89b70 Child library cache           





167446 Lines Processed.

 latch: row cache objects等待具体信息


LCK0进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'latch: row cache objects
 

PROCESS 45:

  ----------------------------------------

  SO: 0x357240660, type: 2, owner: (nil), flag: INIT/-/-/0x00

  (process) Oracle pid=45, calls cur/top: 0x33472fae8/0x33472fae8, flag: (6) SYSTEM

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 24

              last post received-location: ksasnd

              last process to post me: 35224fdd8 39 0

              last post sent: 0 0 87

              last post sent-location: kjmdrmchk: drm_hb ack

              last process posted by me: 355269da8 1 6

    (latch info) wait_event=0 bits=0

        Location from where call was made: kqrbip:

      waiting for 33a5d5bd0 Child row cache objects level=4 child#=16

        Location from where latch is held: kghfrunp: clatch: wait:  ====>>> 可以看到这个latch是因为kghfrunp(KGH: Ask client to free unpinned space)相关的功能发起的,而这个功能是需要从shared pool中释放空间有关

        Context saved from call: 0

        state=busy, wlstate=free

          waiters [orapid (seconds since: put on list, posted, alive check)]:

           50 (1, 1559280342, 1)

           45 (1, 1559280342, 1)

           waiter count=2

          gotten 540727585 times wait, failed first 2224239 sleeps 386931

          gotten 3781484 times nowait, failed: 419225

        possible holder pid = 223 ospid=30412

      on wait list for 33a5d5bd0

    Process Group: DEFAULT, pseudo proc: 0x3552a2308

    O/S info: user: oracle, term: UNKNOWN, ospid: 4099

    OSD pid info: Unix process pid: 4099, image: oracle@findbb (LCK0)

    Short stack dump: ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-<0x311720eb10>

此进程的历史等待事件:    

----------------------------------------

    SO: 0x3553c5998, type: 4, owner: 0x357240660, flag: INIT/-/-/0x00

    (session) sid: 1537 trans: (nil), creator: 0x357240660, flag: (51) USR/- BSY/-/-/-/-/-

              DID: 0000-002D-00000003, short-term DID: 0000-0000-00000000

              txn branch: (nil)

              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS

    waiting for 'latch: row cache objects' blocking sess=0x(nil) seq=30395 wait_time=0 seconds since wait started=0

                address=33a5d5bd0, number=c7, tries=2

    Dumping Session Wait History

     for 'latch: row cache objects' count=1 wait_time=41

                address=33a5d5bd0, number=c7, tries=1

     for 'latch: row cache objects' count=1 wait_time=422982

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: shared pool' count=1 wait_time=28

                address=600ebf80, number=d5, tries=0

     for 'latch: shared pool' count=1 wait_time=57

                address=600ebf80, number=d5, tries=0

     for 'latch: row cache objects' count=1 wait_time=135277

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: shared pool' count=1 wait_time=29

                address=600ebf80, number=d5, tries=1

     for 'latch: shared pool' count=1 wait_time=37

                address=600ebf80, number=d5, tries=0

     for 'latch: shared pool' count=1 wait_time=78

                address=600ebf80, number=d5, tries=0

     for 'latch: row cache objects' count=1 wait_time=47585

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: row cache objects' count=1 wait_time=45842

                address=33a5d5bd0, number=c7, tries=0

    temporary object counter: 0

      ----------------------------------------

      UOL used : 0 locks(used=0, free=0)

      KGX Atomic Operation Log 0x35afaaab8

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Cursor Pin uid 1537 efd 3 whr 11 slp 0

      KGX Atomic Operation Log 0x35afaab00

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Library Cache uid 1537 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 0x35afaab48

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Library Cache uid 1537 efd 0 whr 0 slp 0

      ----------------------------------------

      SO: 0x33c58c458, type: 41, owner: 0x3553c5998, flag:

4.3 数据库AWR信息分析

数据库实例在11点时已经HANG住,未生成AWR快照。因此使用9-10点的快照来查看故障发生前是否存在异常。

  

 

从AWR可以看到shared pool 从4800M收缩为了2900M,查看SGA组件变动使用情况,可以发现将近2个GB是被KGH: NO ACCESS所使用,即处于shrik/grow的中间状态。

  • 总结与建议

针对此次RAC数据库实例问题,总结及建议如下:

5.1 问题描述与分析

此套RAC上运行的数据库实例,数据库版本为10.2.0.3.0;

其中数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析:

故障发生时,未有主机监控数据,从可以获取的数据库AWR中,可以发现在故障发生前一小时,已经出现有较多cursor: pin S wait on X Latch: row cache objects等待;而从AWR报告中也可以看到问题时段的shared pool 从4800M收缩为了2900M,查看SGA组件变动使用情况,可以发现将近2个GB是被KGH: NO ACCESS所使用,即处于shrik/grow的中间状态。

此时ALERT日志中开始出现有>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误,查看此时的System State dumped TRACE文件,核心进程在等待latch: row cache objects资源,根据故障时ALERT日志中systemstate dump prod2_ora_3191.trc信息来看,以典型的LCK0进程为例,此进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects;kghfrunp(kghfrunp相关的功能是需要从shared pool中释放空间相关)有关的latch。综合考虑,从现在得到的信息来看,很可能发生问题时shared pool的使用出现了问题。

5.2 后续处理方法与建议

1. 数据库SGA内存参数的优化

当前节点1主机内存大小为32GB,节点2由于更换过硬件设备,内存为128GB。

因此之前出于两节点参数统一未对节点2的内存参数进行增加,使用的是sga_target= 12884901888,即12GB。

建议将两节点内存统计设置为SGA设置为节点1内存的一半大小,即16GB。(注意此操作需要重启数据库需要安排停机维护时间窗口进行

2.数据库buffer cache及shared pool组件优化

建议设置最小的buffer cache及shared pool的值

这是用来设置下限值,使SGA自动管理时不会调整到低于此值。具体值应结合历史AWR报告中Report Summary部分的Cache Sizes大小及Advisory Statistics 部分建议的 Buffer Pool Advisory/Shared Pool Advisory值。使用的命令如下:(注意在非业务高峰时或安排停机维护时间窗口进行

SQL> ALTER SYSTEM SET DB_CACHE_SIZE=n SCOPE=SPFILE;

SQL> ALTER SYSTEM SET SHARED_POOL_SIZE=m SCOPE=SPFILE;

建议可以设置为BUFFER CACHE 10GB,SHARED POOL 5GB。

3.关闭DRM相关参数

10G

    _gc_affinity_time=0  

    _gc_undo_affinity=FALSE  

这2个参数是静态参数,也就是说必须要重启实例才能生效。

实际上可以设置另外2个动态的隐含参数,来达到这个目的。按下面的值设置这2个参数之后,不能完全算是禁止/关闭了DRM,而是从”事实上“关闭了DRM。

--可以设置更高的值。

    _gc_affinity_limit=250  

    _gc_affinity_minimum=10485760  

alter system set "_gc_affinity_limit"=250;

alter system set "_gc_affinity_minimum"=10485760;

甚至可以将以上2个参数值设置得更大。这2个参数是立即生效的,在所有的节点上设置这2个参数之后,系统不再进行DRM。

4.建议继续加强对于RAC集群、数据库运行情况、主机资源使用的监控。再次出现类似异常时可以手动收集HANG和Systemstate dump信息来辅助分析,以协助排查。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值