ORA-00494: enqueue [CF] held for too long (more than 900 seconds) -RAC

最新推荐文章于 2022-03-08 17:20:01 发布

csdn0622

最新推荐文章于 2022-03-08 17:20:01 发布

阅读量286

点赞数 1

文章标签：网络数据库运维

客户的一台rac一个月hang死一次，上次重启机器后一切正常，这次必须一次性解决，生产库不可能老是重启赛。

##############################################
系统软件日志：syslog.log
##############
Jan 6 04:57:51 db2 cimserver[1971]: PGS13606: A failure was detected in provider module HPUXStorageProviderModule. The generation of indications by pro
viders in this module may be affected. To ensure these providers are serving active subscriptions, disable and then re-enable this module using the cimproviu
der command.
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:41:47 db2 above message repeats 47 times
##############
uamdb2 automountd[1485]和oracle的ORA-00494: enqueue [CF] held 同时报错，在11月28日出问题时也是同时报错。
咨询HP工程师，回复说uamdb2 automountd[1485]错误是由网络问题，或者是DNS服务解析出问题引起的。
停止了系统的autofs服务后，系统报错消失，但oracle仍然报错ORA-00494: enqueue [CF] held

##############################################
oracle日志：alert_db2.log
同时发现db2_diag_26520.trc未生成
##############
Wed Jan 6 06:24:15 2010
Errors in file /oracle/admin/db/bdump/db2_diag_26520.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 22109'
Wed Jan 6 06:24:18 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:24:37 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:25:15 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:25:34 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:27:24 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:28:24 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:28:31 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:29:31 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
##############################################
oracle 监听器日志：/oracle/product/102/network/log/listener_db2.log
##############
06-JAN-2010 06:08:41 * service_died * uamdb1 * 12537
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman@db1)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57242)) * establish * db * 0
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=uamdb)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57244)) * establish * db * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57246)) * establish * uamdb * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57247)) * establish * db * 0
06-JAN-2010 06:09:04 * service_update * db2 * 0
06-JAN-2010 06:09:22 * service_update * db2 * 0
06-JAN-2010 06:09:44 * service_register * db1 * 1184
##############################################
rac crs资源检查
##############
node1
##############
uamdb1#[/]crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.uamdb.db   application    ONLINE    ONLINE    uamdb2
ora....b1.inst application    ONLINE    ONLINE    uamdb1
ora....b2.inst application    ONLINE    ONLINE    uamdb2
ora....SM1.asm application    ONLINE    ONLINE    uamdb1
ora....B1.lsnr application    ONLINE    ONLINE    uamdb1
ora.uamdb1.gsd application    ONLINE    ONLINE    uamdb1
ora.uamdb1.ons application    ONLINE    ONLINE    uamdb1
ora.uamdb1.vip application    ONLINE    ONLINE    uamdb1
ora....SM2.asm application    ONLINE    ONLINE    uamdb2
ora....B2.lsnr application    ONLINE    ONLINE    uamdb2
ora.uamdb2.gsd application    ONLINE    ONLINE    uamdb2
ora.uamdb2.ons application    ONLINE    ONLINE    uamdb2
ora.uamdb2.vip application    ONLINE    ONLINE    uamdb2
uamdb1#[/]crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
##############
通过以上检查，发现node1资源状态完全正常
##############
node2
##############
uamdb2#[/]crs_stat -t
#挂死，不出结果
uamdb2#[/]crsctl check crs
#挂死，不出结果
uamdb2#[/]crsctl check crsd
CRS appears healthy
uamdb2#[/]crsctl check evmd
EVM appears healthy
uamdb2#[/]crsctl check cssd
#挂死，不出结果
##############
通过以上检查，发现node2资源状态不正常，css挂死。

##############
检查oracle系统问题
##############
node2
##############
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
PX Deq: reap credit                                                       1
name-service call wait                                                    1
os thread startup                                                        14
control file sequential read                                             19
ksdxexeotherwait                                                         56
enq: WF - contention                                                    859
control file parallel write                                            2782
enq: CF - contention                                                   5493
                                                                      24914
##############
enq: CF - contention ：5493   CFqueue很严重
##############
锁等待：
sys@DB_SQL>   SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;

USERNAME                        OBJECT_ID SESSION_ID    SERIAL# ORACLE_USERNAME                OS_USER_NAME                   PROCESS
------------------------------ ---------- ---------- ---------- ------------------- ------------------------------
                                     9167       1052      40224                                oracle                         24193
                                     9238       1052      40224                                oracle                         24193

2 rows selected.
sys@DB_SQL> select sid,serial#,username,program,sql_hash_value,
2           to_char(logon_time,'yyyy/mm/dd hh24:mi:ss') as login_time
3    from v$session
4   where paddr in ( select addr from v$process where spid=&spid);
Enter value for spid: 24193

   SID SERIAL# USERNAME   PROGRAM                          hash_value LOGIN_TIME
------ ------- ---------- -------------------------------- ---------- -------------------
1052   40224            oracle@db2 (m000)                      0 2010/01/07 09:33:18
1 row selected.
sys@UAMDB_SQL> SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode, l.request, o.object_name, o.object_type
2 FROM v$lock l, dba_objects o, v$session s
3 WHERE l.id1 = o.object_id (+)
4 AND   l.sid= s.sid
5 and l.sid=&sid
6 ORDER BY sid, type;
Enter value for sid: 1052

   SID    BLOCKER EVENT                TYPE            LMODE    REQUEST OBJECT_NAME OBJECT_TYPE
------ ---------- -------------------- ---------- ---------- ----------
1052            gc current request   SH                  6          0
1052            gc current request   TM                  6          0 WRH$_ACTIVE_SESSION_HISTORY TABLE
1052            gc current request   TM                  3          0 WRM$_DATABASE_INSTANCE TABLE
1052            gc current request   TX                  6          0
1052            gc current request   WF                  4          0
sys@DB_SQL> SELECT s.sid , s.serial#, s.status, s.server, s.username, e.event, e.time_waited
2 FROM v$session_event e, v$session s
3 WHERE e.sid=s.sid
4 and e.sid=&sid;
Enter value for sid: 1052

   SID SERIAL# STATUS   SERVER    USERNAME   EVENT                TIME_WAITED
------ ------- -------- --------- ---------- -------------------- -----------
1052   40224 ACTIVE   DEDICATED            class slave wait               0
1052   40224 ACTIVE   DEDICATED            control file sequential read           0
1052   40224 ACTIVE   DEDICATED            gc cr block 2-way              0
1052   40224 ACTIVE   DEDICATED            gc current block 2-way           0
1052   40224 ACTIVE   DEDICATED            gc current block busy           0
1052   40224 ACTIVE   DEDICATED            gc current grant busy           0
1052   40224 ACTIVE   DEDICATED            enq: TM - contention           4
1052   40224 ACTIVE   DEDICATED            row cache lock                 0
1052   40224 ACTIVE   DEDICATED            library cache lock             0
1052   40224 ACTIVE   DEDICATED            events in waitclass Other            0

该session 有4个TM等待，TM等待一般是由于外键没有索引引起的
################
外键没有索引的列
################
sys@DB_SQL> SELECT * FROM (
2     SELECT c.table_name, cc.column_name, cc.position column_position
3     FROM   user_constraints c, user_cons_columns cc
4     WHERE c.constraint_name = cc.constraint_name
5     AND    c.constraint_type = 'R'
6     MINUS
7     SELECT i.table_name, ic.column_name, ic.column_position
8     FROM   user_indexes i, user_ind_columns ic
9     WHERE i.index_name = ic.index_name
10     )
11 ORDER BY table_name, column_position;

TABLE_NAME                     COLUMN_NAME                              COLUMN_POSITION
------------------------------ ---------------------------------------- ---------------
HS$_CLASS_CAPS                 CAP_NUMBER                                             1
HS$_CLASS_DD                   DD_TABLE_ID                                            1
HS$_FDS_INST                   FDS_CLASS_ID                                           1
HS$_INST_CAPS                  CAP_NUMBER                                             1
HS$_INST_DD                    DD_TABLE_ID                                            1
OLAP_OLEDB_MDPROPS             VALUE                                                  1
REGISTRY$                      PID                                                    2
WRM$_SNAPSHOT                  INSTANCE_NUMBER                                        2
WRM$_SNAPSHOT                  STARTUP_TIME                                           3

9 rows selected.
#######################
enqueue的状态
#######################
sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;

   INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
         2 KM          2           1          2           0             1
         2 RR          2           1          2           0             1
         2 PE     397651           3     397651           0             1
         2 TL          2           1          2           0             1
         2 SW         29           1         29           0             1
         2 XR        781           1        781           0             1
         2 CN          7           1          7           0             1
         2 PW         14           2          2          12             2
         2 RT          6           4          5           1             4
         2 PG         27          10         27           0             5
         2 DL         82          16         82           0             6
         2 BR        102          41        102           0            11
         2 KO         81          27         81           0            12
         2 UL       1395         149       1395           0            32
         2 WP         38          32         38           0            36
         2 MR      62790          68      62790           0            38
         2 SK        263         120        263           0            40
         2 IR       7173         160       7173           0            46
         2 TS       2904         381       2904           0            84
         2 PI        600         211        600           0            94
         2 MD        487         244        487           0           100
         2 TO       1139         543       1139           0           142
         2 RS       1332         599       1332           0           178
         2 TC       1095         776       1095           0           308
         2 TX     403790         502     403790           0           422
         2 FB       9488        2532       9488           0           682
         2 TT     362946        3822     362946           0           882
         2 AF      17905        1913      17905           0          1075
         2 MW       5111        3746       5111           0          1527
         2 HW     610709        5172     610709           0          1632
         2 PR      28957          36      28957           0          1763
         2 DR       5112        5014       5112           0          2646
         2 TD      21603       15045      21603           0          3441
         2 RO      21953        7763      21953           0          4177
         2 US      40633       32406      40633           0          6173
         2 TA      35746       26669      35746           0          9295
         2 JS 104933748        5264 104933573         175         16486
         2 WL        580         260        358         222        178461
         2 PS    1550184      690899    1537349       12835        250683
         2 WF      63022       40630      44837       18185      47065384
         2 FU        264         261          8         256     230306409
         2 CF   20987280       11795   20970328       16933     407841619
         2 TM     664700       13494     664021         679     926373131

CF和TM都是相当的高。
##############
node1
##############
sys@DB_SQL> SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;

no rows selected
##############
没有锁等待
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
rdbms ipc reply                                                           1
gc current block busy                                                     1
null event                                                                1
gc cr grant 2-way                                                         1
enq: PS - contention                                                      1
gc current block 2-way                                                    2
library cache lock                                                        2
latch free                                                                2
gc cr block 2-way                                                         3
gc current grant busy                                                     4
log file sequential read                                                  4
Log archive I/O                                                           4
DFS lock handle                                                           5
gc cr disk read                                                           5
Backup: sbtclose2                                                         6
ges inquiry response                                                      7
row cache lock                                                           10
wait for scn ack                                                         18
db file parallel write                                                   19
log file parallel write                                                  21
control file parallel write                                              22
Backup: sbtinit2                                                         36
log file sync                                                            41
control file sequential read                                             50
os thread startup                                                        55
Backup: sbtbackup                                                        83
db file sequential read                                                 111
ksdxexeotherwait                                                        186
RMAN backup & recovery I/O                                              328
                                                                        693
Backup: sbtwrite2                                                      1591
enq: WF - contention                                                   2709
reliable message                                                       3714
enq: TM - contention                                                  10651
enq: CF - contention                                                  15625

35 rows selected.

sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;

   INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
         1 TL          1           1          1           0             1
         1 FU         57           2         57           0             3
         1 DL         62          11         62           0             5
         1 SK         76          22         76           0            14
         1 PI        160          53        160           0            15
         1 PG         32          25         32           0            16
         1 MR      23713          34      23713           0            24
         1 WP         21          21         21           0            25
         1 MD        134          68        134           0            32
         1 TO        312         153        312           0            45
         1 RS        364         160        364           0            63
         1 BR        459         176        459           0            68
         1 HW     622463         747     622463           0           230
         1 TX    1785623          84    1785623           0           284
         1 PR       9439           6       9439           0           317
         1 AF       5256         573       5256           0           385
         1 FB       7088        1358       7088           0           597
         1 MW       1364        1364       1364           0           608
         1 TC        685         203        685           0           645
         1 DR       1364        1364       1364           0           902
         1 TT     100900        4119     100900           0           975
         1 RO       7356        2134       7356           0          1037
         1 TD       5352        5352       5352           0          1511
         1 US      10831        6203      10831           0          1523
         1 JS   15730382        1599   15730336          46          2122
         1 TA       8234        8107       8234           0          3271
         1 PS     405099      156349     401698        3401         63259
         1 WL        159          63         92          67         64756
         1 TM    4522892        4094    4522890           0        317188
         1 WF      19798        7608      13427        6371       3337224
         1 CF    5556633        3787    5554145        2482      66750958
###################################################################################
问题现状跟Bug 6857917很相像。

Description

An Oracle foreground process can cause a cluster wide hang if it is holding a critical enqueue (such as CF) and it hangs when registering with CSS. With this fix the process will attempt to register BEFORE acquiring such an enqueue. Note: This fix does not attempt to fix any process hang when registering will CSS, but instead tries to ensure that if such a hang does occur then the impact is reduced.

#####################

最终处理方式：升级到10.2.0.4.2后，数据库正常，再也没有发生hang死的情况。

#####################

故障在2个月后又重现，这次oracle的原厂工程师分析后，指出系统的内存泄漏的bug问题，同时我也处理了个oracle的内存泄漏的bug，期待问题能解决。

hp：建议补丁：建议安装补丁：PHKL_40208；我最终是打上了最新的golden patch

oracle：

------------------------------------

下面是METALINK给出的解释：

I have checked the information provided. crsd is reporting timeouts when running the checking
actions. The racgmain process on the other hand, executed by racgwrap, remain
running so the number is increasing.

This issue has been reported in Bug 7009245 ""RACGMAIN CHECK" PROCESS NOT TERMINATING", and drill down and
determine the root cause this require to get a stack trace from racgmain
process(es) as well as you should run OSWatcher with a specific parameters,
finally the issue of Bug 7009245 has been fixed by the fix of Bug 6196746
(unpublished bug -- summary is "racgwrap spawns racgmain child process. crsd
timeout kills racgwrap but bot racgmain")

So please perform. the following action plan that intend to solve the issue.
.
ACTION PLAN
============
1. Stop CRS on this node.
2. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin
3. Edit the file racgwrap and modify the last 3 lines from:
.
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
.
to:
.
# Line added to test fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
.
4. Restart CRS and make sure that all the resources are startes

This should solve the issue. also if you are not able to stop CRS right now, you can try the
solution without crs stop/start.