客户的一台rac一个月hang死一次,上次重启机器后一切正常,这次必须一次性解决,生产库不可能老是重启赛。
##############################################
系统软件日志:syslog.log
##############
Jan 6 04:57:51 db2 cimserver[1971]: PGS13606: A failure was detected in provider module HPUXStorageProviderModule. The generation of indications by pro
viders in this module may be affected. To ensure these providers are serving active subscriptions, disable and then re-enable this module using the cimproviu
der command.
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:41:47 db2 above message repeats 47 times
##############
uamdb2 automountd[1485]和oracle的ORA-00494: enqueue [CF] held 同时报错,在11月28日出问题时也是同时报错。
咨询HP工程师,回复说uamdb2 automountd[1485]错误是由网络问题,或者是DNS服务解析出问题引起的。
停止了系统的autofs服务后,系统报错消失,但oracle仍然报错ORA-00494: enqueue [CF] held
##############################################
oracle日志:alert_db2.log
同时发现db2_diag_26520.trc未生成
##############
Wed Jan 6 06:24:15 2010
Errors in file /oracle/admin/db/bdump/db2_diag_26520.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 22109'
Wed Jan 6 06:24:18 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:24:37 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:25:15 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:25:34 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:27:24 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:28:24 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:28:31 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:29:31 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
##############################################
oracle 监听器日志:/oracle/product/102/network/log/listener_db2.log
##############
06-JAN-2010 06:08:41 * service_died * uamdb1 * 12537
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman@db1)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57242)) * establish * db * 0
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=uamdb)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57244)) * establish * db * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57246)) * establish * uamdb * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57247)) * establish * db * 0
06-JAN-2010 06:09:04 * service_update * db2 * 0
06-JAN-2010 06:09:22 * service_update * db2 * 0
06-JAN-2010 06:09:44 * service_register * db1 * 1184
##############################################
rac crs资源检查
##############
node1
##############
uamdb1#[/]crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.uamdb.db application ONLINE ONLINE uamdb2
ora....b1.inst application ONLINE ONLINE uamdb1
ora....b2.inst application ONLINE ONLINE uamdb2
ora....SM1.asm application ONLINE ONLINE uamdb1
ora....B1.lsnr application ONLINE ONLINE uamdb1
ora.uamdb1.gsd application ONLINE ONLINE uamdb1
ora.uamdb1.ons application ONLINE ONLINE uamdb1
ora.uamdb1.vip application ONLINE ONLINE uamdb1
ora....SM2.asm application ONLINE ONLINE uamdb2
ora....B2.lsnr application ONLINE ONLINE uamdb2
ora.uamdb2.gsd application ONLINE ONLINE uamdb2
ora.uamdb2.ons application ONLINE ONLINE uamdb2
ora.uamdb2.vip application ONLINE ONLINE uamdb2
uamdb1#[/]crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
##############
通过以上检查,发现node1资源状态完全正常
##############
node2
##############
uamdb2#[/]crs_stat -t
#挂死,不出结果
uamdb2#[/]crsctl check crs
#挂死,不出结果
uamdb2#[/]crsctl check crsd
CRS appears healthy
uamdb2#[/]crsctl check evmd
EVM appears healthy
uamdb2#[/]crsctl check cssd
#挂死,不出结果
##############
通过以上检查,发现node2资源状态不正常,css挂死。
##############
检查oracle系统问题
##############
node2
##############
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
PX Deq: reap credit 1
name-service call wait 1
os thread startup 14
control file sequential read 19
ksdxexeotherwait 56
enq: WF - contention 859
control file parallel write 2782
enq: CF - contention 5493
24914
##############
enq: CF - contention :5493 CFqueue很严重
##############
锁等待:
sys@DB_SQL> SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;
USERNAME OBJECT_ID SESSION_ID SERIAL# ORACLE_USERNAME OS_USER_NAME PROCESS
------------------------------ ---------- ---------- ---------- ------------------- ------------------------------
9167 1052 40224 oracle 24193
9238 1052 40224 oracle 24193
2 rows selected.
sys@DB_SQL> select sid,serial#,username,program,sql_hash_value,
2 to_char(logon_time,'yyyy/mm/dd hh24:mi:ss') as login_time
3 from v$session
4 where paddr in ( select addr from v$process where spid=&spid);
Enter value for spid: 24193
SID SERIAL# USERNAME PROGRAM hash_value LOGIN_TIME
------ ------- ---------- -------------------------------- ---------- -------------------
1052 40224 oracle@db2 (m000) 0 2010/01/07 09:33:18
1 row selected.
sys@UAMDB_SQL> SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode, l.request, o.object_name, o.object_type
2 FROM v$lock l, dba_objects o, v$session s
3 WHERE l.id1 = o.object_id (+)
4 AND l.sid= s.sid
5 and l.sid=&sid
6 ORDER BY sid, type;
Enter value for sid: 1052
SID BLOCKER EVENT TYPE LMODE REQUEST OBJECT_NAME OBJECT_TYPE
------ ---------- -------------------- ---------- ---------- ----------
1052 gc current request SH 6 0
1052 gc current request TM 6 0 WRH$_ACTIVE_SESSION_HISTORY TABLE
1052 gc current request TM 3 0 WRM$_DATABASE_INSTANCE TABLE
1052 gc current request TX 6 0
1052 gc current request WF 4 0
sys@DB_SQL> SELECT s.sid , s.serial#, s.status, s.server, s.username, e.event, e.time_waited
2 FROM v$session_event e, v$session s
3 WHERE e.sid=s.sid
4 and e.sid=&sid;
Enter value for sid: 1052
SID SERIAL# STATUS SERVER USERNAME EVENT TIME_WAITED
------ ------- -------- --------- ---------- -------------------- -----------
1052 40224 ACTIVE DEDICATED class slave wait 0
1052 40224 ACTIVE DEDICATED control file sequential read 0
1052 40224 ACTIVE DEDICATED gc cr block 2-way 0
1052 40224 ACTIVE DEDICATED gc current block 2-way 0
1052 40224 ACTIVE DEDICATED gc current block busy 0
1052 40224 ACTIVE DEDICATED gc current grant busy 0
1052 40224 ACTIVE DEDICATED enq: TM - contention 4
1052 40224 ACTIVE DEDICATED row cache lock 0
1052 40224 ACTIVE DEDICATED library cache lock 0
1052 40224 ACTIVE DEDICATED events in waitclass Other 0
该session 有4个TM等待,TM等待一般是由于 外键没有索引 引起的
################
外键没有索引的列
################
sys@DB_SQL> SELECT * FROM (
2 SELECT c.table_name, cc.column_name, cc.position column_position
3 FROM user_constraints c, user_cons_columns cc
4 WHERE c.constraint_name = cc.constraint_name
5 AND c.constraint_type = 'R'
6 MINUS
7 SELECT i.table_name, ic.column_name, ic.column_position
8 FROM user_indexes i, user_ind_columns ic
9 WHERE i.index_name = ic.index_name
10 )
11 ORDER BY table_name, column_position;
TABLE_NAME COLUMN_NAME COLUMN_POSITION
------------------------------ ---------------------------------------- ---------------
HS$_CLASS_CAPS CAP_NUMBER 1
HS$_CLASS_DD DD_TABLE_ID 1
HS$_FDS_INST FDS_CLASS_ID 1
HS$_INST_CAPS CAP_NUMBER 1
HS$_INST_DD DD_TABLE_ID 1
OLAP_OLEDB_MDPROPS VALUE 1
REGISTRY$ PID 2
WRM$_SNAPSHOT INSTANCE_NUMBER 2
WRM$_SNAPSHOT STARTUP_TIME 3
9 rows selected.
#######################
enqueue的状态
#######################
sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;
INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
2 KM 2 1 2 0 1
2 RR 2 1 2 0 1
2 PE 397651 3 397651 0 1
2 TL 2 1 2 0 1
2 SW 29 1 29 0 1
2 XR 781 1 781 0 1
2 CN 7 1 7 0 1
2 PW 14 2 2 12 2
2 RT 6 4 5 1 4
2 PG 27 10 27 0 5
2 DL 82 16 82 0 6
2 BR 102 41 102 0 11
2 KO 81 27 81 0 12
2 UL 1395 149 1395 0 32
2 WP 38 32 38 0 36
2 MR 62790 68 62790 0 38
2 SK 263 120 263 0 40
2 IR 7173 160 7173 0 46
2 TS 2904 381 2904 0 84
2 PI 600 211 600 0 94
2 MD 487 244 487 0 100
2 TO 1139 543 1139 0 142
2 RS 1332 599 1332 0 178
2 TC 1095 776 1095 0 308
2 TX 403790 502 403790 0 422
2 FB 9488 2532 9488 0 682
2 TT 362946 3822 362946 0 882
2 AF 17905 1913 17905 0 1075
2 MW 5111 3746 5111 0 1527
2 HW 610709 5172 610709 0 1632
2 PR 28957 36 28957 0 1763
2 DR 5112 5014 5112 0 2646
2 TD 21603 15045 21603 0 3441
2 RO 21953 7763 21953 0 4177
2 US 40633 32406 40633 0 6173
2 TA 35746 26669 35746 0 9295
2 JS 104933748 5264 104933573 175 16486
2 WL 580 260 358 222 178461
2 PS 1550184 690899 1537349 12835 250683
2 WF 63022 40630 44837 18185 47065384
2 FU 264 261 8 256 230306409
2 CF 20987280 11795 20970328 16933 407841619
2 TM 664700 13494 664021 679 926373131
CF和TM都是相当的高。
##############
node1
##############
sys@DB_SQL> SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;
no rows selected
##############
没有锁等待
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
rdbms ipc reply 1
gc current block busy 1
null event 1
gc cr grant 2-way 1
enq: PS - contention 1
gc current block 2-way 2
library cache lock 2
latch free 2
gc cr block 2-way 3
gc current grant busy 4
log file sequential read 4
Log archive I/O 4
DFS lock handle 5
gc cr disk read 5
Backup: sbtclose2 6
ges inquiry response 7
row cache lock 10
wait for scn ack 18
db file parallel write 19
log file parallel write 21
control file parallel write 22
Backup: sbtinit2 36
log file sync 41
control file sequential read 50
os thread startup 55
Backup: sbtbackup 83
db file sequential read 111
ksdxexeotherwait 186
RMAN backup & recovery I/O 328
693
Backup: sbtwrite2 1591
enq: WF - contention 2709
reliable message 3714
enq: TM - contention 10651
enq: CF - contention 15625
35 rows selected.
sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;
INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
1 TL 1 1 1 0 1
1 FU 57 2 57 0 3
1 DL 62 11 62 0 5
1 SK 76 22 76 0 14
1 PI 160 53 160 0 15
1 PG 32 25 32 0 16
1 MR 23713 34 23713 0 24
1 WP 21 21 21 0 25
1 MD 134 68 134 0 32
1 TO 312 153 312 0 45
1 RS 364 160 364 0 63
1 BR 459 176 459 0 68
1 HW 622463 747 622463 0 230
1 TX 1785623 84 1785623 0 284
1 PR 9439 6 9439 0 317
1 AF 5256 573 5256 0 385
1 FB 7088 1358 7088 0 597
1 MW 1364 1364 1364 0 608
1 TC 685 203 685 0 645
1 DR 1364 1364 1364 0 902
1 TT 100900 4119 100900 0 975
1 RO 7356 2134 7356 0 1037
1 TD 5352 5352 5352 0 1511
1 US 10831 6203 10831 0 1523
1 JS 15730382 1599 15730336 46 2122
1 TA 8234 8107 8234 0 3271
1 PS 405099 156349 401698 3401 63259
1 WL 159 63 92 67 64756
1 TM 4522892 4094 4522890 0 317188
1 WF 19798 7608 13427 6371 3337224
1 CF 5556633 3787 5554145 2482 66750958
###################################################################################
问题现状跟Bug 6857917很相像。
Description
An Oracle foreground process can cause a cluster wide hang if it is holding a critical enqueue (such as CF) and it hangs when registering with CSS. With this fix the process will attempt to register BEFORE acquiring such an enqueue. Note: This fix does not attempt to fix any process hang when registering will CSS, but instead tries to ensure that if such a hang does occur then the impact is reduced.
#####################
最终处理方式:升级到10.2.0.4.2后,数据库正常,再也没有发生hang死的情况。
#####################
故障在2个月后又重现,这次oracle的原厂工程师分析后,指出系统的内存泄漏的bug问题,同时我也处理了个oracle的内存泄漏的bug,期待问题能解决。
hp:建议补丁:建议安装补丁:PHKL_40208;我最终是打上了最新的golden patch
oracle:
------------------------------------
下面是METALINK给出的解释:
I have checked the information provided. crsd is reporting timeouts when running the checking
actions. The racgmain process on the other hand, executed by racgwrap, remain
running so the number is increasing.
This issue has been reported in Bug 7009245 ""RACGMAIN CHECK" PROCESS NOT TERMINATING", and drill down and
determine the root cause this require to get a stack trace from racgmain
process(es) as well as you should run OSWatcher with a specific parameters,
finally the issue of Bug 7009245 has been fixed by the fix of Bug 6196746
(unpublished bug -- summary is "racgwrap spawns racgmain child process. crsd
timeout kills racgwrap but bot racgmain")
So please perform. the following action plan that intend to solve the issue.
.
ACTION PLAN
============
1. Stop CRS on this node.
2. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin
3. Edit the file racgwrap and modify the last 3 lines from:
.
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
.
to:
.
# Line added to test fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
.
4. Restart CRS and make sure that all the resources are startes
This should solve the issue. also if you are not able to stop CRS right now, you can try the
solution without crs stop/start.
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/11088128/viewspace-675791/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/11088128/viewspace-675791/