对于RAC环境而言,两个节点间出现资源抢占的情况不足为奇,不过如果类型的信息有规律的频繁发生,就说明一定问题了。
在RAC环境的其中一个节点上,观察到alert文件中包含了大量的GES信息:
Wed Dec 23 04:50:02 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 05:16:12 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 05:50:04 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 06:16:13 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 06:50:02 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
.
.
.
Wed Dec 23 13:16:21 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 13:50:16 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
Wed Dec 23 14:16:23 2009
GES: Potential blocker (pid=5746) on resource FU-00000000-00000000;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_7668.trc and DIAG trace file
错误信息出现的相当有规律,每个小时两次,是在16分和50分的时候出现。而错误信息指出,当前会话要求的资源被潜在的持有者锁住。
检查一下对应的trace文件:
*** 2009-12-23 14:50:19.090
DUMP LOCAL BLOCKER/HOLDER: block level 5 res [0x0][0x0],[FU]
----------resource 0x4051f3fa0----------------------
resname : [0x0][0x0],[FU]
Local node : 1
dir_node : 0
master_node : 0
hv idx : 95
hv last r.inc : 32
current inc : 34
hv status : 0
hv master : 0
open options : dd
Held mode : KJUSEREX
Cvt mode : KJUSERNL
Next Cvt mode : KJUSERNL
msg_seq : 1f
res_seq : 727
grant_bits : KJUSERNL KJUSEREX
grant mode : KJUSERNL KJUSERCR KJUSERCW KJUSERPR KJUSERPW KJUSEREX
count : 1 0 0 0 0 1
val_state : KJUSERVS_NOVALUE
valblk : 0x00000000000000000000000000000000 .
access_node : 1
vbreq_state : 0
state : x8
resp : 4051f3fa0
On Scan_q? : N
Total accesses: 201698
Imm. accesses: 4178
Granted_locks : 1
Cvting_locks : 1
value_block: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GRANTED_Q :
lp 6d1451ed0 gl KJUSEREX rp 4051f3fa0 [0x0][0x0],[FU]
master 0 gl owner 6d11c8a80 possible pid 5746 xid 2002-002D-0000FDD7 bast 0 rseq 727 mseq 0 history 0x9a5149a5
open opt KJUSERDEADLOCK KJUSERIDLK
CONVERT_Q:
lp 6d1398f40 gl KJUSERNL rl KJUSEREX rp 4051f3fa0 [0x0][0x0],[FU]
master 0 gl owner 6d11b3f80 possible pid 3518 xid 2003-003D-00000628 bast 0 rseq 727 mseq 0 history 0x1495149a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERTRCCANCEL
----------enqueue 0x6d1451ed0------------------------
lock version : 9259971
Owner node : 1
grant_level : KJUSEREX
req_level : KJUSEREX
bast_level : KJUSERNL
notify_func : 0
resp : 4051f3fa0
procp : 5396a5b30
pid : 5746
proc version : 876
oprocp : 0
opid : 0
group lock owner : 6d11c8a80
possible pid : 5746
xid : 2002-002D-0000FDD7
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : GRANTED
Open Options : KJUSERDEADLOCK KJUSERIDLK
Convert options : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERTRCCANCEL
History : 0x9a5149a5
Msg_Seq : 0x0
res_seq : 727
valblk : 0x00000000000000000000000000000000 .
Potential blocker (pid=5746) on resource FU-00000000-00000000
DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT
possible owner[45.5746]
Submitting asynchronized dump request [28]
从上面的信息可以看到,问题和进程5746有关,对于操作系统的进程5746对应到数据字典V$PROCESS就是SPID,查询视图V$PROCESS:
SQL> SELECT PID, SPID, PROGRAM
2 FROM V$PROCESS
3 WHERE SPID = 5746;
PID SPID PROGRAM
---------- ------------ ------------------------------------------------
45 5746 oracle@newtrade2 (m001)
这是Oracle后台用于统计功能的轻量级进程。检查进程对应的会话:
SQL> SELECT SID, SERIAL#, STATUS, MODULE, ACTION
2 FROM V$SESSION
3 WHERE PADDR IN
4 (SELECT ADDR
5 FROM V$PROCESS
6 WHERE SPID = 5746);
SID SERIAL# STATUS MODULE ACTION
---------- ---------- -------- --------------------------------- ----------------------
537 27331 ACTIVE MMON_SLAVE Auto-DBFUS Action
这个进程确实是定时启动,因此这个进程导致错误发生的如此规律是合理的,不过肯定存在一个长时间僵死的会话,否则不会导致每次m001进程都被迫中止。
检查系统锁信息:
SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK
2 FROM V$LOCK
3 WHERE TYPE != 'MR'
4 ORDER BY CTIME DESC;
SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
547 XR 4 0 1 0 5466896 2
547 CF 0 0 2 0 5466896 2
273 XR 0 0 1 0 5466896 2
548 RT 2 0 6 0 5466893 2
547 RS 25 1 2 0 5466893 2
276 DM 1 0 4 0 5466890 2
276 RT 2 2 6 0 5466890 2
548 RT 2 1 6 0 5466890 2
546 TS 3 2 3 0 5466885 2
261 WF 0 0 4 0 330649 2
261 WF 70 0 6 0 330646 2
261 WF 12 0 6 0 330646 2
261 TX 1179649 117887 6 0 330646 2
261 TM 8964 0 3 0 330646 2
261 TM 8968 0 3 0 330646 2
261 TM 8958 0 3 0 330646 2
235 JQ 0 12 6 0 323471 2
517 JQ 0 13 6 0 323470 2
235 TO -39860071 1 3 0 323466 2
235 TX 1114182 75160 6 0 323466 2
517 TM 179809 0 3 0 323461 2
517 TO 179809 1 3 0 323461 2
517 TX 720966 117862 6 0 323461 2
249 TM 9201 0 3 0 315239 2
249 TX 1310732 238546 6 0 315239 2
249 TM 9130 0 6 0 315239 2
249 SH 0 0 6 0 315239 0
537 FU 0 0 6 0 65780 2
537 TX 1245223 127849 6 0 65777 2
537 TM 8785 0 3 0 65777 2
已选择30行。
系统中还真是存在一些长时间的锁,检查一下对应的事务信息:
SQL> SELECT XIDUSN, XIDSLOT, XIDSQN, STATUS, START_TIME, FLAG
2 FROM V$TRANSACTION
3 ORDER BY START_TIME;
XIDUSN XIDSLOT XIDSQN STATUS START_TIME FLAG
---------- ---------- ---------- ---------------- -------------------- ----------
18 1 117887 ACTIVE 12/19/09 22:00:25 67116579
17 70 75160 ACTIVE 12/20/09 00:00:05 83891747
11 70 117862 ACTIVE 12/20/09 00:00:10 16782851
20 12 238546 ACTIVE 12/20/09 02:17:12 83893795
19 39 127849 ACTIVE 12/22/09 23:34:51 67116579
这些事务也存在很长时间了,检查V$TRANSACTION_ENQUEUE视图:
SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK
2 FROM V$TRANSACTION_ENQUEUE
3 ORDER BY CTIME DESC;
SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
261 TX 1179649 117887 6 0 331203 2
235 TX 1114182 75160 6 0 324023 2
517 TX 720966 117862 6 0 324018 2
249 TX 1310732 238546 6 0 315796 2
537 TX 1245223 127849 6 0 66334 2
根据运行时间的长短来判断,很可能问题出现在会话261上,检查会话信息:
SQL> SELECT SID, SERIAL#, PROGRAM, MODULE
2 FROM V$SESSION
3 WHERE SID = 261;
SID SERIAL# PROGRAM MODULE
---------- ---------- -------------------------------------- ----------------------------
261 23045 oracle@newtrade2 (m000) MMON_SLAVE
看了问题又是和M000进程有关,看看这个会话在等待什么:
SQL> SELECT SID, EVENT, P1TEXT, P1, P2TEXT, P2, P3TEXT, P3, SECONDS_IN_WAIT
2 FROM V$SESSION_WAIT
3 WHERE SID = 261;
SID EVENT P1TEXT P1 P2TEXT P2 P3TEXT P3 SECONDS_IN_WAIT
-------- ------------------ ------ ----- ------ ------- ------ -------- ---------------
261 gc current request file# 4 block# 63685 id# 33619976 331702
数据库在等待CLUSTER传送CACHE,怀疑可能是网络的瞬断导致的问题。
由于M000进程是轻量级进程,尝试在后台kill -9 spid的方式清楚进程:
SQL> SELECT SPID
2 FROM V$PROCESS P, V$SESSION S
3 WHERE P.ADDR = S.PADDR
4 AND SID = 261;
SPID
------------
14234
SQL> HOST kill -9 14234
检查剩下的几个持有锁的会话,发现分别是M000、M002、J004和J005进程。里面居然还有两个未完成的JOB,好在中止这些进程并不会造成系统问题,通过V$PROCESS查询这些进程对应SPID,并在操作系统中清除:
SQL> SELECT 'HOST kill -9 ' || SPID
2 FROM V$PROCESS
3 WHERE ADDR IN
4 (SELECT PADDR
5 FROM V$SESSION S, V$TRANSACTION_ENQUEUE TE
6 WHERE S.SID = TE.SID
7 AND TE.CTIME > 10000);
'HOSTKILL-9'||SPID
-------------------------
HOST kill -9 3122
HOST kill -9 21388
HOST kill -9 3124
HOST kill -9 5746
SQL> HOST kill -9 3122
SQL> HOST kill -9 21388
SQL> HOST kill -9 3124
SQL> HOST kill -9 5746
会话清除后,发现问题并没有完全解决。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-627301/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/4227/viewspace-627301/