一客户生产库于上午7点半左右出现拥堵,应用客户反映系统无法使用,像HANG住一样。现场工程师检查数据库发现大量的“cursor: pin S wait on X”和“library cache lock”等待事件。
前台应用堵塞,与“cursor: pin S wait on X”和“library cache lock”等待事件有一定关系,但本次故障是由计划任务GATHER_STATS_JOB所在会话自堵塞,使得资源未释放导致。下面是详细的分析过程:
正如下图红色字体所示,的确我们看到后台出现大量的librarycache lock、cursor: pin S wait on X等待事件。SQL> set linesize 200
set pagesize 100
SQL> SQL> col SAMPLE_TIME for a30
SQL> col event for a60
SQL> col PROGRAM for a20
SQL> select
2 event,count(*)
3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 group by event
8 order by count(*)
9 ;
EVENT COUNT(*)
------------------------------------------------------------ ----------
enq: JS - wdw op 1
enq: TX - allocate ITL entry 1
SQL*Net message to client 1
enq: RO - fast object reuse 1
direct path write temp 1
ges global resource directory to be frozen 1
latch free 1
JS kill job wait 1
enq: TC - contention 1
ges remote message 1
gc current block congested 1
enq: HW - contention 1
gc current block busy 2
local write wait 2
enq: XR - database force logging 2
gc current multi block request 2
enq: US - contention 2
enq: WL - contention 2
enq: FB - contention 2
SQL*Net more data from client 2
gc current grant busy 2
SQL*Net more data to dblink 2
resmgr:internal state change 3
latch: KCL gc element parent latch 3
direct path read 3
Backup: sbtclose2 4
gc current grant 2-way 4
Backup: sbtremove2 5
Streams AQ: qmn coordinator waiting for slave to start 6
gc current request 6
null event 6
gc cr block 2-way 6
enq: WF - contention 7
gc cr request 9
rdbms ipc reply 9
gc current block 2-way 10
enq: MS - contention 11
os thread startup 13
gcs drm freeze in enter server mode 15
SQL*Net more data from dblink 16
gc cr block busy 18
gc cr grant 2-way 19
CGS wait for IPC msg 23
Backup: sbtbackup 23
control file parallel write 25
gcs log flush sync 25
log file sync 26
row cache lock 27
reliable message 28
SQL*Net more data to client 30
log file sequential read 31
gc buffer busy 38
library cache pin 41
cr request retry 54
read by other session 63
control file sequential read 83
log file parallel write 98
inactive session 110
Log archive I/O 129
gc cr multi block request 166
db file parallel write 192
direct path read temp 195
db file parallel read 686
kksfbc child completion 946
DFS lock handle 1160
RMAN backup & recovery I/O 1274
latch: library cache 1486
db file scattered read 1581
Backup: sbtwrite2 1644
db file sequential read 1818
library cache lock 4989
17397
cursor: pin S wait on X 38742
73 rows selected.
SQL>
直接从这些等待事件入手查找原因是比较困难的,它们只所以等待是因为无法获取需要的handle、chunk等,这些都是内存中的东西,我们也无法理解。既然这样,换个查找方向。我们来看故障前后一段时间内,谁一直在阻塞其他会话?
select
BLOCKING_SESSION
,count(*)
from
oracle_diag_hist
where SAMPLE_TIME > to_date('2013-07-03:04:00:00','yyyy-mm-dd:hh24:mi:ss')
and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
group by BLOCKING_SESSION
;
BLOCKING_SESSION COUNT(*)
---------------- ----------
57743
73 2115
2112 84
517 3
1425 1
531 32
549 1
248 1
514 1
168 1
314 1
11 rows selected.
SQL>
通过上面的SQL,我们能看到session 73和2112阻塞其他会话的次数最多,很可疑。这些会话为什么会阻塞其他会话?难道他被阻塞了?那我们继续查谁阻塞了这些会话。
SQL> col SAMPLE_TIME for a40
SQL> col event for a30
SQL> select
2 session_id,SAMPLE_TIME,event,sql_id,blocking_session
3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 and session_id = &session_id
8 order by SAMPLE_TIME
9 ;
SQL> /
Enter value for session_id: 73
old 7: and session_id = &session_id
new 7: and session_id = 73
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM
---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------
73 03-JUL-13 01.00.01.983 AM direct path read temp 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
73 03-JUL-13 01.00.12.103 AM 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
73 03-JUL-13 01.00.22.213 AM 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
73 03-JUL-13 01.00.32.313 AM 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
73 03-JUL-13 01.00.42.423 AM 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
73 03-JUL-13 01.00.52.533 AM direct path read temp 4bzt4j1n3kbg1 1459945305 oracle@essdb5 (J002)
::::::::::::::::::::::::::::::::::
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM
---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------
73 03-JUL-13 05.59.33.376 AM latch: library cache 0 oracle@essdb5 (J002)
73 03-JUL-13 05.59.43.476 AM 4y1y43113gv8f 1333736285 oracle@essdb5 (J002)
73 03-JUL-13 05.59.53.586 AM latch: library cache b6usrg82hwsa3 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.03.706 AM DFS lock handle b6usrg82hwsa3 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.13.806 AM latch: library cache 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.23.916 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.34.022 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.44.122 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.00.54.222 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.01.04.342 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.01.14.442 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM
---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------
73 03-JUL-13 06.01.24.552 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.01.34.662 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.01.44.762 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.01.54.862 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 06.02.04.982 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
::::::::::::::::::::::::::::::::::
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM
---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------
73 03-JUL-13 08.57.25.845 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.57.35.955 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.57.46.055 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.57.56.165 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.06.285 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.16.385 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.26.495 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.36.605 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.46.705 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.58.56.815 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.59.06.931 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM
---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------
73 03-JUL-13 08.59.17.041 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
73 03-JUL-13 08.59.27.141 AM DFS lock handle 8r7dm0qvp2259 73 0 oracle@essdb5 (J002)
2847 rows selected.
SQL>
我们可以看到,会话73 6:00之后被自己阻塞,经过检查这个自阻塞一直持续到数据库被重启,这是不正常的。但该会话在6:00之前是正常的,那我们再来看一下谁阻塞了会话2112。
SQL> col SAMPLE_TIME for a40
SQL> col event for a30
SQL> select
2 session_id,SAMPLE_TIME,event,sql_id,blocking_session
3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 and session_id = &session_id
8 order by SAMPLE_TIME
9 ;
Enter value for session_id: 2112
old 7: and session_id = &session_id
new 7: and session_id = 2112
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION
---------- ---------------------------------------- ------------------------------ ------------- ----------------
2112 03-JUL-13 07.36.29.933 AM db file sequential read fyjpq29hka5bn
2112 03-JUL-13 07.46.56.595 AM fbf927kwsussx
2112 03-JUL-13 07.47.06.705 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.47.16.804 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.47.26.904 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.47.37.014 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.47.47.114 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.47.57.234 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.07.344 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.17.444 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.27.554 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.37.654 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.47.754 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.48.57.874 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.49.07.984 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.49.18.082 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.49.28.182 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.49.38.282 AM cursor: pin S wait on X 27qc21g5nsc87
2112 03-JUL-13 07.49.48.382 AM cursor: pin S wait on X 27qc21g5nsc87
::::::::
SESSION_ID SAMPLE_TIME EVENT SQL_ID BLOCKING_SESSION
---------- ---------------------------------------- ------------------------------ ------------- ----------------
2112 03-JUL-13 08.52.08.121 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.52.18.221 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.52.28.331 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.52.38.441 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.52.48.541 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.52.58.661 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.08.771 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.18.881 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.28.979 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.39.089 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.49.209 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.53.59.309 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.09.419 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.19.529 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.29.629 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.39.739 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.49.859 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.54.59.959 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.55.10.069 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.55.20.179 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.55.30.277 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.55.40.387 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.55.50.507 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.00.607 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.10.717 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.20.827 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.30.937 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.41.047 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.56.51.167 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.01.267 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.11.377 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.21.487 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.31.585 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.41.695 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.57.51.815 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.01.915 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.12.025 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.22.135 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.32.235 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.42.345 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.58.52.475 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.02.575 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.12.685 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.22.795 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.32.895 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.43.004 AM library cache lock 27qc21g5nsc87
2112 03-JUL-13 08.59.53.124 AM library cache lock 27qc21g5nsc87
435 rows selected.
SQL>
从上图我们可以看到会话2112在7点36分出现等待事件“db file sequential read”,这是正常的。但从7点47分开始等待“cursor:pin S wait on X”直到8点52分转换成“library cache lock”,注意这里的sql_id始终没变,这是不正常的。
通过两个阻塞严重的会话73、2112可以看出,2112堵塞的时间晚于73,很有可能是会话73的堵塞使得2112堵塞。那我们再来看看73阻塞了谁?SQL> col SAMPLE_TIME for a30
SQL> col event for a30
SQL> col PROGRAM for a20
SQL> select
2 session_id,event,count(*)
3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 and blocking_session = &session_id
8 group by session_id,event
9 ;
Enter value for session_id: 73
old 7: and blocking_session = &session_id
new 7: and blocking_session = 73
SESSION_ID EVENT COUNT(*)
---------- ------------------------------ ----------
508 row cache lock 3
73 DFS lock handle 1063
168 library cache lock 663
981 library cache lock 386
SQL>
会话73自阻塞,为什么73会自阻塞呢?从上面我们可以看到73阻塞是从03-JUL-13 06.00.23.916开始的,并且73是JOB,那我们来检查一下系统JOB运行状况,见下图:
SQL> col state for a10
SQL> SELECT owner,job_name,state,last_start_date,last_run_duration,failure_count
2 FROM dba_scheduler_jobs;
OWNER JOB_NAME STATE LAST_START_DATE LAST_RUN_DURATION FAILURE_COUNT
---------- ------------------------- ---------- ------------------------------------ ------------------------------ -------------
SYS AUTO_SPACE_ADVISOR_JOB SCHEDULED 03-JUL-13 10.00.02.060295 PM +08:00 +000000000 01:05:55.361158 3
SYS GATHER_STATS_JOB SCHEDULED 03-JUL-13 10.00.02.053605 PM +08:00 +000000000 07:59:58.933113 0
SYS FGR$AUTOPURGE_JOB DISABLED 0
SYS PURGE_LOG SCHEDULED 04-JUL-13 03.00.00.175794 AM +08:00 +000000000 00:00:00.415187 0
ORACLE_OCM MGMT_STATS_CONFIG_JOB SCHEDULED 01-JUL-13 01.01.01.554056 AM +08:00 +000000000 00:00:12.660441 0
ORACLE_OCM MGMT_CONFIG_JOB SCHEDULED 03-JUL-13 10.00.01.997028 PM +08:00 +000000000 00:00:00.320530 0
EXFSYS RLM$SCHDNEGACTION SCHEDULED 04-JUL-13 05.18.07.107427 PM +08:00 +000000000 00:00:00.215880 0
EXFSYS RLM$EVTCLEANUP SCHEDULED 22-OCT-12 05.35.21.909816 PM +08:00 +000000000 00:02:13.685219 1
8 rows selected.
SQL> col job_name for a20
SQL> col status for a20
SQL> col last_start_date for a36
SQL> SELECT log_id, job_name, status,
2 TO_CHAR (log_date, 'DD-MON-YYYY HH24:MI') log_date#,actual_start_date
3 FROM dba_scheduler_job_run_details
4 WHERE job_name = 'GATHER_STATS_JOB' order by actual_start_date;
LOG_ID JOB_NAME STATUS LOG_DATE# ACTUAL_START_DATE
---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------
173057 GATHER_STATS_JOB STOPPED 04-JUN-2013 06:00 03-JUN-13 10.00.12.702279 PM +08:00
173105 GATHER_STATS_JOB SUCCEEDED 05-JUN-2013 05:12 04-JUN-13 10.00.03.553269 PM +08:00
173186 GATHER_STATS_JOB STOPPED 06-JUN-2013 06:00 05-JUN-13 10.00.04.671089 PM +08:00
173246 GATHER_STATS_JOB SUCCEEDED 07-JUN-2013 03:58 06-JUN-13 10.00.01.713362 PM +08:00
173307 GATHER_STATS_JOB STOPPED 08-JUN-2013 06:00 07-JUN-13 10.00.07.335319 PM +08:00
173289 GATHER_STATS_JOB SUCCEEDED 08-JUN-2013 07:49 08-JUN-13 06.00.11.991233 AM +08:00
173429 GATHER_STATS_JOB STOPPED 11-JUN-2013 06:00 10-JUN-13 10.00.02.020636 PM +08:00
173459 GATHER_STATS_JOB STOPPED 12-JUN-2013 06:00 11-JUN-13 10.00.10.844326 PM +08:00
173488 GATHER_STATS_JOB STOPPED 13-JUN-2013 06:00 12-JUN-13 10.00.08.097010 PM +08:00
173518 GATHER_STATS_JOB STOPPED 14-JUN-2013 06:00 13-JUN-13 10.00.02.130179 PM +08:00
173589 GATHER_STATS_JOB STOPPED 15-JUN-2013 06:00 14-JUN-13 10.00.02.825754 PM +08:00
LOG_ID JOB_NAME STATUS LOG_DATE# ACTUAL_START_DATE
---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------
173595 GATHER_STATS_JOB SUCCEEDED 15-JUN-2013 07:33 15-JUN-13 06.00.12.296982 AM +08:00
173686 GATHER_STATS_JOB STOPPED 18-JUN-2013 06:00 17-JUN-13 10.00.03.440121 PM +08:00
173736 GATHER_STATS_JOB STOPPED 19-JUN-2013 06:00 18-JUN-13 10.00.01.361487 PM +08:00
173808 GATHER_STATS_JOB STOPPED 20-JUN-2013 06:00 19-JUN-13 10.00.01.680778 PM +08:00
173836 GATHER_STATS_JOB STOPPED 21-JUN-2013 06:00 20-JUN-13 10.00.04.455498 PM +08:00
173905 GATHER_STATS_JOB STOPPED 22-JUN-2013 06:00 21-JUN-13 10.00.06.883596 PM +08:00
173886 GATHER_STATS_JOB SUCCEEDED 22-JUN-2013 16:30 22-JUN-13 06.00.30.956798 AM +08:00
174012 GATHER_STATS_JOB STOPPED 25-JUN-2013 06:00 24-JUN-13 10.00.01.468786 PM +08:00
174046 GATHER_STATS_JOB STOPPED 26-JUN-2013 06:00 25-JUN-13 10.00.02.324651 PM +08:00
174114 GATHER_STATS_JOB STOPPED 27-JUN-2013 06:00 26-JUN-13 10.00.01.509654 PM +08:00
174147 GATHER_STATS_JOB STOPPED 28-JUN-2013 06:00 27-JUN-13 10.00.02.731738 PM +08:00
LOG_ID JOB_NAME STATUS LOG_DATE# ACTUAL_START_DATE
---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------
174194 GATHER_STATS_JOB STOPPED 29-JUN-2013 06:00 28-JUN-13 10.00.04.998374 PM +08:00
174208 GATHER_STATS_JOB SUCCEEDED 29-JUN-2013 12:27 29-JUN-13 06.00.04.335794 AM +08:00
174341 GATHER_STATS_JOB STOPPED 02-JUL-2013 06:00 01-JUL-13 10.00.01.313663 PM +08:00
174422 GATHER_STATS_JOB STOPPED 03-JUL-2013 10:10 02-JUL-13 10.00.01.866622 PM +08:00
174469 GATHER_STATS_JOB STOPPED 04-JUL-2013 06:00 03-JUL-13 10.00.01.981592 PM +08:00
27 rows selected.
从上图红色字体标注位置可以看出GATHER_STATS_JOB运行了8小时才结束,且未为异常结束。我们都知道,这个job默认是从晚上10点开始,如果早上6点还没有执行完毕,系统会终止,从这个时间点来看,说明这个JOB是被系统终止的。
那么针对大量事件“cursor: pin S wait on X”和“library cache lock”的产生就有两种推论:
推论1,是会话73运行在7月2日晚上10点开始运行job GATHER_STATS_JOB,7月3日凌晨6点还没运行完,被系统终止,但此时会话73出现自阻塞现象,资源没有释放,导致后来的会话拥堵,出现“cursor: pin Swait on X”和“library cache lock”内存中一些资源无法释放。如果推论成立,那么这两个事件的出现时间应该晚于6点。
推论2,是系统在这个时间段本身就会有一些“cursor: pin Swait on X”和“library cache lock”事件,后因未即使处理堆积引起。
针对推论2,我查过7月1日、2日相同时间段的等待事件,未发现“cursor:pin S wait on X”和“library cache lock”等待事件,基本可以论证不是因为突发的事件堆积引起。
针对推论1,我们来检查这两个事件发生的时间点:SQL> select
min(SAMPLE_TIME)
2 3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 and event = 'cursor: pin S wait on X'
8 ;
MIN(SAMPLE_TIME)
---------------------------------------------------------------------------
03-JUL-13 07.08.46.880 AM
SQL>
SQL> col PROGRAM for a30
SQL> select
2 SAMPLE_TIME,session_id,event,blocking_session,PROGRAM
3 from
4 oracle_diag_hist
5 where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')
6 and SAMPLE_TIME < to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')
7 and event = 'library cache lock'
8 order by SAMPLE_TIME
9 ;
SAMPLE_TIME SESSION_ID EVENT BLOCKING_SESSION PROGRAM
------------------------------ ---------- ------------------------------ ---------------- ------------------------------
03-JUL-13 01.00.12.103 AM 998 library cache lock datatofile@XHM-3ES- (TNS V1-V3)
03-JUL-13 01.15.11.768 AM 527 library cache lock oracle@essdb6 (CJQ0)
03-JUL-13 01.48.33.296 AM 574 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 01.49.03.590 AM 1406 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 01.49.03.590 AM 1526 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 01.49.03.636 AM 554 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 01.49.03.636 AM 2037 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 02.00.20.845 AM 1987 library cache lock rman@essdb5 (TNS V1-V3)
03-JUL-13 02.17.11.497 AM 527 library cache lock oracle@essdb6 (CJQ0)
03-JUL-13 03.05.52.789 AM 1058 library cache lock
03-JUL-13 03.27.26.550 AM 210 library cache lock oracle@essdb3 (TNS V1-V3)
03-JUL-13 03.34.00.749 AM 445 library cache lock oracle@essdb5 (J007)
03-JUL-13 03.57.54.474 AM 527 library cache lock oracle@essdb6 (CJQ0)
03-JUL-13 04.04.19.862 AM 726 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 04.04.19.862 AM 1425 library cache lock oracle@essdb1 (TNS V1-V3)
03-JUL-13 04.11.12.802 AM 1527 library cache lock
03-JUL-13 04.19.48.146 AM 527 library cache lock oracle@essdb6 (CJQ0)
03-JUL-13 04.26.44.162 AM 1083 library cache lock sqlldr@XHM-3ES- (TNS V1-V3)
03-JUL-13 04.57.13.688 AM 73 library cache lock 1425 oracle@essdb5 (J002)
03-JUL-13 05.09.21.366 AM 1785 library cache lock sqlldr@XHM-3ES- (TNS V1-V3)
03-JUL-13 05.33.16.602 AM 750 library cache lock oracle@essdb3 (TNS V1-V3)
03-JUL-13 06.01.51.732 AM 886 library cache lock
03-JUL-13 06.02.01.832 AM 886 library cache lock
03-JUL-13 06.02.11.932 AM 886 library cache lock
03-JUL-13 06.02.22.032 AM 886 library cache lock
03-JUL-13 06.02.32.132 AM 886 library cache lock
03-JUL-13 06.02.42.242 AM 886 library cache lock
03-JUL-13 06.02.52.362 AM 886 library cache lock
03-JUL-13 06.03.02.462 AM 886 library cache lock
:::::::
03-JUL-13 07.07.01.599 AM 886 library cache lock
03-JUL-13 07.07.11.697 AM 886 library cache lock
03-JUL-13 07.07.15.933 AM 168 library cache lock 73 JDBC Thin Client
03-JUL-13 07.07.21.797 AM 886 library cache lock
03-JUL-13 07.07.26.043 AM 168 library cache lock 73 JDBC Thin Client
03-JUL-13 07.07.31.897 AM 886 library cache lock
03-JUL-13 07.07.36.153 AM 168 library cache lock 73 JDBC Thin Client
03-JUL-13 07.07.42.007 AM 886 library cache lock
03-JUL-13 07.07.46.253 AM 168 library cache lock 73 JDBC Thin Client
03-JUL-13 07.07.52.127 AM 886 library cache lock
03-JUL-13 07.07.56.353 AM 168 library cache lock 73 JDBC Thin Client
03-JUL-13 07.08.02.227 AM 886 library cache lock
::::
03-JUL-13 08.45.54.136 AM 759 library cache lock JDBC Thin Client
03-JUL-13 08.45.54.136 AM 831 library cache lock JDBC Thin Client
03-JUL-13 08.45.54.136 AM 886 library cache lock 2112
03-JUL-13 08.45.54.136 AM 915 library cache lock JDBC Thin Client
03-JUL-13 08.45.54.136 AM 1512 library cache lock JDBC Thin Client
03-JUL-13 08.45.54.136 AM 1913 library cache lock JDBC Thin Client
从上图中红色字体标注部分可以看出事件“cursor: pin S wait on X”最早出现的时间为8点46分,晚于会话73的自堵塞时间点。事件“library cache lock”在5点33分之前就有,但都比较短暂,而在6点01分后才频繁出现,最后还出现一些会话又被会话73堵塞,频繁出现的时间点也晚于会话73的自堵塞时间点。因此推论1基本成立,本次故障是由会话73死亡后资源未释放,产生自阻塞导致。
我们都知道在ORACLE中清理死亡会话的工作是由PMON进程负责的,至于本次故障时刻,是否是会话73已死亡而PMON进程没有去清理该死亡进程资源未释放导致?还是会话73本身就是活动的,此时只是因为无法得到“DFS lock handle”而自锁导致,仅根据现有的信息是无法得知的,还需要在故障时刻收集一些额外的信息才可能理清原委。从目前仅有的信息来看,该故障很可能是由会话73的自堵塞导致。
应急办法:
故障时刻尝试kill对应的job进程,例如:kill -9 “会话73对应的PID”,再观察数据库状态。