A few days ago one of our databases ran into a performance problem. The root cause turned out not to be the database itself, but I am writing it down in the hope of saving you time if you hit the same symptoms. The environment is 11.2.0.3 RAC, 2 nodes on HP-UX with EMC storage. Starting around 16:40 the middleware occasionally saw brief bursts of business backlog, and during the backlog most database sessions were running a simple INSERT into the same table. The batch job was then stopped and no obvious backlog recurred. Since the issue was reported to me on a weekend evening and the batch was not restarted, the only remaining symptom was a middleware log message about sessions being killed, and I was asked to find out why. A colleague sent me this excerpt of the DB alert log:
# DB ALERT
Sun Jul 19 16:32:11 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512083):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
Sun Jul 19 16:32:13 2015
Sweep [inc][512083]: completed
Sun Jul 19 16:32:13 2015
Sweep [inc2][512083]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_m000_29275_i512083_a.trc
DIA0 terminating blocker (ospid: 4073 sid: 7829 ser#: 4119) of hang with ID = 223
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating the process
DIA0 successfully terminated process ospid:4073.
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512084):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512084/weejar2_dia0_2471_i512084.trc
Sun Jul 19 16:32:34 2015
Sweep [inc][512084]: completed
Sun Jul 19 16:32:35 2015
DIA0 terminating blocker (ospid: 16104 sid: 2287 ser#: 58237) of hang with ID = 225
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid: 2287 ospid: 16104
DIA0 successfully terminated session sid:2287 ospid:16104 with status 31.
Sun Jul 19 16:33:13 2015
Sweep [inc2][512084]: completed
Sun Jul 19 16:33:50 2015
WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1 #<<<<<<<<<<<<<<
ARC2: Detected ARCH process failure
ARC2: STARTING ARCH PROCESSES
Sun Jul 19 16:34:15 2015
ARC0 started with pid=34, OS id=3953
Sun Jul 19 16:34:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512085):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512085/weejar2_dia0_2471_i512085.trc
ARC0: Archival started
ARC2: STARTING ARCH PROCESSES COMPLETE
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance weejar2 - Archival Error
ORA-16014: log 11 sequence# 12803 not archived, no available destinations
ORA-00312: online log 11 thread 2: '/dev/anbob_oravg02/ranbob_redo11'
Sun Jul 19 16:34:17 2015
Sweep [inc][512085]: completed
...
... [alert log truncated]
...
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC089
is local tran 414.10.43815 (hex=19e.0a.ab27))
delete pending collecting tran, scn=14464580264119 (hex=d27.cc2b20b7)
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC08E
is local tran 125.19.296884 (hex=7d.13.487b4))
delete pending collecting tran, scn=14464580264148 (hex=d27.cc2b20d4)
Sun Jul 19 17:32:17 2015
Thread 2 advanced to log sequence 12808 (LGWR switch)
Current log# 10 seq# 12808 mem# 0: /dev/anbob_oravg02/ranbob_redo10
Sun Jul 19 17:32:25 2015
Archived Log entry 22618 added for thread 2 sequence 12807 ID 0x17418ab3 dest 1:
Sun Jul 19 17:37:00 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_16512.trc (incident=512088):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_dia0_16512_i512088.trc
Sun Jul 19 17:37:02 2015
Sweep [inc][512088]: completed
Sun Jul 19 17:37:02 2015
Sweep [inc2][512088]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_m000_10298_i512088_a.trc
DIA0 terminating blocker (ospid: 29778 sid: 9592 ser#: 20163) of hang with ID = 374
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid: 9592 ospid: 29778
DIA0 successfully terminated session sid:9592 ospid:29778 with status 31.
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_7290.trc (incident=516793):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_516793/weejar2_ora_7290_i516793.trc
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_5983.trc (incident=525073):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back
NOTE:
Note that the alert log does show ORA-32701 hung-process incidents from that point in time, and from the excerpt below we can judge that there must have been heavy or abnormal I/O at the moment. Since it appeared only once in the log, I did not dig into it further at that point:
“WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1
ARC2: Detected ARCH process failure”
What we do see is the DIA0 process resolving hangs by terminating processes. DIA0 is the background process of Hang Manager (HM), a new feature in 11g: in a RAC environment it periodically collects information about hung processes and can resolve some hangs on its own, governed by hidden parameters; see the official documentation for details on HM. The later distributed-transaction rollbacks and the ORA-600 [qerltcUserIterGet_1] / ORA-24761 errors are also fallout from the hung sessions being killed by HM.
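The hidden parameters governing HM can be inspected as SYS with a query along these lines. This is a sketch against the unsupported x$ksppi/x$ksppcv internal views (roughly what the @p helper script used further below does):

```sql
-- Sketch: list Hang Manager related hidden parameters.
-- Requires SYS; x$ksppi/x$ksppcv are internal, unsupported views.
select i.ksppinm  name,
       v.ksppstvl value
  from x$ksppi  i,
       x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm like '\_hang%' escape '\'
 order by 1;
```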
# hang trace file
Dump file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
..
*** 2015-07-19 16:32:11.494
Resolvable Hangs in the System
Root Chain Total Hang
Hang Hang Inst Root #hung #hung Hang Hang Resolution
ID Type Status Num Sess Sess Sess Conf Span Action
----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
223 HANG RSLNPEND 2 7829 2 19 HIGH LOCAL Terminate Process
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
inst# SessId Ser# OSPID PrcNm Event
----- ------ ----- --------- ----- -----
2 300 2745 5568 FG read by other session
2 7829 4119 4073 FG db file sequential read
Dumping process info of pid[2767.4073] (sid:7829, ser#:4119)
requested by master DIA0 process on instance 1.
*** 2015-07-19 16:32:11.494
Process diagnostic dump for oracle@qdanbob2, OS id=4073,
pid: 2767, proc_ser: 21, sid: 7829, sess_ser: 4119
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 16:32:11 ]
NOTE: scheduling delay has not been sampled for 0.224630 secs
0.000657 secs from [ 16:32:07 - 16:32:12 ], 5 sec avg
0.000455 secs from [ 16:31:12 - 16:32:12 ], 1 min avg
0.000284 secs from [ 16:27:11 - 16:32:12 ], 5 min avg
*** 2015-07-19 16:32:11.907
loadavg : 0.17 0.16 0.17
Swapinfo :
Avail = 504818.16Mb Used = 170655.53Mb
Swap free = 334112.25Mb Kernel rsvd = 17269.27Mb
Free Mem = 241739.08Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
3401 S grid 4073 1 0 178 20 e000000feb8d5680 97832 e00000103c925097 Jul 10 ? 13:13 oracleweejar2 (LOCAL=NO)
Short stack dump:
ksedsts()+544
0
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_1 string location=/anbob2_arch
oracle@qdanbob2:/home/oracle> mount
...
/anbob2_arch on /dev/anbob_arch/fslvanbob_arch2 ioerror=mwdisable,largefiles,delaylog,dev=40120002 on Fri Jun 19 00:48:11 2015
/anbob1_arch on qdanbob1:/anbob1_arch noac,forcedirectio,rsize=32768,wsize=32768,NFSv3,dev=4000003 on Fri Jun 19 00:58:59 2015
SQL>@p hang
NAME VALUE
---------------------------------------- ----------------------------------------
_ksv_pool_hang_kill_to 0
_hang_analysis_num_call_stacks 3
_local_hang_analysis_interval_secs 3
_hang_msg_checksum_enabled TRUE
_hang_detection_enabled TRUE #<<<<<<<<<<< HM enabled
_hang_detection_interval 32
_hang_resolution_scope PROCESS #<<<<<<<<<<<< HM will terminate hung (non-background) processes on its own
_hang_resolution_policy HIGH
_hang_resolution_confidence_promotion FALSE
_hang_resolution_global_hang_confidence_ FALSE
...
37 rows selected.
Note:
This RAC has HM enabled, and HM is allowed to kill some non-core processes on its own. The hang trace shows the root blocker stuck on an I/O wait, which is what hung the process, and the SQL it was executing was the user's INSERT. So we have found the killer of those sessions. Also note that the archive destination reporting errors in the alert log above is a local storage path (not NFS).
The next day, as soon as I came in around 09:00, the business backlog had started again, and it was still the same INSERT as the day before, so I checked this SQL's workload over the last few days:
Per-Plan Execution Statistics Over Time
Avg Avg
Plan Snapshot Avg LIO Avg PIO CPU (secs) Elapsed (secs)
Hash Value Time Execs Per Exec Per Exec Per Exec Per Exec
---------- ------------ -------- ------------------- ------------------- ------------------- -------------------
18-JUL 07:00 81,954 27.60 0.68 0.00 0.01
18-JUL 08:00 164,767 27.44 0.52 0.00 0.00
18-JUL 09:00 148,988 27.89 0.57 0.00 0.00
18-JUL 10:00 152,462 27.89 0.55 0.00 0.00
18-JUL 11:00 133,345 27.51 0.57 0.00 0.00
18-JUL 12:00 144,135 27.71 0.57 0.00 0.00
18-JUL 13:00 126,773 27.66 0.64 0.00 0.00
18-JUL 14:00 134,799 27.53 0.53 0.00 0.00
18-JUL 15:00 125,364 27.89 0.57 0.00 0.00
18-JUL 16:00 129,116 27.79 0.56 0.00 0.00
18-JUL 17:00 131,873 27.86 0.58 0.00 0.00
18-JUL 18:00 125,614 27.64 0.55 0.00 0.00
18-JUL 19:00 126,923 27.77 0.56 0.00 0.00
18-JUL 20:00 123,668 27.86 0.60 0.00 0.00
18-JUL 21:00 112,380 27.58 0.52 0.00 0.00
18-JUL 22:00 92,602 27.59 0.43 0.00 0.00
18-JUL 23:00 53,483 27.53 0.43 0.00 0.00
19-JUL 06:00 39,651 27.83 0.69 0.00 0.01
19-JUL 07:00 75,884 27.77 0.69 0.00 0.01
19-JUL 08:00 172,000 27.51 0.54 0.00 0.00
19-JUL 09:00 148,931 27.67 0.60 0.00 0.00
19-JUL 10:00 153,295 27.64 0.54 0.00 0.00
19-JUL 11:00 125,487 27.73 0.58 0.00 0.00
19-JUL 12:00 136,311 27.84 0.59 0.00 0.00
19-JUL 13:00 114,741 27.52 0.63 0.00 0.00
19-JUL 14:00 104,638 27.64 0.55 0.00 0.00
19-JUL 15:00 134,834 27.65 0.57 0.00 0.00
19-JUL 16:00 124,629 27.72 0.57 0.00 0.39
19-JUL 17:01 134,321 28.35 0.62 0.00 0.39
19-JUL 18:00 125,256 27.75 0.56 0.00 0.14
19-JUL 19:00 126,852 28.61 0.61 0.00 0.32
19-JUL 20:00 133,179 28.30 0.62 0.00 0.13
19-JUL 21:00 109,673 28.19 0.59 0.00 0.21
19-JUL 22:00 84,624 27.73 0.49 0.00 0.07
19-JUL 23:00 47,534 27.51 0.48 0.00 0.05
20-JUL 00:00 20,786 28.88 0.60 0.00 1.87
20-JUL 01:00 20,900 27.69 0.72 0.00 0.50
20-JUL 02:00 8,399 28.67 0.77 0.00 1.34
20-JUL 03:00 7,174 28.26 0.75 0.00 0.35
20-JUL 04:00 7,798 38.87 0.72 0.00 6.86
20-JUL 05:00 16,600 30.29 0.74 0.00 1.59
20-JUL 06:00 41,540 28.78 0.72 0.00 1.46
20-JUL 07:00 77,297 28.04 0.72 0.00 0.72
20-JUL 08:00 495,309 30.34 0.53 0.00 0.85 #<<<<<<<<<<<<<<<<<<
********** -------- ------------------- ------------------- ------------------- -------------------
avg 26.42 0.50 0.00 0.06
sum ########
Note:
From AWR (DBA_HIST_SQLSTAT) we can confirm that executions between 08:00 and 09:00 were about 300,000 higher than before, and that per-execution response time had grown (though still under a second) starting from the previous day's 16:00 snapshot. I asked the storage engineers to verify that the array was healthy, and the business side confirmed they were running the batch load again. My judgment at the time was that sheer volume had tipped things over, so I had them stop the batch first. As it later turned out, the batch was only masking the real cause; it was precisely the massive inserts and per-row commits that made the underlying problem show up.
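The per-snapshot statistics above came from AWR; a query along these lines against DBA_HIST_SQLSTAT reproduces that kind of report (a sketch; the bind :sqlid is a placeholder for the SQL in question, and the delta columns are per snapshot interval):

```sql
-- Sketch: per-snapshot execution statistics for one SQL_ID from AWR
select to_char(s.begin_interval_time, 'DD-MON HH24:MI') snap_time,
       q.plan_hash_value,
       q.executions_delta                                        execs,
       round(q.buffer_gets_delta  / nullif(q.executions_delta, 0), 2) lio_per_exec,
       round(q.disk_reads_delta   / nullif(q.executions_delta, 0), 2) pio_per_exec,
       round(q.cpu_time_delta     / nullif(q.executions_delta, 0) / 1e6, 2) cpu_per_exec,
       round(q.elapsed_time_delta / nullif(q.executions_delta, 0) / 1e6, 2) ela_per_exec
  from dba_hist_sqlstat  q,
       dba_hist_snapshot s
 where q.snap_id         = s.snap_id
   and q.dbid            = s.dbid
   and q.instance_number = s.instance_number
   and q.sql_id          = :sqlid
 order by s.begin_interval_time;
```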
However, even after the batch was stopped shortly after 09:00, another brief backlog appeared at 10:00.
SQL> create table ash0720 tablespace users as select * from v$active_session_history ;
Table created.
SQL> select * from (
2 select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn
3 from (
4 select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*) dbtime
5 from ash0720
6 --where sample_time between to_date('2015-3-18 10:00','yyyy-mm-dd hh24:mi') and to_date('2015-3-18 11:00','yyyy-mm-dd hh24:mi')
7 group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event
8 )
9 ) where rn<=10;
ETIME EVENTS DBTIME PCT RN
-------------- ---------------------------------------------- ---------- ---------- ----------
20150720 10:20 log file sync 19782 33 1
20150720 10:20 gc buffer busy release 12219 20.39 2
20150720 10:20 buffer busy waits 9621 16.05 3
20150720 10:20 enq: TX - index contention 7034 11.74 4
20150720 10:20 enq: SQ - contention 6303 10.52 5
20150720 10:20 read by other session 1153 1.92 6
20150720 10:20 db file sequential read 1061 1.77 7
20150720 10:20 inactive transaction branch 971 1.62 8
20150720 10:20 enq: TX - row lock contention 832 1.39 9
20150720 10:20 on cpu 219 .37 10
20150720 10:30 enq: TX - index contention 93536 43.91 1 #<<<<<<<<<<<<<<<
20150720 10:30 db file sequential read 47243 22.18 2
20150720 10:30 read by other session 21236 9.97 3
20150720 10:30 log file sync 12610 5.92 4
20150720 10:30 gc buffer busy release 7793 3.66 5
20150720 10:30 buffer busy waits 6061 2.85 6
20150720 10:30 enq: SQ - contention 5360 2.52 7
20150720 10:30 on cpu 5332 2.5 8
20150720 10:30 enq: TX - row lock contention 4271 2 9
20150720 10:30 inactive transaction branch 2698 1.27 10
20150720 10:40 read by other session 26429 40.58 1
20150720 10:40 db file sequential read 24469 37.57 2
20150720 10:40 on cpu 5291 8.12 3
20150720 10:40 db file parallel write 2136 3.28 4
20150720 10:40 SQL*Net message from dblink 1162 1.78 5
20150720 10:40 inactive transaction branch 1097 1.68 6
20150720 10:40 SQL*Net more data from dblink 1054 1.62 7
20150720 10:40 inactive session 698 1.07 8
20150720 10:40 enq: TX - row lock contention 540 .83 9
20150720 10:40 db file scattered read 472 .72 10
20150720 10:50 db file sequential read 24371 50.1 1
20150720 10:50 read by other session 11212 23.05 2
20150720 10:50 on cpu 5499 11.3 3
20150720 10:50 db file parallel write 1817 3.74 4
20150720 10:50 enq: TX - index contention 1596 3.28 5
20150720 10:50 SQL*Net more data from dblink 1191 2.45 6
20150720 10:50 SQL*Net message from dblink 976 2.01 7
20150720 10:50 inactive transaction branch 276 .57 8
20150720 10:50 enq: TX - row lock contention 246 .51 9
20150720 10:50 inactive session 196 .4 10
...
Note:
Load spiked between 10:30 and 10:40 on 2015-07-20.
SQL> select * from (
2 SELECT
3 h.event "Wait Event",
4 SUM(h.wait_time + h.time_waited)/1000000 "Total Wait Time"
5 FROM
6 ash0720 h,
7 v$event_name e
8 WHERE
9 sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
10 AND h.event_id = e.event_id
11 AND e.wait_class <>'Idle'
12 GROUP BY h.event
13 ORDER BY 2 DESC)
14 where rownum <10;
Wait Event Total Wait Time
---------------------------------------------------------------- ---------------
enq: TX - index contention 78846.1543
log file sync 56877.8177
db file sequential read 43827.3883
read by other session 23515.1048
enq: SQ - contention 10085.8243
gc buffer busy release 9988.80736
buffer busy waits 6469.06895
enq: TX - row lock contention 4129.69438
inactive transaction branch 2714.48481
9 rows selected.
Note:
From the top events this still looks insert-related: enq: SQ - contention, enq: TX - index contention, log file sync and buffer busy waits, plus two clearly I/O-bound events. gc buffer busy release is usually tied to slow log file writes or to network problems. Of course these are still subjective, experience-based judgments; we need data to prove the cause.
Next, follow the wait chain starting from 'enq: TX - index contention':
SQL> select sql_id,count(*)
2 from ash0720
3 where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
4 and event='enq: TX - index contention' group by sql_id order by 2 desc;
SQL_ID COUNT(*)
-------------------- ----------
b1kh15rh5d2h9 33113
d71tkyh0b2q1y 21026
8nhv049t5zzh6 12854
2p8uq2t4b5v6r 8554
29rwby2p0367t 2786
...
SQL> select sql_id,blocking_session,count(*)
2 from ash0720
3 where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
4 and event='enq: TX - index contention'
5 and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
6 group by sql_id,blocking_session
7 order by 1,3 desc;
SQL_ID BLOCKING_SESSION COUNT(*)
-------------------- ---------------- ----------
8nhv049t5zzh6 7881 12736
8nhv049t5zzh6 10634 62
8nhv049t5zzh6 53
8nhv049t5zzh6 3026 3
b1kh15rh5d2h9 532 18733
b1kh15rh5d2h9 8317 8996
b1kh15rh5d2h9 8801 5148
b1kh15rh5d2h9 9340 224
b1kh15rh5d2h9 7
b1kh15rh5d2h9 8621 3
b1kh15rh5d2h9 6628 1
b1kh15rh5d2h9 6817 1
d71tkyh0b2q1y 6538 15560
d71tkyh0b2q1y 11553 5372
d71tkyh0b2q1y 1019 85
d71tkyh0b2q1y 9
16 rows selected.
select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and event='enq: TX - index contention'
and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
and session_id in (7881,532,8317,6538,11553) group by session_id,blocking_session
SESSION_ID BLOCKING_SESSION CNT
---------- ---------------- ----------
532 8801 32
8317 532 5
11553 6538 53
select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and session_id in (8801,6538) group by session_id,blocking_session
SESSION_ID BLOCKING_SESSION CNT
---------- ---------------- ----------
6538 233
6538 6276 32
8801 6276 32
select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and session_id in (6276,6538) group by session_id,blocking_session
SESSION_ID BLOCKING_SESSION CNT
---------- ---------------- ----------
6276 234
6538 233
6538 6276 32
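The manual chain-walking above (waiter to blocker, repeated until an unblocked session is reached) can be automated. A minimal Python sketch, using hypothetical edges that mirror the session ids from the ASH aggregates above:

```python
# Sketch: follow blocking_session links from ASH rows to the chain root.
# The edge data below is hypothetical, mirroring the session ids seen in
# the queries above (most-frequent blocker per waiting session).

def root_blocker(session_id, blockers):
    """Walk session -> blocking_session links until reaching a session
    that is not itself blocked (the root of the wait chain)."""
    seen = set()
    while session_id in blockers and session_id not in seen:
        seen.add(session_id)           # cycle guard for deadlocked chains
        session_id = blockers[session_id]
    return session_id

blockers = {532: 8801, 8317: 532, 11553: 6538, 8801: 6276, 6538: 6276}

print(root_blocker(8317, blockers))    # chain 8317 -> 532 -> 8801 -> 6276
```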
This confirms that the root of the blocking chain is session 6276. Next, what was session 6276 doing at the time?
SQL> l
1 select machine,program,event,wait_time ,time_waited,sql_id,SESSION_STATE,CURRENT_OBJ#
2 from ash0720
3 where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
4* and session_id in (6276)
SQL> /
MACHINE PROGRAM EVENT WAIT_TIME TIME_WAITED SQL_ID SESSION CURRENT_OBJ#
---------- -------------------------------- ------------------------- ---------- ----------- ------------- ------- ------------
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 1487 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 1367 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 4800 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 2006 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 1198 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 2812 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 3939 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 5072 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 6973 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 6783 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 1225 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 3826 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 3154 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) 5070 0 ON CPU -1
qdanbob2 oracle@qdanbob2 (LGWR) 21949 0 ON CPU -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 9892 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 28703 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 436821 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 3153 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 1487 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 205371 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 5429 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 5180 WAITING -1
qdanbob2 oracle@qdanbob2 (LGWR) log file parallel write 0 134327524 WAITING -1
...
234 rows selected.
Note:
Truncated here for length. The root blocker turns out to be the LGWR background process, waiting on 'log file parallel write' with very large wait times. 'log file parallel write' is posted while LGWR waits for the redo write to be acknowledged by the disk after a transaction commits. Remembering the archive-log I/O error from the previous evening, I suspected slow I/O. I notified the storage team in the morning; I had no root password until noon, and after logging in as root the OS log confirmed storage-layer errors.
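Independently of the OS log, the I/O-latency suspicion can be cross-checked from inside the database: v$event_histogram breaks a wait event down by latency bucket. A sketch for LGWR's write event; on a healthy array most waits land in the low-millisecond buckets, and a long tail points at the I/O path:

```sql
-- Sketch: latency distribution of LGWR's redo write event since startup
select wait_time_milli, wait_count
  from v$event_histogram
 where event = 'log file parallel write'
 order by wait_time_milli;
```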
#OS log /var/adm/message
Jul 19 01:05:01 qdanbob2 vmunix:
Jul 19 01:05:03 qdanbob2 above message repeats 6 times
Jul 19 01:05:03 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064320150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064420150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064520150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064620150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064720150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 05:28:35 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 05:28:37 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 16:25:15 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:15 qdanbob2 vmunix: The diagnostic logging facility has started receiving excessive
Jul 19 16:25:15 qdanbob2 vmunix: errors. The error entries will be lost until the cause of
Jul 19 16:25:15 qdanbob2 vmunix: the excessive error logging is corrected.
Jul 19 16:25:15 qdanbob2 vmunix: Use oserrlogd(1M) man page for further details.
Jul 19 16:25:20 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:20 qdanbob2 vmunix: The diagnostic logging facility is no longer receiving excessive
Jul 19 16:25:20 qdanbob2 vmunix: errors . 9 error entries were lost.
Jul 19 16:25:23 qdanbob2 vmunix: The diagnostic logging facility has started receiving excessive
Jul 19 16:25:23 qdanbob2 vmunix: errors. The error entries will be lost until the cause of
Jul 19 16:25:23 qdanbob2 vmunix: the excessive error logging is corrected.
Jul 19 16:25:23 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:23 qdanbob2 vmunix: Use oserrlogd(1M) man page for further details.
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: Path lunpath413 to 000290102087 is dead.Killing bus 6 to Symmetrix 000290102087 port 8aA.Path lunpath320 to 000290102087 is dead.Path lunpath327 to 0
00290102087 is dead.Path lunpath325 to 000290102087 is dead.Path lunpath326 to 000290102087 is dead.Pat
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: h lunpath338 to 000290102087 is dead.Path lunpath337 to 000290102087 is dead.Path lunpath330 to 000290102087 is dead.Path lunpath345 to 000290102087 is d
ead.Path lunpath332 to 000290102087 is dead.Path lunpath351 to 000290102087 is dead.Path lunpath339 to
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: 000290102087 is dead.Path lunpath329 to 000290102087 is dead.Path lunpath323 to 000290102087 is dead.Path lunpath356 to 000290102087 is dead.Path lunpath
328 to 000290102087 is dead.Path lunpath331 to 000290102087 is dead.Path lunpath321 to 000290102087 is
...
407 to 000290102087 is dead.Path lunpath402 to 000290102087 is dead.Path lunpath311 to 000290102087 is
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: dead.Path lunpath410 to 000290102087 is dead.Path lunpath408 to 000290102087 is dead.Path lunpath400 to 000290102087 is dead.Path lunpath409 to 000290102
087 is dead.Path lunpath310 to 000290102087 is dead.Path lunpath419 to 000290102087 is dead.Path lunpat
...
Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2004
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xd000120)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2005
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xcb250200)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:24:50 qdanbob2 vmunix: The diagnostic logging facility has started receiving excessive
Jul 20 14:24:50 qdanbob2 vmunix: errors. The error entries will be lost until the cause of
Jul 20 14:24:50 qdanbob2 vmunix: the excessive error logging is corrected.
Jul 20 14:24:50 qdanbob2 vmunix: Use oserrlogd(1M) man page for further details.
Jul 20 14:24:50 qdanbob2 vmunix: class : disk, instance 2006
Jul 20 14:24:50 qdanbob2 vmunix: All available lunpaths of a LUN (dev=0xcb250300)
Jul 20 14:24:50 qdanbob2 vmunix: have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix: condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix: 1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:51 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:25:01 qdanbob2 vmunix: class : tgtpath, instance 6
Jul 20 14:25:01 qdanbob2 vmunix: Target path (class=tgtpath, instance=6) has gone online. The target path h/w path is 41/0/2/2/0/0/0.0x5006048452a6d1c7
Jul 20 14:25:01 qdanbob2 vmunix:
Jul 20 14:25:01 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:25:01 qdanbob2 vmunix: The diagnostic logging facility is no longer receiving excessive
Jul 20 14:25:01 qdanbob2 vmunix: errors . 2 error entries were lost.
Jul 20 14:28:42 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:30:37 qdanbob2 vmunix: emcp:Mpx:Info: h374 to 000290102087 is dead.Path lunpath248 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath279 to 000290102087 is alive.Path
lunpath279 to 000290102087 is dead.Path lunpath409 to 000290102087 is alive.Path lunpath369 to 000290
Jul 20 14:34:06 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:35:11 qdanbob2 vmunix: class : lunpath, instance 374
Jul 20 14:35:11 qdanbob2 vmunix: lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has gone offline. The lunpath hwpath is 41/0/2/2/0/0/0.0x5006048452a6d1c7.0x411200000
0000000
Jul 20 14:35:11 qdanbob2 vmunix:
Jul 20 14:35:50 qdanbob2 vmunix: class : lunpath, instance 374
Jul 20 14:35:50 qdanbob2 vmunix: lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has come online
Jul 20 14:35:50 qdanbob2 vmunix:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error: 102087 is alive.Path lunpath243 to 000290102087 is alive.Path lunpath243 to 000290102087 is dead.Killing bus 6 to Symmetrix 000290102087 port 8aA.Pat
h lunpath409 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath288 to 0002901
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:
This confirms that the storage had been having intermittent problems since around 16:20 the previous day. At this point the storage team finally confirmed that one of the storage paths was faulty.
qdanbob2[/var/adm]#powermt display
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary Total Dead IO/Sec Q-IOs Errors
==============================================================================
0 41/0/0/2/0/0/0 optimal 194 0 - 2 0
6 41/0/2/2/0/0/0 degraded 16 12 - 0 10 <<<<<
8 44/0/0/2/0/0/0 optimal 194 0 - 3 0
42 44/0/2/2/0/0/0 optimal 194 0 - 1 0
After storage path #6 was disabled, the database returned to normal and I/O latency dropped markedly, so the issue was finally pinned on storage. Because the problem was intermittent and AWR snapshots are hourly, I had not looked at AWR at first; in hindsight the average I/O wait times in the AWR top events were also noticeably high, and physical-write plus physical-read IOPS are another useful reference. A second powermt display after the path was disabled:
qdanbob2[/var/adm]#powermt display
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary Total Dead IO/Sec Q-IOs Errors
==============================================================================
0 41/0/0/2/0/0/0 optimal 194 0 - 2 0
6 41/0/2/2/0/0/0 failed 16 16 - 0 0 <<<<<<<<
8 44/0/0/2/0/0/0 optimal 194 0 - 2 0
42 44/0/2/2/0/0/0 optimal 194 0 - 1 0
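Checks like this review of powermt output are easy to script instead of eyeballing. A small Python sketch that flags any HBA whose Summary column is not 'optimal'; the sample text is a trimmed, hypothetical copy of the first display above:

```python
# Sketch: parse `powermt display` HBA rows and flag non-optimal paths.

def degraded_hbas(powermt_output):
    """Return (hba_id, summary, total, dead) for each HBA whose
    Summary column is not 'optimal'."""
    bad = []
    for line in powermt_output.splitlines():
        parts = line.split()
        # data rows look like: "6  41/0/2/2/0/0/0  degraded  16  12  -  0  10"
        if len(parts) >= 5 and parts[0].isdigit() and parts[3].isdigit():
            hba, summary = int(parts[0]), parts[2]
            total, dead = int(parts[3]), int(parts[4])
            if summary != 'optimal':
                bad.append((hba, summary, total, dead))
    return bad

sample = """\
  0 41/0/0/2/0/0/0  optimal    194      0      -     2      0
  6 41/0/2/2/0/0/0  degraded    16     12      -     0     10
  8 44/0/0/2/0/0/0  optimal    194      0      -     2      0
"""
print(degraded_hbas(sample))   # -> [(6, 'degraded', 16, 12)]
```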
Summary:
The whole picture is now clear. The storage path began failing intermittently at around 16:30 the previous day. When the batch job flooded in rows with an INSERT and a commit per row, the storage problem made the I/O-related wait events very prominent: log file sync, index block splits (enq: TX - index contention), gc buffer waits, read by other session, and so on. After disabling the one bad path out of the four, the database returned to normal. The follow-up action is to strengthen monitoring of the storage.