4.29上午,生产环境数据库用户无法登陆,提示归档日志错误,只能以超级管理员登陆,但是关于日志的相关操作都hang住,由于时间紧迫,未做hanganzlyze操作,正常停库过程也hang,只能以shutdown abort停掉实例,然后在重新启库过程中正常,未遇到错误提示,数据库正常打开,操作暂时恢复,持续有10分钟左右,中间日志切换无反应,用户登录又hang,于是再做shutdown abort->startup,过5分钟后数据库正常,linux系统,11.2.0.4版本数据库,那么接下来分析问题:
1、 根据数据库警告日志:
TueApr 28 18:42:53 2015
Errors in file/u01/app/oracle/diag/rdbms/iokbl/iokbl/trace/iokbl_lgwr_128898.trc:
ORA-27152: attempt topost process failed
ORA-27300: OS system dependentoperation:semop failed with status: 7
ORA-27301: OS failuremessage: Argument list too long
ORA-27302: failureoccurred at: sskgpwvp2
ORA-27303: additionalinformation: old_semid post_count 0x158003 39
LGWR (ospid: 128898):terminating the instance due to error 27152
Tue Apr 28 18:42:53 2015
System state dump requested by (instance=1,osid=128898 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file/u01/app/oracle/diag/rdbms/iokbl/iokbl/trace/iokbl_diag_128882_20150428184253.trc
Dumping diagnostic data indirectory=[cdmp_20150428184253], requested by (instance=1, osid=128898 (LGWR)),summary=[abnormal instance termination].
Instance terminated byLGWR, pid = 128898
Wed Apr 29 10:14:18 2015
Starting ORACLE instance(normal)
通过红色字体,能够了解到,在4.28日18:42:53时,数据库报错,提示 尝试发送进程失败,操作系统问题,status为7,失败信息为:参数列表过长等等,然后看到LGWR进程导致实例终止。最后在4.29日10:14:18启动实例。LGWR进程为ORALCE核心进程之一,此进程若die 则ORACLE实例会宕。
2、观察iokbl_lgwr_128898.trc追踪文件,发现:
*** 2015-04-2818:42:53.168
Warning: log write elapsed time 33787ms, size 1KB
error 27152detected in background process
ORA-27152:attempt to post process failed
ORA-27300: OSsystem dependent operation:semop failed with status: 7
ORA-27301: OSfailure message: Argument list too long
ORA-27302:failure occurred at: sskgpwvp2
ORA-27303: additionalinformation: old_semid post_count 0x158003 39
kjzduptcctx:Notifying DIAG for crash event
----- AbridgedCall Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5570<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End ofAbridged Call Stack Trace -----
*** 2015-04-2818:42:53.234
LGWR (ospid:128898): terminating the instance due to error 27152
ksuitm: waitingup to [5] seconds before killing DIAG(128882)
写日志消耗33787ms,大小1KB,IO已明显迟缓。LGWR日志会记录超过500MS的信息。
3、分析SYSTEM STATE DUMP文件,此处直接用ass.awk工具转换:
Ass.Awk Version 1.0.9 - Processing/u01/app/oracle/diag/rdbms/iokbl/iokbl/trace/iokbl_diag_128882_20150428184253.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: 0: waiting for 'pmon timer'
3: 0: waiting for 'rdbms ipc message'
4: 0: waiting for 'VKTM Logical Idle Wait'
5: 0: waiting for 'rdbms ipc message'
6: 0: waiting for 'DIAG idle wait'
7: 0: waiting for 'rdbms ipc message'
8: 0: waiting for 'DIAG idle wait'
9: 0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14:
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'smon timer'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'ASM background timer'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22:
23: 0: waiting for 'wait for unread messageon broadcast channel'
24:
25: 0: waiting for 'log file sync'
26: 0: waiting for 'class slave wait'
27: 0: waiting for 'SQL*Net message fromclient'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'control file sequentialread'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'rdbms ipc message'
33: 0: waiting for 'class slave wait'
34: 0: waiting for 'Streams AQ: qmncoordinator idle wait'
35: 0: waiting for 'log file sync'
36: 0: waiting for 'log file sync'
37: 0: waiting for 'log file sync'
38: 0: waiting for 'SQL*Net message fromclient'
39: 0: waiting for 'log file sync'
40: 0: waiting for 'log file sync'
41: 0: waiting for 'log file sync'
42: 0: waiting for 'log file sync'
43: 0: waiting for 'log file sync'
44: 0: waiting for 'SQL*Net message fromclient'
45: 0: waiting for 'SQL*Net message fromclient'
46: 0: waiting for 'rdbms ipc message'
47: 0: waiting for 'log file sync'
48: 0: waiting for 'SQL*Net message fromclient'
49: 0: waiting for 'log file sync'
50: 0: waiting for 'Space Manager: slaveidle wait'
51: 0: waiting for 'SQL*Net message fromclient'
52: 0: waiting for 'log file sync'
53: 0: waiting for 'SQL*Net message fromclient'
54: 0: waiting for 'log file sync'
55: 0: waiting for 'log file sync'
56: 0: waiting for 'log file sync'
57: 0: waiting for 'Streams AQ: qmn slaveidle wait'
58: 0: waiting for 'log file sync'
59: 0: waiting for 'log file sync'
60: 0: waiting for 'SQL*Net message fromclient'
61: 0: waiting for 'rdbms ipc message'
62: 0: waiting for 'log file sync'
63: 0: waiting for 'log file sync'
64: 0: waiting for 'Streams AQ: qmn slaveidle wait'
66: 0: waiting for 'class slave wait'
67: 0: waiting for 'class slave wait'
68: 0: waiting for 'SQL*Net message fromclient'
69: 0: waiting for 'log file sync'
70: 0: waiting for 'log file sync'
71: 0: waiting for 'SQL*Net message fromclient'
72: 9: waited for 'Streams AQ: waiting fortime management or cleanup tasks'
73: 0: waiting for 'log file sync'
74: 0: waiting for 'log file sync'
75: 0: waiting for 'log file sync'
76: 0: waiting for 'log file sync'
77: 0: waiting for 'log file sync'
78: 0: waiting for 'log file sync'
79: 0: waiting for 'log file sync'
80: 0: waiting for 'log file sync'
81: 0: waiting for 'log file sync'
82: 0: waiting for 'log file sync'
83: 0: waiting for 'log file sync'
84: 0: waiting for 'log file sync'
85: 0: waiting for 'log file sync'
86: 0: waiting for 'log file sync'
87: 0: waiting for 'log file sync'
88: 0: waiting for 'log file sync'
89: 0: waiting for 'log file sync'
90: 0: waiting for 'log file sync'
NO BLOCKING PROCESSES FOUND
126139 Lines Processed.
可以看出,主要等待事件为log file sync,那么初步分析为此事件导致LGWR进程死掉从而导致实例宕。
4、根据oracle官方文档介绍,ora-27300---ora-27303错误的介绍是:
27300, 00000, "OS system dependentoperation:%s failed with status: %s"
// *Cause: OS system call error
// *Action: contact Oracle Support
通过 MOS查找以上报错信息,未找到status为7的troubleshouting。5、根据AWR 报告显示,在17:00-18:00之间,在数据库实例宕之前,IO问题已经明显,如图:
6、再看操作系统日志:
Apr 28 18:43:24 iokbldb1 kernel: sd1:0:2:1: [sdag] Unhandled error code
Apr 28 18:43:24 iokbldb1 kernel: sd1:0:2:1: [sdag]
Apr 28 18:43:24 iokbldb1 kernel: Result:hostbyte=DID_ERROR driverbyte=DRIVER_OK
Apr 28 18:43:24 iokbldb1 kernel: sd1:0:2:1: [sdag] CDB:
Apr 28 18:43:24 iokbldb1 kernel: Read(10):28 00 00 0b f4 00 00 04 00 00
Apr 28 18:43:24 iokbldb1 kernel:end_request: I/O error, dev sdag, sector 783360
Apr 28 18:43:24 iokbldb1 kernel:device-mapper: multipath: Failing path 66:0.
Apr 28 18:43:24 iokbldb1 multipathd: 66:0:mark as failed
Apr 28 18:43:24 iokbldb1 multipathd:mpathc: remaining active paths: 7
Apr 28 18:43:25 iokbldb1 kernel: sd1:0:1:1: [sdab] Unhandled error code
Apr 28 18:43:25 iokbldb1 kernel: sd1:0:1:1: [sdab]
Apr 28 18:43:25 iokbldb1 kernel: Result:hostbyte=DID_ERROR driverbyte=DRIVER_OK
Apr 28 18:43:25 iokbldb1 kernel: sd1:0:1:1: [sdab] CDB:
Apr 28 18:43:25 iokbldb1 kernel: Read(10):28 00 00 04 48 00 00 04 00 00
Apr 28 18:43:25 iokbldb1 kernel:end_request: I/O error, dev sdab, sector 280576
Apr 28 18:43:25 iokbldb1 kernel:device-mapper: multipath: Failing path 65:176.
另通过dmesg|grep error可以看到:
IO是有问题的。
第二天,IBM存储工程师检测存储时发现主机与存储之间链路有问题,更换光纤后正常。