对于打工人可能最痛苦的就是被告知的故障,数据库有监控、告警、每天巡检,自己做了一系列数据库的“安保”工作,本以为可以万无一失,中午在安心的睡觉中,被人告知数据库crash了。当时的我一脸懵逼,不可能,没有告警,半小时前我还登进去查信息呢,怎么可能crash了呢,结果登陆系统被现实一幕啪啪打脸,数据库1节点crash了,2节点正常,监控和告警突然也有问题了,当然今天重点不在这。立马启动数据库~咦~数据库真的起来了,那么怎么宕的呢,需要研究alert日志了。
###alert.log
Archived Log entry 7660224 added for thread 1 sequence 1179174 ID 0x5ad9d9fa dest 1:
Thu Jan 14 14:10:04 2021
TABLE E2EQA.F_JIAKUAN_DPI_DNS_USER_IP_H: ADDED INTERVAL PARTITION SYS_P5860754 (17860) VALUES LESS THAN (TO_DATE(' 2021-01-14 04:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Thu Jan 14 14:10:14 2021
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc (incident=564829):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/incident/incdir_564829/nmsb1_lms4_248229_i564829.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jan 14 14:10:15 2021
Dumping diagnostic data in directory=[cdmp_20210114141015], requested by (instance=1, osid=248229 (LMS4)), summary=[incident=564829].
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
Thu Jan 14 14:10:16 2021
Sweep [inc][564829]: completed
Sweep [inc2][564829]: completed
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
LMS4 (ospid: 248229): terminating the instance due to error 484
Thu Jan 14 14:10:17 2021
opiodr aborting process unknown ospid (144669) as a result of ORA-1092
从alert.log发现数据库在1月14日 14:10分遭遇ORA-00600[kjmchkiseq:!seq]错误,通过trace发现这个报错是由LMS进程触发的,继续检查trace日志
###nmsb1_lms4_248229_i564829.trc
## nmsb1_lms4_248229_i564829.trc
Dump continued from file: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
========= Dump for incident 564829 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0xad0f9.87a] inc=20 len=144 sender=(2,2) seq=0 <---block 708857 file 2170
fg=iqb stat=KJUSERSTAT_DONE spnum=18 flg=x2
----- End of Customized Incident Dump(s) -----
Session Wait History:
elapsed time of 0.646200 sec since last wait
0: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542447 seq_num=55512 snap_id=1
wait times: snap=0.000098 sec, exc=0.000098 sec, total=0.000098 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000055 sec of elapsed time
1: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542446 seq_num=55511 snap_id=1
wait times: snap=0.001050 sec, exc=0.001050 sec, total=0.001050 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000010 sec of elapsed time
2: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542445 seq_num=55510 snap_id=1
wait times: snap=0.000127 sec, exc=0.000127 sec, total=0.000127 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000023 sec of elapsed time
3: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542444 seq_num=55509 snap_id=1
wait times: snap=0.000208 sec, exc=0.000208 sec, total=0.000208 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000064 sec of elapsed time
4: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542443 seq_num=55508 snap_id=1
wait times: snap=0.000380 sec, exc=0.000380 sec, total=0.000380 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000020 sec of elapsed time
5: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542442 seq_num=55507 snap_id=1
wait times: snap=0.000385 sec, exc=0.000385 sec, total=0.000385 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000014 sec of e lapsed time
6: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542441 seq_num=55506 snap_id=1
wait times: snap=0.000057 sec, exc=0.000057 sec, total=0.000057 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000045 sec of elapsed time
7: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542440 seq_num=55505 snap_id=1
wait times: snap=0.000242 sec, exc=0.000242 sec, total=0.000242 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000014 sec of elapsed time
8: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542439 seq_num=55504 snap_id=1
wait times: snap=0.000315 sec, exc=0.000315 sec, total=0.000315 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000033 sec of elapsed time
9: waited for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=3658542438 seq_num=55503 snap_id=1
wait times: snap=0.000561 sec, exc=0.000561 sec, total=0.000561 sec
wait times: max=0.030000 sec
wait counts: calls=1 os=1
occurred after 0.000018 sec of elapsed time
----- DDE Diagnostic Information Dump -----
Depth: 1
DDE flags: 0x0
Heap: 0x7f44fbd51ec0
Incident Context pointer in diag: 0x7fff75931748
Incident ID Cache: 0x137d41e5e8
Invocation Context #: 0
----- Invocation Context Dump -----
Address: 0x7f44fbd55f60
Phase: 3
flags: 0x18E0001
Incident ID: 564829
Error Descriptor: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
Error class: 0
Problem Key # of args: 1
Number of actions: 11
----- Incident Context Dump -----
Address: 0x7fff75931748
Incident ID: 564829
Problem Key: ORA 600 [kjmchkiseq:!seq]
Error: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []
MD [00]: 'SID'='1423.1' (0x2)
MD [01]: 'ProcId'='18.1' (0x2)
MD [02]: 'Client ProcId'='oracle@dm05db01.bmcc.com.cn.248229_139934305621760' (0x0)
End of KJC Communication Dump
DEFER MSG QUEUE 0 ON LMS4 IS EMPTY
SEQUENCES:
[sidx 0] 0:0.0 1:442149.0
DEFER MSG QUEUE 1 ON LMS4 IS EMPTY
SEQUENCES:
[sidx 1] 0:0.0 1:2147483647.255
DEFER MSG QUEUE 2 ON LMS4 IS EMPTY
SEQUENCES:
[sidx 2] 0:0.0 1:2147483647.255
error 484 detected in background process
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []<<<<<<<<kjmchkiseq - check for indirect sequence prior processing. sequence for sendq message must be increasing. 0 = received message seq 0 = received message wrap 2147483647= last message seq 255 = last message wrap
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5608<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
*** 2021-01-14 14:10:17.044
LMS4 (ospid: 248229): terminating the instance due to error 484
从trace中发现由于LMS进程在处理message时失败(持续了三次,抛出三次ora-00600),在rac的运行机制下档LMS进程收到错误消息时为了保证数据一致性将1节点踢出集群。那么是什么原因导致LMS接受消息失败呢,是节点间通信有问题吗,我们收集了OSW查看
Ip:
1845957605880 total packets received
1373116 with invalid addresses
0 forwarded
0 incoming packets discarded
1845955287255 incoming packets delivered
1789568708375 requests sent out
5 outgoing packets dropped
96 dropped because of missing route
2 fragments dropped after timeout
279032 reassemblies required
139515 packets reassembled ok
2 packet reassembles failed
139515 fragments received ok
279030 fragments created
Icmp:
48861429 ICMP messages received
8479 input ICMP message failed.
ICMP input histogram:
destination unreachable: 19166487
timeout in transit: 2962
echo requests: 12984085
echo replies: 16701821
timestamp request: 60
address mask request: 99
37334456 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 7637180
time exceeded: 2
echo request: 16713224
echo replies: 12983990
timestamp replies: 60
IcmpMsg:
InType0: 16701821
InType3: 19166487
InType8: 12984085
InType9: 5905
InType11: 2962
InType13: 60
InType17: 99
InType165: 10
OutType0: 12983990
OutType3: 7637180
OutType8: 16713224
OutType11: 2
OutType14: 60
Udp:
1163780019445 packets received
7918856 packets to unknown port received.
942462850 packet receive errors
1163464113908 packets sent
RcvbufErrors: 942462850
IgnoredMulti: 159009153
zzz <01/14/2021 14:10:16> subcount: 111
Ip:
1845957797979 total packets received
1373116 with invalid addresses
0 forwarded
0 incoming packets discarded
1845955479354 incoming packets delivered
1789568895187 requests sent out
5 outgoing packets dropped
96 dropped because of missing route
2 fragments dropped after timeout
279032 reassemblies required
139515 packets reassembled ok
2 packet reassembles failed
139515 fragments received ok
279030 fragments created
Icmp:
48861441 ICMP messages received
8479 input ICMP message failed.
ICMP input histogram:
destination unreachable: 19166499
timeout in transit: 2962
echo requests: 12984085
echo replies: 16701821
timestamp request: 60
address mask request: 99
37334458 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 7637182
time exceeded: 2
echo request: 16713224
echo replies: 12983990
timestamp replies: 60
IcmpMsg:
InType0: 16701821
InType3: 19166499
InType8: 12984085
InType9: 5905
InType11: 2962
InType13: 60
InType17: 99
InType165: 10
OutType0: 12983990
OutType3: 7637182
OutType8: 16713224
OutType11: 2
OutType14: 60
Udp:
1163780060287 packets received
7918858 packets to unknown port received.
942462852 packet receive errors >>>>>>>>>>>>> incr 2
1163464154791 packets sent
RcvbufErrors: 942462852
IgnoredMulti: 159009252
检查数据库patch
数据库版本为Oracle 11.2.0.4.180417 for exadata
在故障时间段从IP层面监控fragments dropped分片/组包正常,UDP有少量丢包现象,不是节点间通讯的问题,继续查看metalink案例发现一片文章,BUG Bug 23628685 - RAC instance crashed by lms with ORA-600 [kjmchkiseq:!seq] (Doc ID 23628685.8)
metalink描述
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255]
Where:
1 = received message seq
0 = received message wrap
2147483646 = last message seq
255 = last message wrap
Instances have been up for a very long time (in this customers case 592 days) the message header wrap can itself wrap past 255 back to 0.
In this case the next message that has a seq=1 and wrp=0 will fail with a kjmchkiseq:!seq assert as the stored wrp is 255 from the previous message
db alert<sid>.log
-------------------------
Tue Jun 21 09:02:53 2016
Archived Log entry 1100596 added for thread 1 sequence 177386 ID 0x438fd7bc
dest 1:
Tue Jun 21 09:04:59 2016
Errors in file <DIR>/trace/<INSTANCE>_lms1_27192.trc
(incident=1480143):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
Incident details in:
<DIR>/incident/incdir_1480143/<INSTANCE>_lms1_27192_i1480143.trc
-------------------------------
<sid>_lms1_27192_i1480143.trc
-----------------------
Dump continued from file:
<DIR>/trace/<INSTANCE>_lms1_27192.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
========= Dump for incident 1480143 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0x237f3c.c0] inc=4 len=144 sender=(2,2) seq=1
fg=iqb stat=KJUSERSTAT_DONE spnum=15 flg=x0
----- End of Customized Incident Dump(s) -----
STACK TRACE:
------------
Error: ORA-600 [kjmchkiseq:!seq] [1] [0] [2147483646] [255] [34] [0] [1] []
[] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling
2021/1/14 Document 23628685.8
https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=e7wrd5piw_236&id=23628685.8 3/3
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []
现象跟这个bug及其相似,糟糕的是oracle对于bug触发原因并没有过多的解释,咨询sr也没有过多解释,可通过打补丁修复bug。