一个BUG引发的灾难:ORA-00600 [kjmchkiseq:!seq]

       对于打工人可能最痛苦的就是被告知的故障,数据库有监控、告警、每天巡检,自己做了一系列数据库的“安保”工作,本以为可以万无一失,中午在安心的睡觉中,被人告知数据库crash了。当时的我一脸懵逼,不可能,没有告警,半小时前我还登进去查信息呢,怎么可能crash了呢,结果登陆系统被现实一幕啪啪打脸,数据库1节点crash了,2节点正常,监控和告警突然也有问题了,当然今天重点不在这。立马启动数据库~咦~数据库真的起来了,那么怎么宕的呢,需要研究alert日志了。

###alert.log

Archived Log entry 7660224 added for thread 1 sequence 1179174 ID 0x5ad9d9fa dest 1:
Thu Jan 14 14:10:04 2021
TABLE E2EQA.F_JIAKUAN_DPI_DNS_USER_IP_H: ADDED INTERVAL PARTITION SYS_P5860754 (17860) VALUES LESS THAN (TO_DATE(' 2021-01-14 04:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Thu Jan 14 14:10:14 2021
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc  (incident=564829):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []

Incident details in: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/incident/incdir_564829/nmsb1_lms4_248229_i564829.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jan 14 14:10:15 2021
Dumping diagnostic data in directory=[cdmp_20210114141015], requested by (instance=1, osid=248229 (LMS4)), summary=[incident=564829].
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []

Thu Jan 14 14:10:16 2021
Sweep [inc][564829]: completed
Sweep [inc2][564829]: completed
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []

LMS4 (ospid: 248229): terminating the instance due to error 484
Thu Jan 14 14:10:17 2021
opiodr aborting process unknown ospid (144669) as a result of ORA-1092

       从alert.log发现数据库在1月14日 14:10分遭遇ORA-00600[kjmchkiseq:!seq]错误,通过trace发现这个报错是由LMS进程触发的,继续检查trace日志

###nmsb1_lms4_248229_i564829.trc

## nmsb1_lms4_248229_i564829.trc
Dump continued from file: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []

========= Dump for incident 564829 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0xad0f9.87a] inc=20 len=144 sender=(2,2) seq=0   <---block 708857 file 2170
     fg=iqb stat=KJUSERSTAT_DONE spnum=18 flg=x2

----- End of Customized Incident Dump(s) -----

Session Wait History:
        elapsed time of 0.646200 sec since last wait
     0: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542447 seq_num=55512 snap_id=1
        wait times: snap=0.000098 sec, exc=0.000098 sec, total=0.000098 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000055 sec of elapsed time
     1: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542446 seq_num=55511 snap_id=1
        wait times: snap=0.001050 sec, exc=0.001050 sec, total=0.001050 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000010 sec of elapsed time
     2: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542445 seq_num=55510 snap_id=1
        wait times: snap=0.000127 sec, exc=0.000127 sec, total=0.000127 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000023 sec of elapsed time
     3: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542444 seq_num=55509 snap_id=1
        wait times: snap=0.000208 sec, exc=0.000208 sec, total=0.000208 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000064 sec of elapsed time
     4: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542443 seq_num=55508 snap_id=1
        wait times: snap=0.000380 sec, exc=0.000380 sec, total=0.000380 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000020 sec of elapsed time
     5: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542442 seq_num=55507 snap_id=1
        wait times: snap=0.000385 sec, exc=0.000385 sec, total=0.000385 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000014 sec of e lapsed time
     6: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542441 seq_num=55506 snap_id=1
        wait times: snap=0.000057 sec, exc=0.000057 sec, total=0.000057 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000045 sec of elapsed time
     7: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542440 seq_num=55505 snap_id=1
        wait times: snap=0.000242 sec, exc=0.000242 sec, total=0.000242 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000014 sec of elapsed time
     8: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542439 seq_num=55504 snap_id=1
        wait times: snap=0.000315 sec, exc=0.000315 sec, total=0.000315 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000033 sec of elapsed time
     9: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542438 seq_num=55503 snap_id=1
        wait times: snap=0.000561 sec, exc=0.000561 sec, total=0.000561 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000018 sec of elapsed time
----- DDE Diagnostic Information Dump -----
Depth: 1
DDE flags: 0x0
Heap: 0x7f44fbd51ec0
Incident Context pointer in diag: 0x7fff75931748
Incident ID Cache: 0x137d41e5e8 
Invocation Context #: 0
----- Invocation Context Dump -----
Address: 0x7f44fbd55f60
Phase: 3
flags: 0x18E0001
Incident ID: 564829
Error Descriptor: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
Error class: 0
Problem Key # of args: 1
Number of actions: 11
----- Incident Context Dump -----
Address: 0x7fff75931748
Incident ID: 564829
Problem Key: ORA 600 [kjmchkiseq:!seq]
Error: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling  
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []  
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []
MD [00]: 'SID'='1423.1' (0x2)
MD [01]: 'ProcId'='18.1' (0x2)
MD [02]: 'Client ProcId'='oracle@dm05db01.bmcc.com.cn.248229_139934305621760' (0x0)

End of KJC Communication Dump
 DEFER MSG QUEUE 0 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 0]  0:0.0  1:442149.0
 DEFER MSG QUEUE 1 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 1]  0:0.0  1:2147483647.255
 DEFER MSG QUEUE 2 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 2]  0:0.0  1:2147483647.255
error 484 detected in background process
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []<<<<<<<<kjmchkiseq - check for indirect sequence prior processing. sequence for sendq message must be increasing.  0 = received message seq   0 = received message wrap 2147483647= last message seq 255 = last message wrap 

kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5608<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253 
----- End of Abridged Call Stack Trace -----

*** 2021-01-14 14:10:17.044
LMS4 (ospid: 248229): terminating the instance due to error 484

       从trace中发现由于LMS进程在处理message时失败(持续了三次,抛出三次ora-00600),在rac的运行机制下档LMS进程收到错误消息时为了保证数据一致性将1节点踢出集群。那么是什么原因导致LMS接受消息失败呢,是节点间通信有问题吗,我们收集了OSW查看

Ip:
    1845957605880 total packets received
    1373116 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1845955287255 incoming packets delivered
    1789568708375 requests sent out
    5 outgoing packets dropped
    96 dropped because of missing route
    2 fragments dropped after timeout
    279032 reassemblies required
    139515 packets reassembled ok
    2 packet reassembles failed
    139515 fragments received ok
    279030 fragments created
Icmp:
    48861429 ICMP messages received
    8479 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 19166487
        timeout in transit: 2962
        echo requests: 12984085
        echo replies: 16701821
        timestamp request: 60
        address mask request: 99
    37334456 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 7637180
        time exceeded: 2
        echo request: 16713224
        echo replies: 12983990
        timestamp replies: 60
IcmpMsg:
        InType0: 16701821
        InType3: 19166487
        InType8: 12984085
        InType9: 5905
        InType11: 2962
        InType13: 60
        InType17: 99
        InType165: 10
        OutType0: 12983990
        OutType3: 7637180
        OutType8: 16713224
        OutType11: 2
        OutType14: 60
Udp:
    1163780019445 packets received
    7918856 packets to unknown port received.
    942462850 packet receive errors
    1163464113908 packets sent
    RcvbufErrors: 942462850
IgnoredMulti: 159009153

zzz <01/14/2021 14:10:16> subcount: 111
Ip:
    1845957797979 total packets received
    1373116 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1845955479354 incoming packets delivered
    1789568895187 requests sent out
    5 outgoing packets dropped
    96 dropped because of missing route
    2 fragments dropped after timeout
    279032 reassemblies required
    139515 packets reassembled ok
    2 packet reassembles failed
    139515 fragments received ok
    279030 fragments created
Icmp:
    48861441 ICMP messages received
    8479 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 19166499
        timeout in transit: 2962
        echo requests: 12984085
        echo replies: 16701821
        timestamp request: 60
        address mask request: 99
    37334458 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 7637182
        time exceeded: 2
        echo request: 16713224
        echo replies: 12983990
        timestamp replies: 60
IcmpMsg:
        InType0: 16701821
        InType3: 19166499
        InType8: 12984085
        InType9: 5905
        InType11: 2962
        InType13: 60
        InType17: 99
        InType165: 10
        OutType0: 12983990
        OutType3: 7637182
        OutType8: 16713224
        OutType11: 2
        OutType14: 60
Udp:
    1163780060287 packets received
    7918858 packets to unknown port received.
    942462852 packet receive errors >>>>>>>>>>>>> incr 2
    1163464154791 packets sent
    RcvbufErrors: 942462852
    IgnoredMulti: 159009252 

检查数据库patch

数据库版本为Oracle 11.2.0.4.180417 for exadata

在故障时间段从IP层面监控fragments dropped分片/组包正常,UDP有少量丢包现象,不是节点间通讯的问题,继续查看metalink案例发现一片文章,BUG Bug 23628685 - RAC instance crashed by lms with ORA-600 [kjmchkiseq:!seq] (Doc ID 23628685.8)

metalink描述

ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255]
Where:
1 = received message seq
0 = received message wrap
2147483646 = last message seq
255 = last message wrap
Instances have been up for a very long time (in this customers case 592 days) the message header wrap can itself wrap past 255 back to 0.
In this case the next message that has a seq=1 and wrp=0 will fail with a kjmchkiseq:!seq assert as the stored wrp is 255 from the previous message

db alert<sid>.log
-------------------------
Tue Jun 21 09:02:53 2016
Archived Log entry 1100596 added for thread 1 sequence 177386 ID 0x438fd7bc
dest 1:
Tue Jun 21 09:04:59 2016
Errors in file <DIR>/trace/<INSTANCE>_lms1_27192.trc
(incident=1480143):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
Incident details in:
<DIR>/incident/incdir_1480143/<INSTANCE>_lms1_27192_i1480143.trc
-------------------------------
<sid>_lms1_27192_i1480143.trc
-----------------------
Dump continued from file:
<DIR>/trace/<INSTANCE>_lms1_27192.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
========= Dump for incident 1480143 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0x237f3c.c0] inc=4 len=144 sender=(2,2) seq=1
fg=iqb stat=KJUSERSTAT_DONE spnum=15 flg=x0
----- End of Customized Incident Dump(s) -----
STACK TRACE:
------------
Error: ORA-600 [kjmchkiseq:!seq] [1] [0] [2147483646] [255] [34] [0] [1] []
[] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling
2021/1/14 Document 23628685.8
https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=e7wrd5piw_236&id=23628685.8 3/3
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []

现象跟这个bug及其相似,糟糕的是oracle对于bug触发原因并没有过多的解释,咨询sr也没有过多解释,可通过打补丁修复bug。

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值