ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因集群心跳丢失重启。该问题是BUG 10194190 18740837
导致的,修复该问题需要打patch 25142535。有一客户的solaris oracle 11.2.0.4 rac所有节点的DB已经打上补丁,但是
主机运行248天后,跑在上边的Oracle 11.2.0.4依然因集群心跳丢失而重启。根据ORACLE MOS官方回复,修复该BUG
的一个补丁包 patch 18740837是需要同时在GI软件上应用。
下边是相关客户案例的问题分析及解决处理总结。
1、问题节点DB告警日志报错提示
2、问题节点ASM告警日志报错提示
3、 +ASM2_lmhb_4609_i78497.trc文件内容
4、OCSSD日志提示心跳超时
5、+ASM2_lmhb_4609_i78497.trc文件局部信息
===[ Session State Object ]===
----------------------------------------
SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0
(session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0 oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476951 seq_num=11521 snap_id=1
wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec
wait times: max=0.100000 sec, heur=2 min 46 sec
wait counts: calls=1 os=1
in_wait=1 iflags=0x5a8
Wait State:
fixed_waits=0 flags=0x22 boundary=0x0/-1
Session Wait History:
elapsed time of 0.000015 sec since current wait
0: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476950 seq_num=11520 snap_id=1
wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000138 sec of elapsed time
1: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476949 seq_num=11519 snap_id=1
wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000015 sec of elapsed time
2: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476948 seq_num=11518 snap_id=1
wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000133 sec of elapsed time
3: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476947 seq_num=11517 snap_id=1
wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
4: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476946 seq_num=11516 snap_id=1
wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000119 sec of elapsed time
5: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476945 seq_num=11515 snap_id=1
wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000012 sec of elapsed time
6: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476944 seq_num=11514 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000152 sec of elapsed time
7: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476943 seq_num=11513 snap_id=1
wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000016 sec of elapsed time
8: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476942 seq_num=11512 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000272 sec of elapsed time
9: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476941 seq_num=11511 snap_id=1
wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
Sampled Session History of session 145 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[121 samples, 21:08:20 - 21:10:20]
idle wait at each sample
temporary object counter: 0
----------------------------------------
Virtual Thread:
kgskvt: 3fc434ce8, sess: 3ffdb38d8 sid: 145 ser: 1
vc: 0, proc: 400b0c258, id: 145
consumer group cur: (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:
vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1
inwait: 0, short wait event: 0 posted_run: 0
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
location where inwait last set: NULL
location where inwait last cleared: NULL
is_assigned: 0, in_sched: 0 (0)
qcls: 0, qlink: FALSE
vt_active: 0 (pending: 0)
vt_pq_active: 0, dop: 0
used quanta (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu start time: 0
idle time: 0, active time: 0 (cg: 0)
cpu yields:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu wait time (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
io waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
io wait time:
stmt: 0, accum: 0, mapped: 0, tot: 0
ASL queued time outs: 0, time: 0 (cur 0, cg 0)
PQQ queued time outs: 0, time: 0 (cur 0, cg 0)
Queue timeout violation: 0
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
I/O credits acquired:small=0 large=0
I/O credits waiting for:small=0 large=0
KTU Session Commit Cache Dump for IDLs:
KTU Session Commit Cache Dump for Non-IDLs:
----------------------------------------
KKS-UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 3eae04a58
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04aa8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04af8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04b48
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
----------------------------------------
KGL-UOL SO Cache(total=0, free=0)
KGX Atomic Operation Log 3eae047a0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae047f8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04850
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae048a8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04900
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae04958
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
KGX Atomic Operation Log 3eae049b0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4=0 pt5=0 ub4=0
----------------------------------------
SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0
(dummy) nxc=0, nlb=0
===[ Callstack ]===
*** 2019-05-21 21:10:20.275
Process diagnostic dump for oracle@sm0ora10 (LMON), OS id=4604,
pid: 9, proc_ser: 1, sid: 145, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 21:10:19 ]
NOTE: scheduling delay has not been sampled for 0.505539 secs 0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg
0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg
0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg
*** 2019-05-21 21:10:21.374
loadavg : 20.48 16.00 13.13
swap info: free_mem = 194124.91M rsv = 258509.04M
alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O grid 4604 4596 1 79 20 ? 168184 Sep 15 ? 781:33 asm_lmon_+ASM2
Short stack dump:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
*** 2019-05-21 21:10:22.421
==============================
LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
----- End of Abridged Call Stack Trace -----
6、主机启动时间
root@0ora10# hostname
ora10
root@ora10# uname -a
SunOS ora10 5.11 11.3 sun4v sparc sun4v
root@# uptime
11:28pm up 248 day(s) , 15:48, 8 users, load average: 6.21, 7.27, 8.09
6、根据ORA-15046和ORA-29770查询和TRC 堆栈调用信息提示:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380 ORACLE MOS官网,发现完全匹配上: (Doc ID 2159643.1)
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
7、后续查看问题系统补丁安装情况,发现rac所有节点的DB已经应用上 Doc ID 2159643.1 提到的补丁 25142535
oracle@ora10$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
oracle@ora10$
oracle@ora11$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
oracle@ora11$
但是GRID GI没有应用补丁 25142535
grid@ora10$ opatch lsinventory|grep 25142535
grid@ora10$
grid@ora11$ opatch lsinventory|grep 25142535
grid@ora11$
8、就此问题咨询MOS,ORACLE MOS给出的官方回复也没明确提示补丁 25142535确实是否需要在GI上应用
9、补丁 25142535是补丁 18740837和 10194190的合集,其中 18740837小补丁的readme中有提示RAC环境需要在GI应用
10、问题分析结论:
由目前的实际问题情况及8和9的信息汇总,可以确定 补丁 25142535需要在oracle 11.2.0.4 rac for solaris的所有节点的
GI集群软件上应用。可以肯定的是:当前的 oracle 11.2.0.4 rac for solaris asm and db crash and restart确实是rac在
solaris上运行248天导致的,在集群所有节点DB软件上打补丁25142535不能解决,官方也没有明确说在GI软件上同时应
用补丁25142535后到底能否解决:
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
如果有类似问题的客户看到博文,如果有通过GI打补丁25142535解决此问题,或者通过其他途径解决此问题欢迎:
评论留言,感谢!
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/29357786/viewspace-2645319/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/29357786/viewspace-2645319/