背景
本文是文章:基于oracle 10.2.0.1 rac学习lmon进程系列六 :http://blog.itpub.net/9240380/viewspace-1839875/
的 延续,我们继续深入理解lmon进程的一些原理及含义,好我们进入正题。结论
1,如果rac中某个节点的lmon出现故障,并且其它节点关闭实例,会产生IPT TIMEPUT而且如果IPC TIMEOUT一直不解决,准确来说,就是某个节点的LMON不恢复正常,某个节点的任何操作会HANG住
2,某个节点lmon出现故障时,刚开始某个节点的告警日志及LMON TRC文件不会出现任何问题
但与此同时实例关闭哪个RAC节点的告警日志及LMON TRC会有相应信息
---告警日志
Tue Nov 17 03:54:45 2015
SHUTDOWN: waiting for detached processes to terminate.
----LMON TRC
*** 2015-11-17 03:49:40.116
Begin DRM(13)
*** 2015-11-17 03:49:41.148
Control file closed, terminating IMR --刚开始发现的信息,为何控制文件关闭,终止IMR,这个IMR是什么呢?
而且关闭RAC节点的SHUTDOWN IMMEDIATE
会在DISMOUNT这儿停一会儿
过一会儿,在出现LMON故障的节点及关闭实例的节点的告警日志及LMON TRC会出现ipc timeout以及oradebug ipc dump,
Tue Nov 17 03:54:34 2015
IPC Send timeout detected. Receiver ospid 17921
Tue Nov 17 03:54:35 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmon_17921.trc:
Tue Nov 17 03:54:37 2015
Trace dumping is performing id=[cdmp_20151117035457]
续上,并且在关闭RAC实例的节点之LMON TRC会记录IPC 通讯相关的详细日志
KSTDUMP: In-memory trace dump
TIME:SEQ# ORAPID SID EVENT OP DATA
========================================================================
B2208504:001AEDB1 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=976354 ---从这些信息可以看出是RAC 2个节点的IPC通讯出现问题,rdbms ipc message,cgs wait for ipc msg
B2208532:001AEDB2 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B220853A:001AEDB3 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=8 --并且这里把进程号,会话号,操作类型相关重要信息也列在这儿
B220853D:001AEDB4 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B22F6DD3:001AEE0F 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=977044
B22F6DF6:001AEE10 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B22F6DFC:001AEE11 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=5
B22F6DFE:001AEE12 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B23E5C18:001AEE71 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=978455
B23E5C42:001AEE72 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B23E5C49:001AEE73 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=8
B23E5C4C:001AEE74 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
此时如果在出现故障的LMON节点进行查询,会HANG住,我估计任何的操作也会HANG住,因为数据库正常不正常,没有完成重配操作
然后我在出现故障的节点恢复LMON
从告警日志可见开始重配动作以及实例恢复
Tue Nov 17 04:10:40 2015
Reconfiguration started (old inc 24, new inc 26)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Nov 17 04:10:40 2015
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Tue Nov 17 04:10:40 2015
LMS 0: 3470 GCS shadows traversed, 0 replayed
Tue Nov 17 04:10:40 2015
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Tue Nov 17 04:10:40 2015
Instance recovery: looking for dead threads ---可见重配完成后会进行实例恢复,即恢复NODE2节点2的相关中断事务或资源等信息
Instance recovery: lock domain invalid but no dead threads
3,只要数据库不重启,LMON进程的相关操作会记录同一个LMON TRC文件中,并且每次新的操作会产生一条日期相关的记录,也就是说分析时,只要分析与日期区配的相关数据即可
这是极为重要思路之一
[oracle@jingfa1 bdump]$ more jingfa1_lmon_17921.trc|grep "*** 2015-11-17 03"
*** 2015-11-17 03:48:57.367
*** 2015-11-17 03:54:35.697
*** 2015-11-17 03:54:35.697
测试
---oracle version
SQL> select * from v$version where rownum=1;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
---node1,hang lmon
oradebug suspend lmon
--同时在node2关闭节点的数据库实例,同时监控2个节点的告警日志及LMON的TRC文件
刚开始发现节点2关闭到database dismounted就停掉了一会儿,此时节点1的告警及LMON的TRC文件无任何信息
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
但是节点2有信息
---node2告警日志
发现出现IPC TIMEOUT的信息
Tue Nov 17 03:54:45 2015
SHUTDOWN: waiting for detached processes to terminate.
Tue Nov 17 03:54:54 2015
IPC Send timeout detected.Sender: ospid 25241
Receiver: inst 1 binc 432043878 ospid 17921
Communications reconfiguration: instance_number 1
Tue Nov 17 03:54:57 2015
Trace dumping is performing id=[cdmp_20151117035457]
Tue Nov 17 03:54:58 2015
freeing rdom 0
---node2 lmon trc
*** 2015-11-17 03:49:40.116
Begin DRM(13)
*** 2015-11-17 03:49:41.148
Control file closed, terminating IMR --刚开始发现的信息,为何控制文件关闭,终止IMR,这个IMR是什么呢?
---过了一会儿即告警日志出现IPC TIMEOUT对应出现的如下信息,我看和ORADEBUG IPC DUMP类似
*** 2015-11-17 03:54:54.470
*** 2015-11-17 03:54:54.470
IPC Send timeout detected.Sender: ospid 25241
Receiver: inst 1 binc 432043878 ospid 17921
SKGXPCTX: 0x0xf1d5618 ctx
WAIT HISTORY
Time(msec) Wait Type Return Code
---------- --------- ------------
1004 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
974 NORMAL TIMEDOUT
27 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1001 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1002 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1002 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1004 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1029 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1004 NORMAL TIMEDOUT
wait delta 0 sec (0 msec) ctx ts 0x56340f last ts 0x563413
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 2
timed wait receives 0
admno 0x63c82721 admport:
SSKGXPT 0xf1d5af4 flags info for network 0
socket no 9 IP 10.10.10.9 UDP 54219
sflags SSKGXPT_WRITESSKGXPT_UP
info for network 1
socket no 0 IP 0.0.0.0 UDP 0
sflags SSKGXPT_DOWN
active 0 actcnt 1
context timestamp 0x56340f
buffers queued on port 0x2af02751b148
sconno accono ertt state seq# sent async sync rtrans acks
0x23c75d22 0x2da31a1b 32 3 47210280550785 390 397 0 2418 47210280517960
slot 0 rqh=0x2af027a16100
seq=33152 len=472 accno=0x2da31a1b start TS=0x515deb rt TS=0x563767 X CNT=303
slot 1 rqh=0x2af027a16d00
seq=33145 len=472 accno=0x2da31a1b start TS=0x515a5a rt TS=0x5637e7 X CNT=304
slot 2 rqh=0x2af027a14900
seq=33146 len=472 accno=0x2da31a1b start TS=0x515a5a rt TS=0x5637e7 X CNT=304
slot 3 rqh=0x2af027a14500
seq=33147 len=472 accno=0x2da31a1b start TS=0x515b89 rt TS=0x563527 X CNT=303
slot 4 rqh=0x2af027a10520
seq=33148 len=472 accno=0x2da31a1b start TS=0x515b8a rt TS=0x563527 X CNT=303
slot 5 rqh=0x2af027a16500
seq=33149 len=472 accno=0x2da31a1b start TS=0x515cbd rt TS=0x563667 X CNT=303
slot 6 rqh=0x2af027a14100
seq=33150 len=472 accno=0x2da31a1b start TS=0x515cbd rt TS=0x563667 X CNT=303
slot 7 rqh=0x2af027a18500
seq=33151 len=472 accno=0x2da31a1b start TS=0x515deb rt TS=0x563767 X CNT=303
0x23c75d23 0x24cf7936 64 3 47210280550395 0 0 0 0 47210280517632
0x23c75d24 0x1bfbd842 64 3 47210280550395 0 0 0 0 47210280517632
ach accono sconno admno state seq# rcv rtrans acks
0x0f259e20 0x4daa7926 0x524cf969 0x3f03d9f5 47210280517672 33148 385 0 321
Submitting synchronized dump request [268435460]
kjxgfipccb: msg 0x0x2af027af16c8, mbo 0x0x2af027af16c0, type 22, ack 0, ref 0, stat 32
------ Dumping SKGXP context ------
SKGXPCTX: 0x0xf1d5618 ctx
WAIT HISTORY
Time(msec) Wait Type Return Code
---------- --------- ------------
1004 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
974 NORMAL TIMEDOUT
27 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1001 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1002 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1002 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1004 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1029 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
1004 NORMAL TIMEDOUT
wait delta 2 sec (2978 msec) ctx ts 0x56340f last ts 0x563413
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 2
timed wait receives 0
admno 0x63c82721 admport:
SSKGXPT 0xf1d5af4 flags info for network 0
socket no 9 IP 10.10.10.9 UDP 54219
sflags SSKGXPT_WRITESSKGXPT_UP
info for network 1
socket no 0 IP 0.0.0.0 UDP 0
sflags SSKGXPT_DOWN
active 0 actcnt 1
context timestamp 0x56340f
buffers queued on port 0x2af02751b148
sconno accono ertt state seq# sent async sync rtrans acks
0x23c75d23 0x24cf7936 64 3 47210280550395 0 0 0 0 47210280517632
0x23c75d24 0x1bfbd842 64 3 47210280550395 0 0 0 0 47210280517632
ach accono sconno admno state seq# rcv rtrans acks
0x0f259e20 0x4daa7926 0x524cf969 0x3f03d9f5 47210280517672 33148 385 0 321
------ Dumping KST traces ------
*** 2015-11-17 03:54:57.448
KSTDUMP: In-memory trace dump
TIME:SEQ# ORAPID SID EVENT OP DATA
========================================================================
B2208504:001AEDB1 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=976354 ---从这些信息可以看出是RAC 2个节点的IPC通讯出现问题,rdbms ipc message,cgs wait for ipc msg
B2208532:001AEDB2 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B220853A:001AEDB3 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=8 --并且这里把进程号,会话号,操作类型相关重要信息也列在这儿
B220853D:001AEDB4 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B22F6DD3:001AEE0F 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=977044
B22F6DF6:001AEE10 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B22F6DFC:001AEE11 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=5
B22F6DFE:001AEE12 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B23E5C18:001AEE71 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=978455
B23E5C42:001AEE72 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B23E5C49:001AEE73 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=8
B23E5C4C:001AEE74 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B24D49D0:001AEED1 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=978257
B24D4AE2:001AEED2 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B24D4AEF:001AEED3 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=14
B24D4AF3:001AEED4 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B25C3D83:001AEF33 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=979595
B25C3DCA:001AEF34 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B25C3DDE:001AEF35 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=22
B25C3DE3:001AEF36 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B26B3470:001AEF95 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=980587
B26B352D:001AEF96 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B26B35A5:001AEF97 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=74
B26B35AB:001AEF98 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B27A1DF3:001AEFF5 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=976916
B27A1F12:001AEFF6 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B27A1F58:001AEFF7 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=94
B27A1F5F:001AEFF8 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B28910B9:001AF059 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=979286
B289111D:001AF05A 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B2891124:001AF05B 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=8
B2891127:001AF05C 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B2980FEE:001AF0BB 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=982666
B29810C3:001AF0BC 5 167 10005 1 KSL WAIT BEG [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0
B29810CC:001AF0BD 5 167 10005 2 KSL WAIT END [CGS wait for IPC msg] 0/0x0 0/0x0 0/0x0 time=10
B29810D0:001AF0BE 5 167 10005 1 KSL WAIT BEG [rdbms ipc message] 100/0x64 0/0x0 0/0x0
B2A705C0:001AF11D 5 167 10005 2 KSL WAIT END [rdbms ipc message] 100/0x64 0/0x0 0/0x0 time=980205
--过了一会儿,发现在节点1也出现信息
--node 1告警日志
出现ipc timeout,而且ipc timeout也会出现多次,不过每次相关的进程不同
Tue Nov 17 03:54:34 2015
IPC Send timeout detected. Receiver ospid 17921
Tue Nov 17 03:54:35 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmon_17921.trc:
Tue Nov 17 03:54:37 2015
Trace dumping is performing id=[cdmp_20151117035457]
Tue Nov 17 04:00:15 2015
IPC Send timeout detected.Sender: ospid 17925
Receiver: inst 2 binc 432109341 ospid 25245
Tue Nov 17 04:00:15 2015
IPC Send timeout to 1.1 inc 24 for msg type 73 from opid 7
Tue Nov 17 04:04:18 2015
IPC Send timeout detected.Sender: ospid 931
Receiver: inst 2 binc 432109335 ospid 25243
Tue Nov 17 04:04:18 2015
IPC Send timeout to 1.0 inc 24 for msg type 12 from opid 18
---node1 lmon trc,这种方法可以分析lmon trc,也就是lmon trc会一直记录数据库LMON进程的相关操作活动,每次产生新的活动,会对应一个新的操作日期,可以基于这个日期进行分析,
而不要无针对性和目的性进行分析,这是极为重要的
[oracle@jingfa1 bdump]$ more jingfa1_lmon_17921.trc|grep "*** 2015-11-17 03"
*** 2015-11-17 03:48:57.367
*** 2015-11-17 03:54:35.697
*** 2015-11-17 03:54:35.697
---node1的lmon trc
*** 2015-11-17 03:54:35.697
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 17917, image:
*** 2015-11-17 03:54:35.697
ksedmp: internal or fatal error
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
内容略
WAIT HISTORY
Time(msec) Wait Type Return Code
---------- --------- ------------
0 NORMAL TIMEDOUT
101 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
100 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
100 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
101 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
100 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
99 NORMAL TIMEDOUT
2 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
100 NORMAL TIMEDOUT
0 NORMAL TIMEDOUT
wait delta 339 sec (339493 msec) ctx ts 0x5add78 last ts 0x5add78
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 4
timed wait receives 0
admno 0x3f03d9f5 admport:
timed wait receives 0
admno 0x3f03d9f5 admport:
SSKGXPT 0xcf74af4 flags info for network 0
socket no 9 IP 10.10.10.8 UDP 44498
sflags SSKGXPT_UP
info for network 1
socket no 0 IP 0.0.0.0 UDP 0
sflags SSKGXPT_DOWN
active 0 actcnt 1
context timestamp 0x5add78
buffers queued on port 0x2b13927fb148
sconno accono ertt state seq# sent async sync rtrans acks
0x524cf969 0x4daa7926 32 3 47360604406140 385 385 0 0 47360604373313
0x524cf96a 0x44d6d837 64 3 47360604405755 0 0 0 0 47360604372992
0x524cf96b 0x3c033748 64 3 47360604405755 0 0 0 0 47360604372992
ach accono sconno admno state seq# rcv rtrans acks
0x0cff90c0 0x2da31a1b 0x23c75d22 0x63c82721 47360604373032 33145 485 0 394
而且其后在节点1运行查询也会HANG住,即lmon进程出现故障,可能会导致数据库DML及查询HANG住
SQL> select * from v$version where rownum=1;
我马上恢复节点1的lmon,节点1马上进行重配操作,可见LMON进程极为重要,可见如果LMON出现故障,当前实例无法获取到另一个故障实例的相关资源配置情况,RAC无法完成重配操作,数据库就会处于不一致的情况
进而引发RAC HANG或其它严重的问题
Tue Nov 17 04:10:40 2015
Reconfiguration started (old inc 24, new inc 26)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Nov 17 04:10:40 2015
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Tue Nov 17 04:10:40 2015
LMS 0: 3470 GCS shadows traversed, 0 replayed
Tue Nov 17 04:10:40 2015
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Tue Nov 17 04:10:40 2015
Instance recovery: looking for dead threads ---可见重配完成后会进行实例恢复,即恢复NODE2节点2的相关中断事务或资源等信息
Instance recovery: lock domain invalid but no dead threads
---节点1的SQL查询也恢复正常
SQL> select * from v$version where rownum=1;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/9240380/viewspace-1839890/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/9240380/viewspace-1839890/