alert.log on node2:
Sat Jul 09 16:41:28 CST 2011
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
0 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Jul 09 16:41:29 CST 2011
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Sat Jul 09 16:41:30 CST 2011
LMS 0: 5074 GCS shadows traversed, 2242 replayed
Sat Jul 09 16:41:30 CST 2011
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
alert.log on node1 (after node2 was stopped with shutdown abort):
Sat Jul 09 17:32:37 CST 2011
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Jul 09 17:32:38 CST 2011
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Jul 09 17:32:39 CST 2011
LMS 0: 5947 GCS shadows traversed, 0 replayed
Sat Jul 09 17:32:39 CST 2011
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Sat Jul 09 17:32:40 CST 2011
Instance recovery: looking for dead threads
Sat Jul 09 17:32:40 CST 2011
Beginning instance recovery of 1 threads
Sat Jul 09 17:32:42 CST 2011
Started redo scan
Sat Jul 09 17:32:46 CST 2011
Completed redo scan
3 redo blocks read, 5 data blocks need recovery
Sat Jul 09 17:32:46 CST 2011
Started redo application at
Thread 2: logseq 5, block 1884
Sat Jul 09 17:32:47 CST 2011
Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0
Mem# 0: +RAC_DISK/racdb/onlinelog/group_3.258.751759681
Sat Jul 09 17:32:47 CST 2011
Completed redo application
Sat Jul 09 17:32:47 CST 2011
Completed instance recovery at
Thread 2: logseq 5, block 1887, scn 532837
3 data blocks read, 5 data blocks written, 3 redo blocks read
Sat Jul 09 17:32:48 CST 2011
Thread 2 advanced to log sequence 6 (thread recovery)
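An important service is involved here, Cluster Group Service (CGS):
LMON: the LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster, and when a node fails LMON is responsible for cluster reconfiguration. The service it provides is called Cluster Group Service (CGS).
Oracle Clusterware relies on the Process Monitor Daemon to deal with split-brain. However, if the instance on a node hangs abnormally, the problem may not be detectable from the Network, OS, and Clusterware layers alone, so the database needs a self-monitoring mechanism of its own. For this purpose the LMON process provides the Node Monitor function, which records the health of each node at the application layer. The health state is kept in a bitmap in the GRD, one bit per node: 0 means the node is down and 1 means it is running normally. The LMON processes on all nodes communicate with each other to confirm that this bitmap is consistent.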
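The one-bit-per-node bitmap described above can be modelled in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it says nothing about how Oracle actually stores the bitmap inside the GRD.

```python
def bitmap_from_nodes(nodes):
    """Build an integer bitmap with bit n set for every running node n."""
    bits = 0
    for n in nodes:
        if n < 0:
            raise ValueError("node number must be non-negative")
        bits |= 1 << n
    return bits


def nodes_from_bitmap(bits):
    """Return the sorted node numbers whose bit is 1 (running normally)."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def views_consistent(views):
    """True if every node reports the same bitmap.

    `views` maps a node number to the bitmap that node currently holds.
    """
    return len(set(views.values())) <= 1
```

For example, the "List of nodes: 0 1" entry in the logs corresponds to `bitmap_from_nodes([0, 1])`, and after node 1 is aborted the surviving view is `bitmap_from_nodes([0])`.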
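LMON can cooperate with the underlying Clusterware or work on its own. When LMON detects a split-brain at the instance level, it first expects Clusterware to resolve it. RAC does not assume that Clusterware will always succeed, however, so LMON does not wait indefinitely for the Clusterware layer. Once the wait times out, LMON automatically triggers IMR (Instance Membership Recovery). IMR can be regarded as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.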
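The wait-then-fall-back behaviour described above can be sketched as a small decision function. The function name and the returned labels are hypothetical, and which Oracle timeout actually governs the wait is an assumption here; the value is simply passed in by the caller.

```python
def next_action(clusterware_resolved, waited_sec, timeout_sec):
    """Decide what a monitor should do after detecting a split-brain.

    clusterware_resolved: True once the Clusterware layer has fixed the problem.
    waited_sec:           how long we have been waiting for Clusterware.
    timeout_sec:          maximum time to wait (assumed, supplied by the caller).
    """
    if clusterware_resolved:
        return "resolved_by_clusterware"
    if waited_sec >= timeout_sec:
        # Do not wait forever: fall back to database-level recovery (IMR).
        return "start_imr"
    return "keep_waiting"
```

With the "CSS recovery timeout = 71 sec" value that appears in the LMON trace, `next_action(False, 71, 71)` would return `"start_imr"`. Treating that value as the wait limit is an illustration, not a statement about Oracle's implementation.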
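LMON relies mainly on two kinds of heartbeat for health monitoring:
1. The heartbeat between nodes.
2. The disk heartbeat in the control file. The CKPT process of each instance updates the Checkpoint Progress Record block in the control file every 3 seconds. Because the control file is shared, each instance can check whether the others are updating their records on time and judge their state from that.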
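The disk heartbeat check can be illustrated with a minimal sketch. The 3-second interval comes from the text above. The `max_missed` threshold, the function name, and the use of plain timestamps are assumptions made for this example and do not reflect Oracle's actual algorithm.

```python
HEARTBEAT_INTERVAL_SEC = 3  # CKPT update interval stated in the text


def stale_instances(last_update, now, max_missed=3):
    """Return the instances whose heartbeat record looks stale.

    last_update: dict mapping instance number -> timestamp (seconds) of the
                 last update to that instance's record.
    now:         current timestamp in seconds.
    max_missed:  hypothetical number of missed intervals tolerated before an
                 instance is considered unhealthy.
    """
    limit = HEARTBEAT_INTERVAL_SEC * max_missed
    return sorted(inst for inst, ts in last_update.items() if now - ts > limit)
```

An instance that keeps updating its record every 3 seconds never exceeds the limit, while one that has hung or been aborted appears in the result once enough intervals have been missed.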
The corresponding LMON trace log:
*** 2011-07-09 16:41:25.412
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 2 0.
*** 2011-07-09 16:41:25.570
Name Service frozen
kjxgmcs: Setting state to 2 1.
kjxgrssvote: reconfig bitmap chksum 0xccd0ae50 cnt 2 master 0 ret 0
kjxggpoll: change poll time to 50 ms
*** 2011-07-09 16:41:25.665
Obtained RR update lock for sequence 3, RR seq 2
*** 2011-07-09 16:41:25.752
Voting results, upd 0, seq 4, bitmap: 0 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 71 sec
IMR Reconfig timeout = 300 sec
CGS rcfg timeout = 300 sec
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 4 2.
kjfmuin: bitmap 0 1
kjfmmhi: received msg from 0 (inc 2)
kjfmmhi: received msg from 1 (inc 4)
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 4 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 4 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 4 5.
Name Service normal
Name Service recovery done
*** 2011-07-09 16:41:27.200
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 4 6.
kjxggpoll: change poll time to 600 ms
*** 2011-07-09 16:41:28.279
kjfcrfg: DRM window size = 128->128 (min lognb = 10)
*** 2011-07-09 16:41:28.279
Reconfiguration started (old inc 2, new inc 4)
Synchronization timeout interval: 900 sec
List of nodes:
0 1
Undo tsn affinity 1
*** 2011-07-09 16:41:28.311
*** 2011-07-09 16:41:28.311
kjfcrfg: query of NESTED_RECONFIGURATION for node 1 failed with 7
Global Resource Directory frozen
node 0
node 1
release 10 2 0 5
asby init, 0/0/x2
asby returns, 0/0/x2/false
* Domain maps before reconfiguration:
* DOMAIN 0 (valid 1): 0
* End of domain mappings
* Domain maps after recomputation:
* DOMAIN 0 (valid 1): 0 1
* End of domain mappings
Dead inst
Join inst 1
Exist inst 0
Active Sendback Threshold = 50 %
Communication channels reestablished
sent syncr inc 4 lvl 1 to 0 (4,5/0/0)
sent synca inc 4 lvl 1 (4,5/0/0)
received all domreplay (4.6)
sent master 0 (4.6)
*** 2011-07-09 16:41:29.535
KJBDOMHVMAP: BEGINS
*** 2011-07-09 16:41:29.560
KJBDOMHVMAP: ENDS
sent dom info (4.6)
sent hv info (4.6)
sent syncr inc 4 lvl 2 to 0 (4,7/0/0)
sent synca inc 4 lvl 2 (4,7/0/0)
Master broadcasted resource hash value bitmaps
* kjfcrfg: domain 0 valid, valid_ver = 4
Non-local Process blocks cleaned out
Set master node info
sent syncr inc 4 lvl 3 to 0 (4,13/0/0)
sent synca inc 4 lvl 3 (4,13/0/0)
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 1 = 774
sent syncr inc 4 lvl 4 to 0 (4,15/0/0)
sent synca inc 4 lvl 4 (4,15/0/0)
Dwn-cvts replayed, VALBLKs dubious
sent syncr inc 4 lvl 5 to 0 (4,18/0/0)
sent synca inc 4 lvl 5 (4,18/0/0)
All grantable enqueues granted
sent syncr inc 4 lvl 6 to 0 (4,20/0/0)
sent synca inc 4 lvl 6 (4,20/0/0)
Submitted all GCS cache requests
sent syncr inc 4 lvl 7 to 0 (4,22/0/0)
sent synca inc 4 lvl 7 (4,22/0/0)
Post SMON to start 1st pass IR
Fix write in gcs resources
sent syncr inc 4 lvl 8 to 0 (4,24/0/0)
sent synca inc 4 lvl 8 (4,24/0/0)
*** 2011-07-09 16:41:31.006
Reconfiguration complete
*** 2011-07-09 17:32:33.682
kjxgmpoll reconfig bitmap: 0
*** 2011-07-09 17:32:33.745
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 4 0.
*** 2011-07-09 17:32:34.157
Name Service frozen
kjxgmcs: Setting state to 4 1.
kjxgrssvote: reconfig bitmap chksum 0x6668604e cnt 1 master 0 ret 0
kjxggpoll: change poll time to 50 ms
*** 2011-07-09 17:32:34.464
Obtained RR update lock for sequence 5, RR seq 4
*** 2011-07-09 17:32:37.539
Voting results, upd 0, seq 6, bitmap: 0
CGS/IMR TIMEOUTS:
CSS recovery timeout = 71 sec
IMR Reconfig timeout = 300 sec
CGS rcfg timeout = 300 sec
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 6 2.
kjfmSendAbortInstMsg: send an abort message to node 1
kjfmSendAbortInstMsg: unique id 0x0 reason 0x1
kjfmuin: bitmap 0
kjfmmhi: received msg from 0 (inc 2)
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 6 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 6 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 6 5.
Name Service normal
Name Service recovery done
*** 2011-07-09 17:32:37.598
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 6 6.
kjxggpoll: change poll time to 600 ms
kjfmact: call ksimdic on instance (1)
*** 2011-07-09 17:32:37.843
kjfcrfg: DRM window size = 128->128 (min lognb = 10)
*** 2011-07-09 17:32:37.845
Reconfiguration started (old inc 4, new inc 6)
Synchronization timeout interval: 900 sec
List of nodes:
0
Undo tsn affinity 1
*** 2011-07-09 17:32:37.906
Global Resource Directory frozen
node 0
asby init, 0/0/x2
asby returns, 0/0/x2/false
* Domain maps before reconfiguration:
* DOMAIN 0 (valid 1): 0 1
* End of domain mappings
* kjbdomrcfg2: domain 0 invalid = TRUE
* Domain maps after recomputation:
* DOMAIN 0 (valid 0): 0
* End of domain mappings
Active Sendback Threshold = 50 %
Communication channels reestablished
sent syncr inc 6 lvl 1 to 0 (6,5/0/0)
sent syncr inc 6 lvl 2 to 0 (6,7/0/0)
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Set master node info
sent syncr inc 6 lvl 3 to 0 (6,13/0/0)
Submitted all remote-enqueue requests
sent syncr inc 6 lvl 4 to 0 (6,15/0/0)
Dwn-cvts replayed, VALBLKs dubious
sent syncr inc 6 lvl 5 to 0 (6,18/0/0)
All grantable enqueues granted
sent syncr inc 6 lvl 6 to 0 (6,20/0/0)
*** 2011-07-09 17:32:39.351
Post SMON to start 1st pass IR
Submitted all GCS cache requests
sent syncr inc 6 lvl 7 to 0 (6,22/0/0)
Fix write in gcs resources
sent syncr inc 6 lvl 8 to 0 (6,24/0/0)
*** 2011-07-09 17:32:39.673
Reconfiguration complete
* domain 0 valid?: 0
kjxgfipccb: msg 0x0xb7db2a6c, mbo 0x0xb7db2a68, type 19, ack 0, ref 0, stat 34
Source: ITPUB blog, link: http://blog.itpub.net/758322/viewspace-702235/. Please credit the source when reposting.