Oracle RAC (Cluster) Reconfiguration Notes (3)

cqhiabc50405

Tags: database

alert.log on node2:

Sat Jul 09 16:41:28 CST 2011

Reconfiguration started (old inc 2, new inc 4)

List of nodes:

0 1

Global Resource Directory frozen

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Sat Jul 09 16:41:29 CST 2011

LMS 0: 0 GCS shadows cancelled, 0 closed

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Sat Jul 09 16:41:30 CST 2011

LMS 0: 5074 GCS shadows traversed, 2242 replayed

Sat Jul 09 16:41:30 CST 2011

Submitted all GCS remote-cache requests

Post SMON to start 1st pass IR

Fix write in gcs resources

Reconfiguration complete

alert.log on node1 (after node2 was stopped with shutdown abort):

Sat Jul 09 17:32:37 CST 2011

Reconfiguration started (old inc 4, new inc 6)

List of nodes:

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Sat Jul 09 17:32:38 CST 2011

LMS 0: 0 GCS shadows cancelled, 0 closed

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Sat Jul 09 17:32:39 CST 2011

LMS 0: 5947 GCS shadows traversed, 0 replayed

Sat Jul 09 17:32:39 CST 2011

Submitted all GCS remote-cache requests

Fix write in gcs resources

Reconfiguration complete

Sat Jul 09 17:32:40 CST 2011

Instance recovery: looking for dead threads

Sat Jul 09 17:32:40 CST 2011

Beginning instance recovery of 1 threads

Sat Jul 09 17:32:42 CST 2011

Started redo scan

Sat Jul 09 17:32:46 CST 2011

Completed redo scan

3 redo blocks read, 5 data blocks need recovery

Sat Jul 09 17:32:46 CST 2011

Started redo application at

Thread 2: logseq 5, block 1884

Sat Jul 09 17:32:47 CST 2011

Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0

Mem# 0: +RAC_DISK/racdb/onlinelog/group_3.258.751759681

Sat Jul 09 17:32:47 CST 2011

Completed redo application

Sat Jul 09 17:32:47 CST 2011

Completed instance recovery at

Thread 2: logseq 5, block 1887, scn 532837

3 data blocks read, 5 data blocks written, 3 redo blocks read

Sat Jul 09 17:32:48 CST 2011

Thread 2 advanced to log sequence 6 (thread recovery)
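
To read the node1 alert.log above more quickly, the following is a minimal sketch that pulls out the incarnation change and the duration of the reconfiguration and instance recovery phases. The function name `summarize` and the shortened `ALERT_LOG` excerpt are illustrative choices, not part of any Oracle tool, and the sketch assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# Shortened excerpt of the node1 alert.log shown above (blank lines omitted).
ALERT_LOG = """\
Sat Jul 09 17:32:37 CST 2011
Reconfiguration started (old inc 4, new inc 6)
Sat Jul 09 17:32:39 CST 2011
Reconfiguration complete
Sat Jul 09 17:32:40 CST 2011
Beginning instance recovery of 1 threads
Sat Jul 09 17:32:47 CST 2011
Completed instance recovery at
"""

# Matches alert.log timestamp lines and captures only the time of day.
TIME_RE = re.compile(r"^\w{3} \w{3} \d{2} (\d{2}:\d{2}:\d{2}) \w+ \d{4}$")
INC_RE = re.compile(r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)")


def summarize(log_text):
    """Return the incarnation change and phase durations in seconds."""
    current = None   # time of the most recent timestamp line
    marks = {}       # event name -> time at which it was logged
    result = {}
    for line in log_text.splitlines():
        line = line.strip()
        m = TIME_RE.match(line)
        if m:
            current = datetime.strptime(m.group(1), "%H:%M:%S")
            continue
        m = INC_RE.search(line)
        if m:
            result["old_inc"] = int(m.group(1))
            result["new_inc"] = int(m.group(2))
            marks["reconfig_start"] = current
        elif line == "Reconfiguration complete":
            marks["reconfig_end"] = current
        elif line.startswith("Beginning instance recovery"):
            marks["recovery_start"] = current
        elif line.startswith("Completed instance recovery"):
            marks["recovery_end"] = current
    result["reconfig_seconds"] = (
        marks["reconfig_end"] - marks["reconfig_start"]
    ).total_seconds()
    result["recovery_seconds"] = (
        marks["recovery_end"] - marks["recovery_start"]
    ).total_seconds()
    return result


print(summarize(ALERT_LOG))
```

Because alert.log timestamps have one-second resolution, the durations are only approximate.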

The reconfiguration shown in these logs involves an important service, Cluster Group Service (CGS):

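LMON: The LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster, and when a node fails, LMON is responsible for cluster reconfiguration. The service it provides is called Cluster Group Service (CGS). Oracle Clusterware uses the Process Monitor Daemon as part of its approach to split-brain, but if an instance on some node hangs abnormally, the fault may go undetected when viewed only from the network, OS, and Clusterware layers. The database therefore needs its own monitoring mechanism. The LMON process provides the Node Monitor function, which records the health of each node at the application layer. Node health is recorded in a bitmap in the GRD, one bit per node: 0 means down and 1 means running normally. The LMON processes on the nodes communicate with each other to confirm that this bitmap is consistent.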
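
The bitmap idea can be illustrated with a small sketch. This is only a toy model of "one bit per node, 0 = down, 1 = running" and of a consistency check between the views held by different nodes; the function names are hypothetical and it does not reflect how the GRD stores the bitmap internally.

```python
def set_node(bitmap, node_id, up):
    """Return a copy of bitmap with the bit for node_id set to 1 (up) or 0 (down)."""
    if up:
        return bitmap | (1 << node_id)
    return bitmap & ~(1 << node_id)


def live_nodes(bitmap):
    """List the node ids whose bit is 1."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def views_consistent(views):
    """True when every node reports the same bitmap."""
    return len(set(views.values())) <= 1


both_up = set_node(set_node(0, 0, True), 1, True)  # nodes 0 and 1 running
after_abort = set_node(both_up, 1, False)          # node 1 is gone

print(live_nodes(both_up))      # [0, 1]
print(live_nodes(after_abort))  # [0]
print(views_consistent({0: both_up, 1: both_up}))      # True
print(views_consistent({0: after_abort, 1: both_up}))  # False
```

The two `live_nodes` results correspond to the member lists `0 1` and `0` in the logs in this article.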

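LMON can cooperate with the underlying Clusterware or work on its own. When LMON detects an instance-level split-brain, it first expects Clusterware to resolve it, but RAC does not assume that Clusterware will definitely succeed. LMON therefore does not wait indefinitely for the Clusterware layer: when the wait times out, LMON automatically triggers IMR (Instance Membership Recovery). IMR can be regarded as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.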
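
The "wait for Clusterware, then fall back to IMR" behaviour can be expressed as a very small decision function. This is a conceptual sketch only: `choose_resolver` and both of its parameters are hypothetical, and the timeout value used in the example calls is illustrative rather than an Oracle default.

```python
from typing import Optional


def choose_resolver(clusterware_done_after_sec: Optional[float],
                    wait_timeout_sec: float) -> str:
    """Decide which layer ends up resolving an instance-level split-brain.

    clusterware_done_after_sec: seconds until Clusterware resolves the
        problem, or None if it never does.
    wait_timeout_sec: how long LMON is willing to wait (illustrative).
    """
    if (clusterware_done_after_sec is not None
            and clusterware_done_after_sec <= wait_timeout_sec):
        return "clusterware"
    # The wait timed out, so the database layer takes over with IMR.
    return "imr"


print(choose_resolver(5, 30))     # Clusterware finishes in time
print(choose_resolver(45, 30))    # Clusterware is too slow
print(choose_resolver(None, 30))  # Clusterware never resolves it
```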

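LMON relies mainly on two kinds of heartbeat for health monitoring:

1. The network heartbeat between nodes.

2. The disk heartbeat in the control file. The CKPT process of each instance updates the Checkpoint Progress Record block in the control file every 3 seconds. Because the control file is shared, each instance can check whether the others are updating it on time and judge their state accordingly.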
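
The disk heartbeat check can be sketched as a staleness test on the last update time of each instance. The 3-second interval comes from the text; the tolerance of 10 missed updates, the function name `stale_instances`, and the instance names are assumptions made for illustration and are not Oracle settings.

```python
HEARTBEAT_INTERVAL_SEC = 3  # from the text: CKPT updates every 3 seconds
MISSED_BEATS_ALLOWED = 10   # hypothetical tolerance, not an Oracle parameter


def stale_instances(last_update, now,
                    interval=HEARTBEAT_INTERVAL_SEC,
                    allowed=MISSED_BEATS_ALLOWED):
    """Return the instances whose last heartbeat is older than the limit.

    last_update: dict mapping instance name -> time of last update (seconds).
    now: current time in the same unit.
    """
    limit = interval * allowed
    return sorted(name for name, ts in last_update.items() if now - ts > limit)


# Hypothetical instance names and timestamps.
heartbeats = {"racdb1": 1000.0, "racdb2": 960.0}
print(stale_instances(heartbeats, now=1001.0))  # ['racdb2']
```

An instance reported here would only be a suspect. In the mechanism described above, the actual decision is made through the reconfiguration vote.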

The corresponding LMON log:

*** 2011-07-09 16:41:25.412

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 2 0.

*** 2011-07-09 16:41:25.570

Name Service frozen

kjxgmcs: Setting state to 2 1.

kjxgrssvote: reconfig bitmap chksum 0xccd0ae50 cnt 2 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 16:41:25.665

Obtained RR update lock for sequence 3, RR seq 2

*** 2011-07-09 16:41:25.752

Voting results, upd 0, seq 4, bitmap: 0 1

CGS/IMR TIMEOUTS:

CSS recovery timeout = 71 sec

IMR Reconfig timeout = 300 sec

CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 4 2.

kjfmuin: bitmap 0 1

kjfmmhi: received msg from 0 (inc 2)

kjfmmhi: received msg from 1 (inc 4)

Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 4 3.

Name Service recovery started

Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 4 4.

Multicasted all local name entries for publish

Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 4 5.

Name Service normal

Name Service recovery done

*** 2011-07-09 16:41:27.200

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 4 6.

kjxggpoll: change poll time to 600 ms

*** 2011-07-09 16:41:28.279

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 16:41:28.279

Reconfiguration started (old inc 2, new inc 4)

Synchronization timeout interval: 900 sec

List of nodes:

0 1

Undo tsn affinity 1

*** 2011-07-09 16:41:28.311

kjfcrfg: query of NESTED_RECONFIGURATION for node 1 failed with 7

Global Resource Directory frozen

node 0

node 1

release 10 2 0 5

asby init, 0/0/x2

asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

* DOMAIN 0 (valid 1): 0

* End of domain mappings

* Domain maps after recomputation:

* DOMAIN 0 (valid 1): 0 1

* End of domain mappings

Dead inst

Join inst 1

Exist inst 0

Active Sendback Threshold = 50 %

Communication channels reestablished

sent syncr inc 4 lvl 1 to 0 (4,5/0/0)

sent synca inc 4 lvl 1 (4,5/0/0)

received all domreplay (4.6)

sent master 0 (4.6)

*** 2011-07-09 16:41:29.535

KJBDOMHVMAP: BEGINS

*** 2011-07-09 16:41:29.560

KJBDOMHVMAP: ENDS

sent dom info (4.6)

sent hv info (4.6)

sent syncr inc 4 lvl 2 to 0 (4,7/0/0)

sent synca inc 4 lvl 2 (4,7/0/0)

Master broadcasted resource hash value bitmaps

* kjfcrfg: domain 0 valid, valid_ver = 4

Non-local Process blocks cleaned out

Set master node info

sent syncr inc 4 lvl 3 to 0 (4,13/0/0)

sent synca inc 4 lvl 3 (4,13/0/0)

Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 1 = 774

sent syncr inc 4 lvl 4 to 0 (4,15/0/0)

sent synca inc 4 lvl 4 (4,15/0/0)

Dwn-cvts replayed, VALBLKs dubious

sent syncr inc 4 lvl 5 to 0 (4,18/0/0)

sent synca inc 4 lvl 5 (4,18/0/0)

All grantable enqueues granted

sent syncr inc 4 lvl 6 to 0 (4,20/0/0)

sent synca inc 4 lvl 6 (4,20/0/0)

Submitted all GCS cache requests

sent syncr inc 4 lvl 7 to 0 (4,22/0/0)

sent synca inc 4 lvl 7 (4,22/0/0)

Post SMON to start 1st pass IR

Fix write in gcs resources

sent syncr inc 4 lvl 8 to 0 (4,24/0/0)

sent synca inc 4 lvl 8 (4,24/0/0)

*** 2011-07-09 16:41:31.006

Reconfiguration complete

*** 2011-07-09 17:32:33.682

kjxgmpoll reconfig bitmap: 0

*** 2011-07-09 17:32:33.745

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 4 0.

*** 2011-07-09 17:32:34.157

Name Service frozen

kjxgmcs: Setting state to 4 1.

kjxgrssvote: reconfig bitmap chksum 0x6668604e cnt 1 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 17:32:34.464

Obtained RR update lock for sequence 5, RR seq 4

*** 2011-07-09 17:32:37.539

Voting results, upd 0, seq 6, bitmap: 0

CGS/IMR TIMEOUTS:

CSS recovery timeout = 71 sec

IMR Reconfig timeout = 300 sec

CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 6 2.

kjfmSendAbortInstMsg: send an abort message to node 1

kjfmSendAbortInstMsg: unique id 0x0 reason 0x1

kjfmuin: bitmap 0

kjfmmhi: received msg from 0 (inc 2)

Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 6 3.

Name Service recovery started

Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 6 4.

Multicasted all local name entries for publish

Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 6 5.

Name Service normal

Name Service recovery done

*** 2011-07-09 17:32:37.598

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 6 6.

kjxggpoll: change poll time to 600 ms

kjfmact: call ksimdic on instance (1)

*** 2011-07-09 17:32:37.843

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 17:32:37.845

Reconfiguration started (old inc 4, new inc 6)

Synchronization timeout interval: 900 sec

List of nodes:

Undo tsn affinity 1

*** 2011-07-09 17:32:37.906

Global Resource Directory frozen

node 0

asby init, 0/0/x2

asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

* DOMAIN 0 (valid 1): 0 1

* End of domain mappings

* kjbdomrcfg2: domain 0 invalid = TRUE

* Domain maps after recomputation:

* DOMAIN 0 (valid 0): 0

* End of domain mappings

Active Sendback Threshold = 50 %

Communication channels reestablished

sent syncr inc 6 lvl 1 to 0 (6,5/0/0)

sent syncr inc 6 lvl 2 to 0 (6,7/0/0)

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Set master node info

sent syncr inc 6 lvl 3 to 0 (6,13/0/0)

Submitted all remote-enqueue requests

sent syncr inc 6 lvl 4 to 0 (6,15/0/0)

Dwn-cvts replayed, VALBLKs dubious

sent syncr inc 6 lvl 5 to 0 (6,18/0/0)

All grantable enqueues granted

sent syncr inc 6 lvl 6 to 0 (6,20/0/0)

*** 2011-07-09 17:32:39.351

Post SMON to start 1st pass IR

Submitted all GCS cache requests

sent syncr inc 6 lvl 7 to 0 (6,22/0/0)

Fix write in gcs resources

sent syncr inc 6 lvl 8 to 0 (6,24/0/0)

*** 2011-07-09 17:32:39.673

Reconfiguration complete

* domain 0 valid?: 0

kjxgfipccb: msg 0x0xb7db2a6c, mbo 0x0xb7db2a68, type 19, ack 0, ref 0, stat 34
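
The LMON trace carries millisecond timestamps, so it can be used to time each reconfiguration more precisely than the alert.log. The sketch below measures the interval from `kjxgmrcfg: Reconfiguration started` to `Reconfiguration complete` and records the member list printed by the `Voting results` line. The function name `reconfigurations` and the shortened `TRACE` excerpt are illustrative, and the parser only handles the line shapes that appear in this article.

```python
from datetime import datetime

# Shortened excerpt of the LMON trace shown above (blank lines omitted).
TRACE = """\
*** 2011-07-09 16:41:25.412
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-07-09 16:41:25.752
Voting results, upd 0, seq 4, bitmap: 0 1
*** 2011-07-09 16:41:31.006
Reconfiguration complete
*** 2011-07-09 17:32:33.745
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-07-09 17:32:37.539
Voting results, upd 0, seq 6, bitmap: 0
*** 2011-07-09 17:32:39.673
Reconfiguration complete
"""


def reconfigurations(trace_text):
    """Return one dict per reconfiguration: start time, members, duration."""
    events = []
    stamp = None       # most recent "***" timestamp
    open_event = None  # reconfiguration currently in progress
    for line in trace_text.splitlines():
        line = line.strip()
        if line.startswith("*** "):
            stamp = datetime.strptime(line[4:], "%Y-%m-%d %H:%M:%S.%f")
        elif line.startswith("kjxgmrcfg: Reconfiguration started"):
            open_event = {"start": stamp, "members": None}
        elif line.startswith("Voting results") and open_event is not None:
            members = line.split("bitmap:")[1]
            open_event["members"] = [int(n) for n in members.split()]
        elif line == "Reconfiguration complete" and open_event is not None:
            elapsed = (stamp - open_event["start"]).total_seconds()
            open_event["seconds"] = round(elapsed, 3)
            events.append(open_event)
            open_event = None
    return events


for event in reconfigurations(TRACE):
    print(event["start"], event["members"], event["seconds"])
```

For this trace the two reconfigurations take about 5.6 and 5.9 seconds: the first when node 1 joins and the second after it is aborted.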

Source: ITPUB Blog, http://blog.itpub.net/758322/viewspace-702235/. If you repost this article, please credit the source; otherwise legal action may be pursued.
