Environment:
Database: Oracle RAC 11.2.0.4.0
Host OS: Red Hat Enterprise Linux Server release 6.6 (Santiago)
Cluster log:
2016-11-03 16:49:09.591: [ CSSD][4028430080]clssgmQueueGrockRequest: queued msg from node(1), for operation 4, RPC#1181, generation 47, tag<0x7f55a0047e00>
2016-11-03 16:49:13.204: [ CSSD][3820443392]clssnmSendingThread: sending status msg to all nodes
2016-11-03 16:49:13.204: [ CSSD][3820443392]clssnmSendingThread: sent 5 status msgs to all nodes
2016-11-03 16:49:13.720: [ CSSD][3822020352]clssnmPollingThread: node bjltj1dw02 (2) at 50% heartbeat fatal, removal in 14.610 seconds
2016-11-03 16:49:13.720: [ CSSD][3822020352]clssnmPollingThread: node bjltj1dw02 (2) is impending reconfig, flag 2294796, misstime 15390
2016-11-03 16:49:13.720: [ CSSD][3822020352]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2016-11-03 16:49:13.720: [ CSSD][4042692352]clssnmvDHBValidateNcopy: node 2, bjltj1dw02, has a disk HB, but no network HB, DHB has rcfg 358519447, wrtcnt, 48653087, LATS 3373852946, lastSeqNo 48652997, uniqueness 1463024655, timestamp 1478162953/3373861536
2016-11-03 16:49:13.720: [ CSSD][4037945088]clssnmvDHBValidateNcopy: node 2, bjltj1dw02, has a disk HB, but no network HB, DHB has rcfg 358519447, wrtcnt, 48653088, LATS 3373852946, lastSeqNo 48652998, uniqueness 1463024655, timestamp 1478162953/3373861596
2016-11-03 16:49:13.750: [ CSSD][4045903616]clssnmvDiskPing: Writing with status 0x3, timestamp 1478162953/3373852976
2016-11-03 16:49:14.190: [ CSSD][4041099008]clssnmvDiskPing: Writing with status 0x3, timestamp 1478162954/3373853416
2016-11-03 16:49:14.221: [ CSSD][4033189632]clssnmvDHBValidateNcopy: node 2, bjltj1dw02, has a disk HB, but no network HB, DHB has rcfg 358519447, wrtcnt, 48653089, LATS 3373853446, lastSeqNo 48652984, uniqueness 1463024655, timestamp 1478162954/3373862336
2016-11-03 16:49:14.250: [ CSSD][4036343552]clssnmvDiskPing: Writing with status 0x3, timestamp 1478162954/3373853476
2016-11-03 16:49:14.251: [ CSSD][4045903616]clssnmvDiskPing: Writing with status 0x3, timestamp 1478162954/3373853476
2016-11-03 16:49:14.691: [ CSSD][4041099008]clssnmvDiskPing: Writing with status 0x3, timestamp 1478162954/3373853916
2016-11-03 16:49:14.720: [ CSSD][4042692352]clssnmvDHBValidateNcopy: node 2, bjltj1dw02, has a disk HB, but no network HB, DHB has rcfg 358519447, wrtcnt, 48653090, LATS 3373853946, lastSeqNo 48653087, uniqueness 1463024655, timestamp 1478162954/3373862536
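Investigation confirmed that a fatal failure of the private-network (interconnect) heartbeat caused node2 to be evicted from the cluster and left in a hung state. In the log, node2 still writes a disk heartbeat, but no network heartbeat arrives from it.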
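The diagnosis rests on two patterns in the CSSD log: the repeated "has a disk HB, but no network HB" messages, and the polling thread's countdown to removal. The following is a minimal sketch for pulling both out of a log with awk. The function names `count_no_network_hb` and `heartbeat_budget_ms` are made up for illustration, and the field positions are assumed from the excerpt in this article rather than from any documented log format.

```shell
#!/bin/sh
# Both functions read CSSD log text on stdin.

# Print "<node name> <count>" for every node that shows a disk heartbeat
# but no network heartbeat.
count_no_network_hb() {
    awk '
        /has a disk HB, but no network HB/ {
            # Assumed layout: "... node 2, bjltj1dw02, has a disk HB ..."
            for (i = 1; i + 2 <= NF; i++) {
                if ($i == "node") {
                    name = $(i + 2)
                    sub(/,$/, "", name)      # drop the trailing comma
                    count[name]++
                    break
                }
            }
        }
        END { for (n in count) print n, count[n] }
    '
}

# Print (time already missed + time left until removal) in milliseconds,
# taken from the clssnmPollingThread warning lines.
heartbeat_budget_ms() {
    awk '
        /heartbeat fatal, removal in/ {
            for (i = 1; i < NF; i++)
                if ($i == "in") { removal = $(i + 1) * 1000; have_r = 1 }
        }
        /misstime/ {
            for (i = 1; i < NF; i++)
                if ($i == "misstime") { miss = $(i + 1); have_m = 1 }
        }
        # %.0f rounds, which avoids floating-point truncation of 14.610 * 1000
        END { if (have_r && have_m) printf "%.0f\n", removal + miss }
    '
}

# Usage (the log path is an assumption; adjust to the local Grid home):
#   count_no_network_hb < /u01/app/11.2.0/grid/log/bjltj1dw01/cssd/ocssd.log
#   heartbeat_budget_ms < /u01/app/11.2.0/grid/log/bjltj1dw01/cssd/ocssd.log
```

For the warning lines in the excerpt, the second function adds the 14.610 seconds left before removal to the misstime of 15390 ms, giving 30000 ms, which appears consistent with the default CSS misscount of 30 seconds on Linux in 11.2.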
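Execution plan
1. Stop the database instance on node1;
2. Move the private network of both nodes to a dedicated switch and check private-network communication; if it works, continue to the next step, otherwise check the server configuration and hardware;
3. Start the cluster services on node2 and check whether the cluster services on both nodes are healthy; if so, continue to the next step, otherwise abandon this plan;
4. Start the database instances on both nodes and check the database status. If this fails, fall back to serving clients from a single node.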
Operational steps
Stop the database service on node1
$ su - oracle
$ sqlplus / as sysdba
SQL> shutdown immediate
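Replace the switch
The network engineer connected the dedicated switch, the system engineer moved the private network onto it, and the database engineer checked private-network communication and confirmed that the private network still could not communicate;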
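The private-network check in this step can be sketched as a small ping wrapper. This is only an illustration under stated assumptions: the function name `check_private_link` and the peer address in the usage line are hypothetical, and the article does not say which tool the database engineer actually used.

```shell
#!/bin/sh
# Check whether the peer node answers on the private network.
# PING can be overridden (for example with a stub) so the function can be
# exercised without a real network; by default the system ping is used.
check_private_link() {
    peer=$1          # private IP or hostname of the other node (supplied by the caller)
    count=${2:-3}    # number of echo requests to send
    # -c: number of requests, -W: per-reply timeout in seconds (Linux iputils ping)
    if "${PING:-ping}" -c "$count" -W 2 "$peer" >/dev/null 2>&1; then
        echo "private link to $peer: OK"
        return 0
    fi
    echo "private link to $peer: FAILED"
    return 1
}

# Usage (192.168.10.2 is a placeholder for node2's private address):
#   check_private_link 192.168.10.2
```

On a host with several NICs, adding ping's `-I <interface>` option would force the probe out through the private interface rather than whichever route the kernel picks.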
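Resolve the private-network fault
The system engineer worked on the private-network communication problem. After the fiber module was replaced, the NIC link light was found to be abnormal, so the system engineer asked the database engineer to stop the cluster so that the NIC could be replaced;
Stop the cluster services on node1
[root@bjltj1dw01 ~]$ su - root
[root@bjltj1dw01 ~]$ /u01/app/11.2.0/grid/bin/crsctl stop crs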
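Replace the NIC
After the host engineer replaced the NIC, node1 was started and the private-network communication fault was resolved;
Start the cluster services
The database engineer started the cluster services on node1
[root@bjltj1dw01 ~]$ su - root
[root@bjltj1dw01 ~]$ /u01/app/11.2.0/grid/bin/crsctl start crs
After confirming that the cluster services on node1 were healthy, the cluster services on node2 were started
[root@bjltj1dw02 ~]$ su - root
[root@bjltj1dw02 ~]$ /u01/app/11.2.0/grid/bin/crsctl start crs
The cluster services on both node1 and node2 were checked and found healthy.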
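The article does not show how the cluster services were checked. One common way is `crsctl check crs`, which prints one line per stack component. The sketch below assumes that a healthy component is reported with a line ending in "is online"; the function name `all_services_online` is made up for illustration.

```shell
#!/bin/sh
# Read `crsctl check crs` style output on stdin.
# Return 0 only if there is at least one non-empty line and every
# non-empty line ends with "is online"; return 1 otherwise.
all_services_online() {
    awk '
        NF == 0 { next }                       # ignore blank lines
        { total++ }
        /is online[ \t\r]*$/ { online++ }
        END { exit (total > 0 && online == total) ? 0 : 1 }
    '
}

# Usage, run as root on each node:
#   /u01/app/11.2.0/grid/bin/crsctl check crs | all_services_online \
#       && echo "cluster stack online" \
#       || echo "cluster stack NOT fully online"
```

Treating empty input as a failure is deliberate: if `crsctl` itself cannot run, the check should not report success.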
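Start the database instances on both nodes
[root@bjltj1dw01 ~]$ su - oracle
[oracle@bjltj1dw01 ~]$ srvctl start database -d bjltj1dw
The cluster services on both nodes were confirmed healthy and the database was open.
Log in to sqlplus to check the database status
Check the status of the database instances; the database is running normally
[oracle@bjltj1dw01 ~]$ sqlplus / as sysdba
SQL> set pages 200 lines 200
SQL> select instance_name, status from gv$instance;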
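The same query can be scripted so that the final check yields a number instead of a screen to read. This is a minimal sketch: `count_open_instances` is a hypothetical helper, and it assumes SQL*Plus output reduced to one "instance_name status" pair per line.

```shell
#!/bin/sh
# Read "instance_name status" lines on stdin and print how many
# instances report the status OPEN.
count_open_instances() {
    awk 'NF >= 2 && $NF == "OPEN" { n++ } END { print n + 0 }'
}

# Usage as the oracle user. The quoted 'EOF' keeps the shell from
# expanding $instance inside the here-document:
#   sqlplus -S / as sysdba <<'EOF' | count_open_instances
#   set heading off feedback off pagesize 0
#   select instance_name, status from gv$instance;
#   EOF
```

For this two-node cluster, a result of 2 would match the outcome described in the article; anything lower means at least one instance is not open.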