VIP did not fail over to node 1 after the node 2 host was shut down

Original post, 2013-12-17 22:42:52

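Symptom:

After the node 2 host was shut down, its VIP did not fail over to node 1.

As shown below, checking on node 1 confirms that the VIP did not fail over: ifconfig lists only one alias interface (eth0:1), and no second VIP address has been added.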
[root@MAA01 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.111 Bcast:10.0.15.255 Mask:255.255.255.0
 inet6 addr: fe80::a6ba:dbff:fe13:e2ab/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:12896778 errors:0 dropped:0 overruns:0 frame:0
 TX packets:9488933 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
RX bytes:2875695560 (2.6 GiB) TX bytes:2411913446 (2.2 GiB)
 Interrupt:114 Memory:d6000000-d6012800

eth0:1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.115 Bcast:10.0.15.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 Interrupt:114 Memory:d6000000-d6012800

eth1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AD
inet addr:192.168.127.101 Bcast:192.168.127.255 Mask:255.255.255.0
 inet6 addr: fe80::a6ba:dbff:fe13:e2ad/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:203865 errors:0 dropped:0 overruns:0 frame:0
 TX packets:309076 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
RX bytes:217218808 (207.1 MiB) TX bytes:66031839 (62.9 MiB)
 Interrupt:122 Memory:d8000000-d8012800

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
 inet6 addr: ::1/128 Scope:Host
 UP LOOPBACK RUNNING MTU:16436 Metric:1
 RX packets:2264776 errors:0 dropped:0 overruns:0 frame:0
 TX packets:2264776 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
RX bytes:3249461652 (3.0 GiB) TX bytes:3249461652 (3.0 GiB)

[oracle@MAA01 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    maa01
ora....01.lsnr application    ONLINE    ONLINE    maa01
ora....t01.gsd application    ONLINE    ONLINE    maa01
ora....t01.ons application    ONLINE    ONLINE    maa01
ora....t01.vip application    ONLINE    ONLINE    maa01
ora....SM2.asm application    ONLINE    OFFLINE
ora....02.lsnr application    ONLINE    OFFLINE
ora....t02.gsd application    ONLINE    OFFLINE
ora....t02.ons application    ONLINE    OFFLINE
ora....t02.vip application    ONLINE    OFFLINE
ora.rac.db     application    ONLINE    ONLINE    maa01
ora....c1.inst application    ONLINE    ONLINE    maa01
ora....c2.inst application    ONLINE    OFFLINE
ora...._taf.cs application    OFFLINE   OFFLINE
ora....ac1.srv application    OFFLINE   OFFLINE
ora....ac2.srv application    OFFLINE   OFFLINE
ora....rac1.cs application    OFFLINE   OFFLINE
ora....ac1.srv application    OFFLINE   OFFLINE
ora....rac2.cs application    OFFLINE   OFFLINE
ora....ac2.srv application    OFFLINE   OFFLINE
[oracle@MAA01 ~]$
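
The interesting rows in this output are the ones whose Target is ONLINE while the State is not. As a minimal sketch (not part of the original troubleshooting), the Python helper below picks those rows out of captured `crs_stat -t` text. The function name `offline_resources` and the sample string are hypothetical, and the parser assumes the whitespace-separated five-column layout shown above, with the truncated resource names that `crs_stat -t` prints.

```python
def offline_resources(crs_stat_output):
    """Return names of resources whose Target is ONLINE but whose State is not."""
    mismatched = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows are: name, type, target, state, and optionally host.
        # Skip the header, the dashed separator, and anything else.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        name, _type, target, state = fields[:4]
        if target == "ONLINE" and state != "ONLINE":
            mismatched.append(name)
    return mismatched


# A few rows copied from the output above (assumed sample, not a full capture).
sample = """\
Name Type Target State Host
------------------------------------------------------------
ora....t01.vip application ONLINE ONLINE maa01
ora....t02.vip application ONLINE OFFLINE
ora...._taf.cs application OFFLINE OFFLINE
"""

print(offline_resources(sample))
```

Resources whose Target is OFFLINE (the services here) are deliberately ignored, since nobody asked CRS to run them; only node 2's VIP is reported for this sample.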

At this point, node 2 cannot be pinged from node 1:
[oracle@MAA01 ~]$ ping 10.8.32.112
PING 10.8.32.112 (10.8.32.112) 56(84) bytes of data.
From 10.8.32.111 icmp_seq=1 Destination Host Unreachable
From 10.8.32.111 icmp_seq=2 Destination Host Unreachable

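Analysis:
I checked the following log files and the listener configuration file on node 1. The listener configuration shows nothing abnormal: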
$CRS_HOME/log/<nodename>/*.log
$CRS_HOME/log/<nodename>/crsd/*.log
$CRS_HOME/log/<nodename>/cssd/*.log
$ORACLE_HOME/network/admin/listener.ora

[oracle@MAA01 ~]$
[oracle@MAA01 ~]$ cd $ORACLE_HOME
[oracle@MAA01 db]$ cd network/admin/
[oracle@MAA01 admin]$ cat listener.ora
# listener.ora.maa01 Network Configuration File: /oracle/app/11gR1/db/network/admin/listener.ora.maa01
# Generated by Oracle configuration tools.

INBOUND_CONNECT_TIMEOUT_LISTENER_MAA01=180

LISTENER_MAA01 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = MAA01-vip)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 10.8.32.111)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
      )
    )
  )

The relevant log files on node 1 show that CRS did attempt to fail over the VIP, but the attempt failed.
[oracle@MAA01 admin]$

crsd.log:

[crsd(5072)]CRS-1201:CRSD started on node maa01.
2013-08-09 17:18:57.513
[crsd(5072)]CRS-1205:Auto-start failed for the CRS resource . Details in maa01.
2013-08-09 17:28:01.175
[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 14.102 seconds
2013-08-09 17:28:02.177
[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 13.102 seconds
2013-08-09 17:28:09.181
[cssd(5555)]CRS-1611:node joadbtest02 (2) at 75% heartbeat fatal, eviction in 6.102 seconds
2013-08-09 17:28:13.179
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 2.102 seconds
2013-08-09 17:28:14.181
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 1.102 seconds
2013-08-09 17:28:15.183
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 0.092 seconds <--------------heart beat loss
2013-08-09 17:28:16.045
[cssd(5555)]CRS-1607:CSSD evicting node joadbtest02. Details in /oracle/app/11gR1/crs/log/maa01/cssd/ocssd.log.
[cssd(5555)]CRS-1601:CSSD Reconfiguration complete. Active nodes are maa01 .        <----------------------------------------------------Node2 was evicted
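
The countdown above (50%, 75%, 90%, then eviction) can be pulled out of a longer log mechanically. The following Python sketch is only an illustration: the regular expressions are written against the exact message wording in this excerpt (CRS-1610/1611/1612 and CRS-1607), which is an assumption about other versions, and the names `eviction_summary` and `sample_lines` are hypothetical.

```python
import re

# Patterns derived from the excerpt above; other CRS versions may word
# these messages differently (assumption).
COUNTDOWN = re.compile(
    r"CRS-16(?:10|11|12):node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, "
    r"eviction in ([\d.]+) seconds"
)
EVICTED = re.compile(r"CRS-1607:CSSD evicting node (\S+)\. Details")


def eviction_summary(lines):
    """Collect heartbeat warnings and evictions from CSSD alert messages."""
    summary = {"warnings": [], "evicted": []}
    for line in lines:
        m = COUNTDOWN.search(line)
        if m:
            # (node name, percent of timeout elapsed, seconds until eviction)
            summary["warnings"].append((m.group(1), int(m.group(3)), float(m.group(4))))
            continue
        m = EVICTED.search(line)
        if m:
            summary["evicted"].append(m.group(1))
    return summary


sample_lines = [
    "[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 14.102 seconds",
    "[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 0.092 seconds",
    "[cssd(5555)]CRS-1607:CSSD evicting node joadbtest02. Details in /oracle/app/11gR1/crs/log/maa01/cssd/ocssd.log.",
]

print(eviction_summary(sample_lines))
```

Here the eviction itself is expected, because node 2 was shut down on purpose. What matters is what CRS does next.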


alertmaa01.log:

[ CSSD]2013-08-09 17:28:31.188 [1158809920] >TRACE: clssnmUpdateNodeState: node 2, state (5/0) unique (1371182914/1371182914) prevConuni(1371182914) birth (244117402/244117402) (old/new)
[ CSSD]2013-08-09 17:28:31.188 [1158809920] >TRACE: clssnmDeactivateNode: node 2 (joadbtest02) left cluster

ocssd.log:

2013-08-09 17:18:57.506: [ CRSRES][1488656704] startRunnable: setting CLI values
2013-08-09 17:18:57.512: [ CRSRES][1486555456] maa01 : CRS-1019: Resource ora.joadbtest02.LISTENER_JOADBTEST02.lsnr (application) cannot run on maa01


2013-08-09 17:18:57.519: [ CRSRES][1488656704] Attempting to start `ora.maa01.ASM1.asm` on member `maa01`
2013-08-09 17:18:57.531: [ CRSRES][1490757952] startRunnable: setting CLI values
2013-08-09 17:18:57.541: [ CRSRES][1490757952] Attempting to start `ora.maa01.vip` on member `maa01`
2013-08-09 17:19:01.054: [ CRSRES][1490757952] Start of `ora.maa01.vip` on member `maa01` succeeded.
2013-08-09 17:19:01.079: [ CRSRES][1490757952] startRunnable: setting CLI values
2013-08-09 17:19:01.093: [ CRSRES][1490757952] Attempting to start `ora.maa01.LISTENER_MAA01.lsnr` on member `maa01`
2013-08-09 17:19:04.660: [ CRSRES][1490757952] Start of `ora.maa01.LISTENER_MAA01.lsnr` on member `maa01` succeeded.
2013-08-09 17:19:05.204: [ CRSRES][1513838912] CRS-1002: Resource 'ora.maa01.LISTENER_MAA01.lsnr' is already running on member 'maa01'

2013-08-09 17:28:31.192: [ OCRMAS][1213802816]th_master:13: I AM THE NEW OCR MASTER at incar 14. Node Number 1 <---Node 1 is master.
2013-08-09 17:28:31.194: [ CRSCOMM][1486555456] CLEANUP: Searching for connections to failed node joadbtest02
2013-08-09 17:28:31.194: [ CRSEVT][1486555456] Processing member leave for joadbtest02, incarnation: 244117407
2013-08-09 17:28:31.195: [ CRSD][1486555456] SM: recovery in process: 8
2013-08-09 17:28:31.195: [ CRSEVT][1486555456] Do failover for: joadbtest02 <------- failover starts here (and fails, see below).

2013-08-09 17:28:31.399: [ CRSRES][1513838912] startRunnable: setting CLI values
2013-08-09 17:28:31.414: [ CRSRES][1513838912] Attempting to start `ora.joadbtest02.vip` on member `maa01`  <--- attempt to fail the VIP over to node 1
2013-08-09 17:28:31.421: [ CRSRES][1530632512] startRunnable: setting CLI values
2013-08-09 17:28:31.434: [ CRSRES][1530632512] Attempting to start `ora.rac.db` on member `maa01`
2013-08-09 17:28:31.542: [ CRSRES][1530632512] Start of `ora.rac.db` on member `maa01` succeeded.
2013-08-09 17:28:37.863: [ CRSAPP][1513838912] StartResource error for ora.joadbtest02.vip error code = 1
2013-08-09 17:28:41.057: [ CRSRES][1513838912] Start of `ora.joadbtest02.vip` on member `maa01` failed. <---------VIP failover failed.
2013-08-09 17:28:41.085: [ CRSEVT][1486555456] Post recovery done evmd event for: joadbtest02
2013-08-09 17:28:41.085: [ CRSD][1486555456] SM: recoveryDone: 0
2013-08-09 17:28:41.098: [ CRSEVT][1486555456] Processing RecoveryDone
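
In a busy log the failed start is easy to miss among the successful ones. As a rough sketch, the Python below scans for the "Start of ... succeeded/failed" lines and reports only the failures. The regular expression assumes the quoting style exactly as it appears in this excerpt, and `failed_starts` and `sample_lines` are hypothetical names introduced for illustration.

```python
import re

# Matches result lines as they appear in the excerpt above (assumed format).
RESULT = re.compile(r"Start of `([^`]+)` on member `([^`]+)` (succeeded|failed)\.")


def failed_starts(lines):
    """Return (resource, member) pairs whose start attempt failed."""
    failures = []
    for line in lines:
        m = RESULT.search(line)
        if m and m.group(3) == "failed":
            failures.append((m.group(1), m.group(2)))
    return failures


sample_lines = [
    "2013-08-09 17:28:31.414: [ CRSRES][1513838912] Attempting to start `ora.joadbtest02.vip` on member `maa01`",
    "2013-08-09 17:28:31.542: [ CRSRES][1530632512] Start of `ora.rac.db` on member `maa01` succeeded.",
    "2013-08-09 17:28:41.057: [ CRSRES][1513838912] Start of `ora.joadbtest02.vip` on member `maa01` failed.",
]

print(failed_starts(sample_lines))
```

For this excerpt the only failure is the start of ora.joadbtest02.vip on maa01, which is exactly the failover that did not happen.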


Next, check the log file of the ora.joadbtest02.vip resource:
ora.joadbtest02.vip:
2013-08-09 17:28:34.723: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: checkIf: interface eth2 is down  <--- is this the clue?
Invalid parameters, or failed to bring up VIP (host=MAA01)

2013-08-09 17:28:34.729: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: cmd = /oracle/app/11gR1/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /oracle/app/11gR1/crs/bin/racgvip start joadbtest02

2013-08-09 17:28:34.729: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: rc = 1, time = 3.150s

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: cmd = /oracle/app/11gR1/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /oracle/app/11gR1/crs/bin/racgvip check joadbtest02

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: rc = 1, time = 3.130s

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: end for resource = ora.joadbtest02.vip, action = start, status = 1, time = 6.350s
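
The one line that matters in this log is the `checkIf` message. A small Python sketch, assuming the message format shown above and using the hypothetical names `down_interfaces` and `sample_lines`, extracts the resource and the interface it complained about:

```python
import re

# Format taken from the racg log excerpt above (assumption for other versions).
CHECKIF_DOWN = re.compile(r"\[(ora\.[^\]]+)\]: checkIf: interface (\S+) is down")


def down_interfaces(lines):
    """Return sorted, de-duplicated (resource, interface) pairs reported as down."""
    found = set()
    for line in lines:
        m = CHECKIF_DOWN.search(line)
        if m:
            found.add((m.group(1), m.group(2)))
    return sorted(found)


sample_lines = [
    "2013-08-09 17:28:34.723: [    RACG][1353934704] [11316][1353934704]"
    "[ora.joadbtest02.vip]: checkIf: interface eth2 is down",
    "Invalid parameters, or failed to bring up VIP (host=MAA01)",
]

print(down_interfaces(sample_lines))
```

The result names eth2 as the interface the node 2 VIP expects to find on the surviving node.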

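The clue is now visible: the problem lies with the network interface. Node 1's public IP is on eth0, but for some unknown reason node 2's public IP is on eth2. The node 2 VIP is therefore expected on eth2, and node 1 has no active eth2 (the ifconfig output above lists only eth0, eth1 and lo), so the VIP cannot be brought up on node 1.
The customer did not keep the earlier messages logs, and the older Oracle and cluster logs are gone as well, so there is no way to tell why the two nodes ended up with their public IPs on different interfaces.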
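
The diagnosis comes down to one check: does the surviving node have the interface that the failed node's VIP is configured on? The Python sketch below illustrates that check against captured `ifconfig` text. It assumes the older net-tools output format seen earlier in this post (interface header lines containing `Link encap:`), and `interface_names`, `has_interface` and the sample text are hypothetical.

```python
def interface_names(ifconfig_output):
    """Return interface names (including aliases such as eth0:1) from ifconfig text."""
    names = []
    for line in ifconfig_output.splitlines():
        # In the old net-tools format only the header line of each
        # interface contains "Link encap:" (assumption).
        if "Link encap:" in line:
            names.append(line.split()[0])
    return names


def has_interface(ifconfig_output, wanted):
    """True if the physical interface `wanted` appears, ignoring alias suffixes."""
    physical = {name.split(":")[0] for name in interface_names(ifconfig_output)}
    return wanted in physical


# Header lines copied from node 1's ifconfig output above (assumed sample).
sample = """\
eth0 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
eth0:1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
eth1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AD
lo Link encap:Local Loopback
"""

print(interface_names(sample))
print(has_interface(sample, "eth2"))
```

Running the same check on both nodes before a planned shutdown would have exposed the eth0/eth2 mismatch without waiting for a failed failover.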

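Solution:
Configure the public IP on the same network interface on both nodes. For the detailed steps, see an earlier article of mine:
VIP fails to start with error CRS-1006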
http://blog.csdn.net/zhou1862324/article/details/17268339

