
After the node 2 host was shut down, the VIP did not fail over to node 1


Symptom:

After the node 2 host was shut down, its VIP did not fail over to node 1.

As shown below, checking on node 1 confirms that the VIP has not failed over.
[root@MAA01 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.111 Bcast:10.0.15.255 Mask:255.255.255.0
 inet6 addr: fe80::a6ba:dbff:fe13:e2ab/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:12896778 errors:0 dropped:0 overruns:0 frame:0
 TX packets:9488933 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
RX bytes:2875695560 (2.6 GiB) TX bytes:2411913446 (2.2 GiB)
 Interrupt:114 Memory:d6000000-d6012800

eth0:1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.115 Bcast:10.0.15.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 Interrupt:114 Memory:d6000000-d6012800

eth1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AD
inet addr:192.168.127.101 Bcast:192.168.127.255 Mask:255.255.255.0
 inet6 addr: fe80::a6ba:dbff:fe13:e2ad/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:203865 errors:0 dropped:0 overruns:0 frame:0
 TX packets:309076 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
RX bytes:217218808 (207.1 MiB) TX bytes:66031839 (62.9 MiB)
 Interrupt:122 Memory:d8000000-d8012800

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
 inet6 addr: ::1/128 Scope:Host
 UP LOOPBACK RUNNING MTU:16436 Metric:1
 RX packets:2264776 errors:0 dropped:0 overruns:0 frame:0
 TX packets:2264776 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
RX bytes:3249461652 (3.0 GiB) TX bytes:3249461652 (3.0 GiB)
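
To make the check above less dependent on eyeballing, here is a minimal Python sketch that maps each interface in legacy `ifconfig` output to its IPv4 address. The function name `addresses_by_interface` is my own, and the sample is trimmed from the output above. On node 1 only 10.8.32.111 and 10.8.32.115 (presumably node 1's own VIP on eth0:1) appear on the public network, so no additional alias for node 2's VIP has been brought up.

```python
import re

def addresses_by_interface(ifconfig_text):
    """Map interface name -> IPv4 address from legacy net-tools ifconfig output."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        # A stanza starts with the interface name followed by "Link encap:".
        header = re.match(r"^(\S+)\s+Link encap:", line)
        if header:
            current = header.group(1)
            continue
        addr = re.search(r"inet addr:(\d+\.\d+\.\d+\.\d+)", line)
        if addr and current:
            result[current] = addr.group(1)
    return result

# Sample trimmed from the ifconfig output of node 1 shown above.
sample = """\
eth0 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.111 Bcast:10.0.15.255 Mask:255.255.255.0

eth0:1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AB
inet addr:10.8.32.115 Bcast:10.0.15.255 Mask:255.255.255.0

eth1 Link encap:Ethernet HWaddr A4:BA:DB:13:E2:AD
inet addr:192.168.127.101 Bcast:192.168.127.255 Mask:255.255.255.0
"""

print(addresses_by_interface(sample))
# {'eth0': '10.8.32.111', 'eth0:1': '10.8.32.115', 'eth1': '192.168.127.101'}
```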

[oracle@MAA01 ~]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....SM1.asm application ONLINE ONLINE maa01
ora....01.lsnr application ONLINE ONLINE maa01
ora....t01.gsd application ONLINE ONLINE maa01
ora....t01.ons application ONLINE ONLINE maa01
ora....t01.vip application ONLINE ONLINE maa01
ora....SM2.asm application ONLINE OFFLINE
ora....02.lsnr application ONLINE OFFLINE
ora....t02.gsd application ONLINE OFFLINE
ora....t02.ons application ONLINE OFFLINE
ora....t02.vip application ONLINE OFFLINE
ora.rac.db application ONLINE ONLINE maa01
ora....c1.inst application ONLINE ONLINE maa01
ora....c2.inst application ONLINE OFFLINE
ora...._taf.cs application OFFLINE OFFLINE
ora....ac1.srv application OFFLINE OFFLINE
ora....ac2.srv application OFFLINE OFFLINE
ora....rac1.cs application OFFLINE OFFLINE
ora....ac1.srv application OFFLINE OFFLINE
ora....rac2.cs application OFFLINE OFFLINE
ora....ac2.srv application OFFLINE OFFLINE
[oracle@MAA01 ~]$
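
The interesting rows in the listing above are the ones whose Target is ONLINE while the State is OFFLINE. The following Python sketch picks them out of `crs_stat -t` text. The function name `wanted_but_offline` is hypothetical, and the parser assumes the column layout shown above (Name, Type, Target, State, optional Host). Note that the node 2 ASM, listener, GSD, ONS and instance resources are expected to be OFFLINE while that host is down; only the node 2 VIP is supposed to relocate to the surviving node.

```python
def wanted_but_offline(crs_stat_text):
    """Return resource names with Target ONLINE but State OFFLINE."""
    names = []
    for line in crs_stat_text.splitlines():
        fields = line.split()
        # Data rows look like: Name Type Target State [Host].
        # The header and the dashed separator do not have "application" as Type.
        if len(fields) >= 4 and fields[1] == "application":
            name, _type, target, state = fields[:4]
            if target == "ONLINE" and state == "OFFLINE":
                names.append(name)
    return names

# Sample rows copied from the crs_stat -t output above.
sample = """\
Name Type Target State Host
------------------------------------------------------------
ora....t01.vip application ONLINE ONLINE maa01
ora....t02.vip application ONLINE OFFLINE
ora...._taf.cs application OFFLINE OFFLINE
"""

print(wanted_but_offline(sample))
# ['ora....t02.vip']
```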

At this point, pinging node 2 from node 1 fails:
[oracle@MAA01 ~]$ ping 10.8.32.112
PING 10.8.32.112 (10.8.32.112) 56(84) bytes of data.
From 10.8.32.111 icmp_seq=1 Destination Host Unreachable
From 10.8.32.111 icmp_seq=2 Destination Host Unreachable

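Analysis:
The following log and configuration files on node 1 were examined. The listener configuration file showed nothing abnormal: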
$CRS_HOME/log/<nodename>/*.log
$CRS_HOME/log/<nodename>/crsd/*.log
$CRS_HOME/log/<nodename>/cssd/*.log
$ORACLE_HOME/network/admin/listener.ora

[oracle@MAA01 ~]$
[oracle@MAA01 ~]$ cd $ORACLE_HOME
[oracle@MAA01 db]$ cd network/admin/
[oracle@MAA01 admin]$ cat listener.ora
# listener.ora.maa01 Network Configuration File: /oracle/app/11gR1/db/network/admin/listener.ora.maa01
# Generated by Oracle configuration tools.

INBOUND_CONNECT_TIMEOUT_LISTENER_MAA01=180

LISTENER_MAA01 =
 (DESCRIPTION_LIST =
 (DESCRIPTION =
 (ADDRESS_LIST =
 (ADDRESS = (PROTOCOL = TCP)(HOST = MAA01-vip)(PORT = 1521)(IP = FIRST))
 )
 (ADDRESS_LIST =
 (ADDRESS = (PROTOCOL = TCP)(HOST = 10.8.32.111)(PORT = 1521)(IP = FIRST))
 )
 (ADDRESS_LIST =
 (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
 )
 )
 )

The related log files on node 1 show that CRS attempted to fail over the VIP, but the attempt failed.
[oracle@MAA01 admin]$

alertmaa01.log:

[crsd(5072)]CRS-1201:CRSD started on node maa01.
2013-08-09 17:18:57.513
[crsd(5072)]CRS-1205:Auto-start failed for the CRS resource . Details in maa01.
2013-08-09 17:28:01.175
[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 14.102 seconds
2013-08-09 17:28:02.177
[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 13.102 seconds
2013-08-09 17:28:09.181
[cssd(5555)]CRS-1611:node joadbtest02 (2) at 75% heartbeat fatal, eviction in 6.102 seconds
2013-08-09 17:28:13.179
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 2.102 seconds
2013-08-09 17:28:14.181
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 1.102 seconds
2013-08-09 17:28:15.183
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 0.092 seconds <--------------heartbeat lost
2013-08-09 17:28:16.045
[cssd(5555)]CRS-1607:CSSD evicting node joadbtest02. Details in /oracle/app/11gR1/crs/log/maa01/cssd/ocssd.log.
[cssd(5555)]CRS-1601:CSSD Reconfiguration complete. Active nodes are maa01 .        <----------------------------------------------------Node2 was evicted
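
The heartbeat messages above form a countdown to the eviction. As a rough illustration, this Python sketch extracts that countdown from the message text. The regular expression is written against the exact wording of the lines above and is an assumption beyond that; the function name `eviction_countdown` is my own.

```python
import re

# Matches lines such as:
# [cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 14.102 seconds
COUNTDOWN = re.compile(
    r"CRS-\d+:node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, "
    r"eviction in ([\d.]+) seconds"
)

def eviction_countdown(log_text):
    """Return (node, node_number, percent, seconds_left) for each warning."""
    rows = []
    for m in COUNTDOWN.finditer(log_text):
        node, number, percent, seconds = m.groups()
        rows.append((node, int(number), int(percent), float(seconds)))
    return rows

sample = """\
[cssd(5555)]CRS-1612:node joadbtest02 (2) at 50% heartbeat fatal, eviction in 14.102 seconds
[cssd(5555)]CRS-1611:node joadbtest02 (2) at 75% heartbeat fatal, eviction in 6.102 seconds
[cssd(5555)]CRS-1610:node joadbtest02 (2) at 90% heartbeat fatal, eviction in 0.092 seconds
"""

for row in eviction_countdown(sample):
    print(row)
```

A steadily shrinking `seconds_left` with no recovery in between is consistent with a node that simply went away, which matches the shutdown of the node 2 host here.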


ocssd.log:

[ CSSD]2013-08-09 17:28:31.188 [1158809920] >TRACE: clssnmUpdateNodeState: node 2, state (5/0) unique (1371182914/1371182914) prevConuni(1371182914) birth (244117402/244117402) (old/new)
[ CSSD]2013-08-09 17:28:31.188 [1158809920] >TRACE: clssnmDeactivateNode: node 2 (joadbtest02) left cluster

crsd.log:

2013-08-09 17:18:57.506: [ CRSRES][1488656704] startRunnable: setting CLI values
2013-08-09 17:18:57.512: [ CRSRES][1486555456] maa01 : CRS-1019: Resource ora.joadbtest02.LISTENER_JOADBTEST02.lsnr (application) cannot run on maa01


2013-08-09 17:18:57.519: [ CRSRES][1488656704] Attempting to start `ora.maa01.ASM1.asm` on member `maa01`
2013-08-09 17:18:57.531: [ CRSRES][1490757952] startRunnable: setting CLI values
2013-08-09 17:18:57.541: [ CRSRES][1490757952] Attempting to start `ora.maa01.vip` on member `maa01`
2013-08-09 17:19:01.054: [ CRSRES][1490757952] Start of `ora.maa01.vip` on member `maa01` succeeded.
2013-08-09 17:19:01.079: [ CRSRES][1490757952] startRunnable: setting CLI values
2013-08-09 17:19:01.093: [ CRSRES][1490757952] Attempting to start `ora.maa01.LISTENER_MAA01.lsnr` on member `maa01`
2013-08-09 17:19:04.660: [ CRSRES][1490757952] Start of `ora.maa01.LISTENER_MAA01.lsnr` on member `maa01` succeeded.
2013-08-09 17:19:05.204: [ CRSRES][1513838912] CRS-1002: Resource 'ora.maa01.LISTENER_MAA01.lsnr' is already running on member 'maa01'

2013-08-09 17:28:31.192: [ OCRMAS][1213802816]th_master:13: I AM THE NEW OCR MASTER at incar 14. Node Number 1 <---Node 1 is master.
2013-08-09 17:28:31.194: [ CRSCOMM][1486555456] CLEANUP: Searching for connections to failed node joadbtest02
2013-08-09 17:28:31.194: [ CRSEVT][1486555456] Processing member leave for joadbtest02, incarnation: 244117407
2013-08-09 17:28:31.195: [ CRSD][1486555456] SM: recovery in process: 8
2013-08-09 17:28:31.195: [ CRSEVT][1486555456] Do failover for: joadbtest02 <------- the failover that fails starts here

2013-08-09 17:28:31.399: [ CRSRES][1513838912] startRunnable: setting CLI values
2013-08-09 17:28:31.414: [ CRSRES][1513838912] Attempting to start `ora.joadbtest02.vip` on member `maa01`  <--- attempt to fail the VIP over to node 1
2013-08-09 17:28:31.421: [ CRSRES][1530632512] startRunnable: setting CLI values
2013-08-09 17:28:31.434: [ CRSRES][1530632512] Attempting to start `ora.rac.db` on member `maa01`
2013-08-09 17:28:31.542: [ CRSRES][1530632512] Start of `ora.rac.db` on member `maa01` succeeded.
2013-08-09 17:28:37.863: [ CRSAPP][1513838912] StartResource error for ora.joadbtest02.vip error code = 1
2013-08-09 17:28:41.057: [ CRSRES][1513838912] Start of `ora.joadbtest02.vip` on member `maa01` failed. <---------VIP failover failed.
2013-08-09 17:28:41.085: [ CRSEVT][1486555456] Post recovery done evmd event for: joadbtest02
2013-08-09 17:28:41.085: [ CRSD][1486555456] SM: recoveryDone: 0
2013-08-09 17:28:41.098: [ CRSEVT][1486555456] Processing RecoveryDone
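
In a long log, the failed start is easy to miss among the successful ones. Here is a small Python sketch that pulls out only the CRSRES entries reporting a failed start. The pattern is based on the wording of the entries above; the function name `failed_starts` is hypothetical.

```python
import re

# Matches entries such as:
# 2013-08-09 17:28:41.057: [ CRSRES][1513838912] Start of `ora.joadbtest02.vip` on member `maa01` failed.
FAILED_START = re.compile(
    r"^(\S+ \S+): .*Start of `([^`]+)` on member `([^`]+)` failed"
)

def failed_starts(log_text):
    """Return (timestamp, resource, member) for every failed resource start."""
    found = []
    for line in log_text.splitlines():
        m = FAILED_START.match(line)
        if m:
            found.append(m.groups())
    return found

sample = (
    "2013-08-09 17:28:31.542: [ CRSRES][1530632512] "
    "Start of `ora.rac.db` on member `maa01` succeeded.\n"
    "2013-08-09 17:28:41.057: [ CRSRES][1513838912] "
    "Start of `ora.joadbtest02.vip` on member `maa01` failed.\n"
)

print(failed_starts(sample))
# [('2013-08-09 17:28:41.057', 'ora.joadbtest02.vip', 'maa01')]
```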


Next, look at the log file of ora.joadbtest02.vip:
ora.joadbtest02.vip:
2013-08-09 17:28:34.723: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: checkIf: interface eth2 is down  <--- is this the clue?
Invalid parameters, or failed to bring up VIP (host=MAA01)

2013-08-09 17:28:34.729: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: cmd = /oracle/app/11gR1/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /oracle/app/11gR1/crs/bin/racgvip start joadbtest02

2013-08-09 17:28:34.729: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: rc = 1, time = 3.150s

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: cmd = /oracle/app/11gR1/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /oracle/app/11gR1/crs/bin/racgvip check joadbtest02

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: clsrcexecut: rc = 1, time = 3.130s

2013-08-09 17:28:37.861: [    RACG][1353934704] [11316][1353934704][ora.joadbtest02.vip]: end for resource = ora.joadbtest02.vip, action = start, status = 1, time = 6.350s
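
The `checkIf: interface eth2 is down` message can be cross-checked against the interfaces that actually exist on the surviving node. Below is a minimal Python sketch of that comparison. The set of local interface names is typed in by hand from node 1's ifconfig output, and both function names are my own.

```python
import re

# Matches the racgvip message: "checkIf: interface eth2 is down"
DOWN = re.compile(r"checkIf: interface (\S+) is down")

def interfaces_reported_down(vip_log_text):
    """Return the sorted, de-duplicated interface names reported as down."""
    return sorted(set(DOWN.findall(vip_log_text)))

def missing_locally(vip_log_text, local_interfaces):
    """Interfaces the VIP start wanted but that do not exist on this node."""
    return [name for name in interfaces_reported_down(vip_log_text)
            if name not in local_interfaces]

log_line = (
    "2013-08-09 17:28:34.723: [    RACG][1353934704] [11316][1353934704]"
    "[ora.joadbtest02.vip]: checkIf: interface eth2 is down"
)
# Interface names taken from the ifconfig output of node 1.
node1_interfaces = {"eth0", "eth0:1", "eth1", "lo"}

print(missing_locally(log_line, node1_interfaces))
# ['eth2']
```

If the result is non-empty, the VIP resource is being asked to come up on an interface the surviving node does not have, which is what happened here.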

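This is the clue: the problem lies with the network interface. On node 1 the public IP is on eth0, but for some unknown reason the public IP on node 2 is on eth2, so the VIP of node 2 expects an eth2 interface that does not exist on node 1.
The customer's earlier messages logs were not retained, and the older Oracle and cluster logs are gone as well, so there is no way to tell why the two nodes use different interfaces for the public IP.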

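Solution:
Configure the public IP on the same network interface on both nodes. For the detailed steps, see an earlier article of mine:
VIP fails to start with error CRS-1006
http://blog.csdn.net/zhou1862324/article/details/17268339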

