1)         Environment

OS: Red Hat Enterprise Linux 5.2 AP

Node1: heartbeat 10.1.1.100  host IP: 192.168.150.21  IPMI IP: 10.1.1.102

Node2: heartbeat 10.1.1.101  host IP: 192.168.150.22  IPMI IP: 10.1.1.103

Services IP (floating): 192.168.150.25

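2)         Symptoms:

    After the cluster started normally, I ran clusvcadm -r services -m node2 on the active node to relocate the service. After the relocation, node1's physical NIC lost its IP address and the interface went inactive, while node2 brought up the floating address normally. The failure is consistently reproducible.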

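3)         Analysis and testing:

   I sent the system's sysreport files to the Red Hat 800 support line. Support replied that they could not reproduce the failure in their test environment and passed the problem on to Red Hat's development engineers in the US, who later answered that it is caused by a bug in RHCS itself: Bug No. 453000.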
 
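   I then looked at the /usr/share/cluster/ip.sh script, substituted real IP addresses for its variables, and ran the commands below. They expose the bug in the original ip.sh: when the floating IP address is a substring of a broadcast address, grep matches two address lines, and when the cluster removes the IP it takes the first match, so my physical NIC's IP address is lost.
   Setting the floating IP address to 192.168.150.2 would trigger the same problem, because that string is a substring of every address on this subnet that starts with 192.168.150.2. Funny!
The tests (awk's $w refers to an unset variable, so it evaluates to $0 and prints the whole matched line):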
 
[root@ngccintfB ~]# /sbin/ip addr list | grep "192.168.150.25" | head -n 2 | awk '{print $w}'

          inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0

inet 192.168.150.25/24 scope global secondary eth0

[root@ngccintfB ~]# /sbin/ip addr list | grep "192.168.150.25/" | head -n 1 | awk '{print $w}'

    inet 192.168.150.25/24 scope global secondary eth0
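
The same behaviour can be reproduced without a cluster or root access. The following is only a minimal sketch: it assumes the two inet lines from the test output above as canned input, and the function names sample_output, match_unpatched and match_patched are made up for this illustration (they do not exist in ip.sh).

```shell
#!/bin/sh
# Canned excerpt of `ip addr list`, copied from the test output above,
# so the demo does not depend on the local network configuration.
sample_output() {
cat <<'EOF'
    inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0
    inet 192.168.150.25/24 scope global secondary eth0
EOF
}

# Same pipeline as the original ip.sh line: plain match on the address.
match_unpatched() {
    sample_output | grep "$1" | head -n 1 | awk '{print $2}'
}

# Same pipeline with the trailing "/" appended to the pattern.
match_patched() {
    sample_output | grep "$1/" | head -n 1 | awk '{print $2}'
}

echo "unpatched: $(match_unpatched 192.168.150.25)"
echo "patched:   $(match_patched 192.168.150.25)"
```

With this input the unpatched pipeline returns 192.168.150.22/24 (the physical address, matched through its brd field), while the patched one returns 192.168.150.25/24.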

4)         Fix:

After reading the bug description, I edited /usr/share/cluster/ip.sh directly, changing the line

  addr=`/sbin/ip addr list | grep "$addr" | head -n 1 | awk '{print $2}'`

to

  addr=`/sbin/ip addr list | grep "$addr/" | head -n 1 | awk '{print $2}'`
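
One caveat about the trailing-slash pattern: grep still reads it as a regular expression, so every "." in the address matches any character. A stricter variant compares the address field exactly. The following is only a sketch of that idea, not what the upstream patch does, and find_exact_addr is a hypothetical helper name of my own.

```shell
#!/bin/sh
# Hypothetical helper (not part of ip.sh): read `ip addr list` output on
# stdin and print the first IPv4 "address/prefix" whose address part is
# exactly equal to $1.
find_exact_addr() {
    awk -v want="$1" '
        $1 == "inet" {
            split($2, parts, "/")   # parts[1] = address, parts[2] = prefix length
            if (parts[1] == want) { print $2; exit }
        }'
}

# Demo on the two lines from the test output; on a live node the input
# would instead come from: /sbin/ip addr list | find_exact_addr "$addr"
printf '%s\n' \
    '    inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0' \
    '    inet 192.168.150.25/24 scope global secondary eth0' |
    find_exact_addr 192.168.150.25
```

Because the comparison is a string equality on the address alone, neither the brd field nor a longer address such as 192.168.150.255 can match.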
 
Watching the log with tail -f /var/log/messages during a service relocation shows that 192.168.150.25 is now the address being removed:
Oct 26 12:45:22 ngccintfB clurgmgrd[13157]: <notice> Stopping service service:tomcat_services
Oct 26 12:45:23 ngccintfB clurgmgrd: [13157]: <info> unmounting /oradata2
Oct 26 12:45:23 ngccintfB clurgmgrd: [13157]: <info> Removing IPv4 address 192.168.150.25/24 from eth0
Oct 26 12:45:33 ngccintfB clurgmgrd[13157]: <notice> Service service:tomcat_services is stopped
With that, the problem is solved!
5)           Reference material
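
For a longer log, the "Removing IPv4 address" messages can be pulled out and compared with the floating IP instead of being read by eye. This is a rough sketch that assumes the message format shown in the log lines above; removed_addrs and only_removed are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and print every address
# that rgmanager reported removing, one per line.
removed_addrs() {
    sed -n 's|.*Removing IPv4 address \([0-9./]*\) from .*|\1|p'
}

# Succeeds if no address other than $1 (in addr/prefix form) was removed.
# Note that it also succeeds when the log contains no removals at all.
only_removed() {
    ! removed_addrs | grep -v -x -F -q "$1"
}

# Demo on the log line quoted above.
log='Oct 26 12:45:23 ngccintfB clurgmgrd: [13157]: <info> Removing IPv4 address 192.168.150.25/24 from eth0'
if printf '%s\n' "$log" | only_removed 192.168.150.25/24; then
    echo "only the floating IP was removed"
fi
```

On a node the input would come from the log file itself, for example only_removed 192.168.150.25/24 < /var/log/messages.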

Problem description: On a cluster node where there is an rgmanager IP resource that is a substring match to the broadcast address of any interface, stopping or relocating that service will bring down that entire interface.  If this happens to be the interface used for cluster communication then the node misses its heartbeats and gets fenced.

Example: ashprdgfs01 has eth0 = 172.20.200.21/24.  This means the broadcast address is 172.20.200.255.  Stopping the following resource:

  <ip address="172.20.200.25" monitor_link="1"/>

causes the wrong ip address to be removed:

  Jun 23 15:18:27 ashprdgfs01 clurgmgrd[3889]: <notice> Stopping service service:VIP

  Jun 23 15:18:27 ashprdgfs01 clurgmgrd: [3889]: <info> Removing IPv4 address 172.20.200.21/24 from eth0

Since this is the main ip for eth0, this kills cluster communication and the node gets fenced:

  Jun 23 15:18:42 ashprdgfs02 kernel: dlm: closing connection to node 2

  Jun 23 15:18:42 ashprdgfs02 fenced[3199]: fencing node "ashprdgfs01.gspt.net"

The problem is on line 714 of ip.sh:

  addr=`/sbin/ip addr list | grep "$addr" | head -n 1 | awk '{print $2}'`

Because 172.20.200.25 is a substring of 172.20.200.255, this actually returns 2 lines:

   inet 172.20.200.21/24 brd 172.20.200.255 scope global eth0

   inet 172.20.200.25/24 scope global secondary eth0

and the first one is chosen mistakenly, causing it to be the one removed (718):

   /sbin/ip -f inet addr del dev $dev $addr

The address we want will always be followed by a '/' for the subnet, so this can be fixed easily like so:

   addr=`/sbin/ip addr list | grep "$addr/" | head -n 1 | awk '{print $2}'`