A mysql-mmm troubleshooting case

Keywords: FATAL Couldn't configure IP 'x.x.x.x' on interface 'eth1': undef

Symptom:
When pinging the agents' virtual IPs from the mmm_monitor host, one of them cannot be reached.
# mmm_control show                   
# Warning: agent on host db3 is not reachable
  db1(10.1.1.15) master/ONLINE. Roles: reader(10.1.1.23), writer(10.1.1.20)
  db2(10.1.1.14) master/ONLINE. Roles: reader(10.1.1.22)
  db3(10.1.1.13) slave/ONLINE. Roles: reader(10.1.1.21)
# Role writer is assigned to it's preferred host db1.

# ping 10.1.1.21
PING 10.1.1.21 (10.1.1.21) 56(84) bytes of data.
From 10.1.1.12 icmp_seq=2 Destination Host Unreachable
From 10.1.1.12 icmp_seq=3 Destination Host Unreachable
From 10.1.1.12 icmp_seq=4 Destination Host Unreachable

--- 10.1.1.21 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2998ms, pipe 3
# ping 10.1.1.22
PING 10.1.1.22 (10.1.1.22) 56(84) bytes of data.
64 bytes from 10.1.1.22: icmp_seq=1 ttl=64 time=0.102 ms

--- 10.1.1.22 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.102/0.102/0.102/0.000 ms

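On db3's physical host (10.1.1.13):
Check whether the virtual IP 10.1.1.21 is present. It turns out the IP was never configured on this machine.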
# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:80:3f:03:47:ce brd ff:ff:ff:ff:ff:ff
    inet 6.6.6.6/28 brd 122.225.32.143 scope global eth0
    inet6 fe80::280:3fff:fe03:47ce/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:80:3f:03:47:cf brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.13/24 brd 10.1.1.255 scope global eth1
    inet6 fe80::280:3fff:fe03:47cf/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
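
The same check can be scripted instead of read off by eye. The following is only a sketch and not part of mysql-mmm: has_ip is a hypothetical helper name, and it assumes the `ip addr` output format shown above, where each IPv4 address appears on a line starting with "inet".

```shell
# has_ip (hypothetical helper): read `ip addr` output on stdin and
# succeed only if the address given as $1 is listed as an inet address.
has_ip() {
    awk -v ip="$1" '
        $1 == "inet" { split($2, a, "/"); if (a[1] == ip) found = 1 }
        END { exit !found }
    '
}

# Possible use on the agent host (interface and address taken from this case):
#   ip addr show dev eth1 | has_ip 10.1.1.21 && echo configured || echo missing
```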
 

Check the mysql-mmm-agent log:
2011/06/02 20:07:50  INFO Changing active master to 'db1'
2011/06/02 20:07:50 FATAL Failed to change master to 'db1': undef
2011/06/02 20:07:50 FATAL Couldn't configure IP '10.1.1.21' on interface 'eth1': undef
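
When the log is long, pulling out only the FATAL messages can help. This is a minimal sketch that assumes the "date time LEVEL message" layout of the lines above. fatal_messages is a hypothetical helper name, and the log path in the usage comment is an assumption that should be adjusted to the local installation.

```shell
# fatal_messages (hypothetical helper): read agent log lines on stdin and
# print the message text of every FATAL entry, without the timestamp.
fatal_messages() {
    awk '$3 == "FATAL" { $1 = $2 = $3 = ""; sub(/^ +/, ""); print }'
}

# Possible use (log path is an assumption, not taken from this case):
#   fatal_messages < /var/log/mysql-mmm/mmm_agentd.log | sort | uniq -c
```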

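Searching Google for the message in the mysql-mmm-agent log led to the fix. Running the agent's configure_ip helper by hand exposes the underlying error: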
# /usr/lib/mysql-mmm/agent/configure_ip eth1 10.1.1.21
Can't locate Net/ARP.pm in @INC (@INC contains: /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /usr/lib/perl5/vendor_perl/5.8.8/MMM/Agent/Helpers/Network.pm line 11.
BEGIN failed--compilation aborted at /usr/lib/perl5/vendor_perl/5.8.8/MMM/Agent/Helpers/Network.pm line 11.
Compilation failed in require at /usr/lib/perl5/vendor_perl/5.8.8/MMM/Agent/Helpers/Actions.pm line 5.
BEGIN failed--compilation aborted at /usr/lib/perl5/vendor_perl/5.8.8/MMM/Agent/Helpers/Actions.pm line 5.
Compilation failed in require at /usr/lib/mysql-mmm/agent/configure_ip line 6.
BEGIN failed--compilation aborted at /usr/lib/mysql-mmm/agent/configure_ip line 6.
So the Perl module Net::ARP (Net/ARP.pm) is not installed. Install it now:

# perl -MCPAN -e shell
cpan> install Net::ARP
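Once the installation finishes, use mmm_control on the monitor host to set db3 offline and then online again, then test whether the virtual IP answers ping.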
# mmm_control set_offline db3
OK: State of 'db3' changed to ADMIN_OFFLINE. Now you can wait some time and check all roles!
# mmm_control set_online db3 
OK: State of 'db3' changed to ONLINE. Now you can wait some time and check its new roles!
# mmm_control show
  db1(10.1.1.15) master/ONLINE. Roles: reader(10.1.1.23), writer(10.1.1.20)
  db2(10.1.1.14) master/ONLINE. Roles: reader(10.1.1.22)
  db3(10.1.1.13) slave/ONLINE. Roles: reader(10.1.1.21)
# Role writer is assigned to it's preferred host db1.
# ping 10.1.1.21
PING 10.1.1.21 (10.1.1.21) 56(84) bytes of data.
64 bytes from 10.1.1.21: icmp_seq=1 ttl=64 time=0.181 ms
64 bytes from 10.1.1.21: icmp_seq=2 ttl=64 time=0.079 ms

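The problem is solved.
To summarize:
This problem was actually left behind by a careless installation. Because db3 is a pure slave, it was normally accessed through its real IP, so the virtual IP was never used and mmm_monitor showed no sign of a fault. The problem only surfaced when read/write splitting was being configured and the slave's virtual IP came into use.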

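So for any architecture that is going into production, it is best to follow the official documentation and check every item carefully, to avoid unnecessary failures.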
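
One way to make part of that item-by-item check repeatable is to ping every role IP that `mmm_control show` reports. This is only a sketch under stated assumptions: role_ips and check_role_ips are hypothetical helper names, the parsing relies on the `mmm_control show` output format seen in this case, and the ping options are those of Linux iputils.

```shell
# role_ips (hypothetical helper): read `mmm_control show` output on stdin
# and print one "host ip" pair for every role IP listed after "Roles:".
role_ips() {
    awk '/Roles:/ {
        host = $1; sub(/\(.*/, "", host)          # "db1(10.1.1.15)" -> "db1"
        rest = $0; sub(/^.*Roles:/, "", rest)     # keep only the role list
        while (match(rest, /\([0-9.]+\)/)) {
            print host, substr(rest, RSTART + 1, RLENGTH - 2)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

# check_role_ips (hypothetical helper): ping each role IP once and report it.
check_role_ips() {
    role_ips | while read -r host ip; do
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            echo "OK   $host $ip"
        else
            echo "FAIL $host $ip"
        fi
    done
}

# Possible use on the monitor host:
#   mmm_control show | check_role_ips
```

In this case such a check would have reported a FAIL line for db3 and 10.1.1.21 before the virtual IP was needed.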