MHA: the impact of one slave going down

Fan_-_

Article link: https://blog.csdn.net/ashic/article/details/104595988

Environment

IP	Role	Notes	mha4mysql-node	mha4mysql-manager
192.168.98.11	master	read-write	√
192.168.98.10	slave	read-only	√
192.168.98.12	slave	read-only	√
192.168.98.13	manager node	N/A	√	√

A node is down before the manager starts

Manually stop mysqld on one slave, 192.168.98.10, then try to start masterha_manager:

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf

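Startup fails. The log shows the following (note that 192.168.98.12 has no_master set at this point, so with 192.168.98.10 down no slave can become master):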

Fri Feb 28 14:47:58 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 14:47:59 2020 - [info] GTID failover mode = 1
Fri Feb 28 14:47:59 2020 - [info] Dead Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Slaves:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 14:47:59 2020 - [info]     GTID ON
Fri Feb 28 14:47:59 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 14:47:59 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info] Checking slave configurations..
Fri Feb 28 14:47:59 2020 - [info] Checking replication filtering settings..
Fri Feb 28 14:47:59 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 14:47:59 2020 - [info]  Replication filtering check ok.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 14:47:59 2020 - [info] Got exit code 1 (Not master dead).

You should check the replication status with masterha_check_repl first:

#masterha_check_repl --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
Fri Feb 28 15:27:24 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:27:25 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:27:25 2020 - [info] Dead Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Slaves:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:27:25 2020 - [info]     GTID ON
Fri Feb 28 15:27:25 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:27:25 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info] Checking slave configurations..
Fri Feb 28 15:27:25 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:27:25 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:27:25 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:27:25 2020 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!
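
Since masterha_manager and masterha_check_repl fail on the same configuration check, a start script can gate on the final line of the check. The following is a minimal sketch, not part of MHA: `repl_health_ok` is a hypothetical helper name, and the success wording `MySQL Replication Health is OK` is an assumption, because only the NOT OK form appears in the output above.

```shell
# repl_health_ok: read masterha_check_repl output on stdin and succeed
# only if it reports a healthy replication topology.
repl_health_ok() {
  _rh_out=$(cat)
  case $_rh_out in
    # A failing run ends with "MySQL Replication Health is NOT OK!".
    *'MySQL Replication Health is NOT OK'*) return 1 ;;
    # Assumed wording of the success line; adjust if your version differs.
    *'MySQL Replication Health is OK'*)     return 0 ;;
    # No verdict at all (for example, the command was not found).
    *)                                      return 1 ;;
  esac
}
```

With the paths used in this article, it could be called as `masterha_check_repl --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf 2>&1 | repl_health_ok`, starting masterha_manager only when it succeeds.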

The documentation at https://github.com/yoshinorim/mha4mysql-manager/wiki/masterha_manager says:

--ignore_fail_on_start

By default, master monitoring (not failover) process stops if one or more slaves are down, regardless of “ignore_fail” parameter setting. By setting --ignore_fail_on_start, master monitoring does not stop if ignore_fail marked slaves are down.

In other words, if ignore_fail=1 is set for 192.168.98.10 in the configuration file, adding --ignore_fail_on_start lets masterha_manager start. Without ignore_fail=1 in the configuration file, masterha_manager cannot start even when --ignore_fail_on_start is given.
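
Whether --ignore_fail_on_start can help therefore depends on a per-server setting in the configuration file. A small sketch of how one might look it up before starting the manager; `server_option` is a hypothetical helper, and it assumes the simple `key=value` layout with `[serverN]` sections used in this article:

```shell
# server_option CONF HOST KEY
# Print the value of KEY in the [serverN] block whose hostname is HOST.
# Prints nothing if the key is absent or commented out.
server_option() {
  awk -F= -v host="$2" -v key="$3" '
    function emit_block() { if (h == host && v != "") print v }
    /^[ \t]*#/ { next }                          # skip comment lines
    /^\[/      { emit_block(); h = ""; v = ""; next }   # a new [section] starts
    {
      k = $1; val = $2
      gsub(/[ \t\r]/, "", k); gsub(/[ \t\r]/, "", val)
      if (k == "hostname") h = val
      if (k == key)        v = val
    }
    END { emit_block() }
  ' "$1"
}
```

For example, `server_option /etc/masterha/conf/cls_all.cnf 192.168.98.10 ignore_fail` prints `1` only when the option is set and not commented out for that host.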

With ignore_fail=1

#cat /etc/masterha/conf/cls_all.cnf 
[server default]
#workdir on the management server
manager_workdir=/masterha/cls_all/
manager_log=/masterha/cls_all/manager.log

#workdir on the node for mysql server
master_binlog_dir=/data/mysql_3306/data/

#script called to move the VIP during automatic failover
master_ip_failover_script=/etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100

#script called during a manual switchover
master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change_vip --vip=192.168.98.100

#check master availability
secondary_check_script=masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12

[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1

Startup succeeds

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start

Fri Feb 28 15:59:37 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:59:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:59:38 2020 - [info] Dead Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Slaves:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:59:38 2020 - [info]     GTID ON
Fri Feb 28 15:59:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Checking slave configurations..
Fri Feb 28 15:59:38 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:59:38 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:59:38 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:59:38 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:59:38 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Feb 28 15:59:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Fri Feb 28 15:59:39 2020 - [info] 
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)

Fri Feb 28 15:59:39 2020 - [info] Checking master_ip_failover_script status:
Fri Feb 28 15:59:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=status --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 
Fri Feb 28 15:59:39 2020 - [info]  OK.
Fri Feb 28 15:59:39 2020 - [warning] shutdown_script is not defined.
Fri Feb 28 15:59:39 2020 - [info] Set master ping interval 3 seconds.
Fri Feb 28 15:59:39 2020 - [info] Set secondary check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12
Fri Feb 28 15:59:39 2020 - [info] Starting ping health check on 192.168.98.11(192.168.98.11:3306)..
Fri Feb 28 15:59:39 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Without ignore_fail=1

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
# ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1

Startup fails

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start

Fri Feb 28 15:58:57 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:58:58 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:58:58 2020 - [info] Dead Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Slaves:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:58:58 2020 - [info]     GTID ON
Fri Feb 28 15:58:58 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Checking slave configurations..
Fri Feb 28 15:58:58 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:58:58 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:58:58 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:58:58 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 402.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:58:58 2020 - [info] Got exit code 1 (Not master dead).

Also, even with ignore_fail=1 set, startup fails if the only remaining slave, 192.168.98.12, has no_master=1.

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
no_master=1

Startup fails with "None of slaves can be master"

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start


Fri Feb 28 15:55:14 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:55:16 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:55:16 2020 - [info] Dead Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Slaves:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:55:16 2020 - [info]     GTID ON
Fri Feb 28 15:55:16 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:55:16 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info] Checking slave configurations..
Fri Feb 28 15:55:16 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:55:16 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:55:16 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:55:16 2020 - [info] Got exit code 1 (Not master dead).
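
The "None of slaves can be master" error above follows from the configuration: the only live slave has no_master=1. The sketch below is a rough pre-flight approximation and not MHA's actual check (the error message also points at log-bin settings, which this ignores). `promotable_slaves` is a hypothetical helper that reads the same `[serverN]` layout as cls_all.cnf.

```shell
# promotable_slaves CONF SKIP_HOST...
# Print hostnames from [serverN] blocks that do not set no_master=1,
# leaving out every SKIP_HOST (the current master and any dead servers).
promotable_slaves() {
  _ps_conf=$1; shift
  awk -F= -v skiplist="$*" '
    function emit_block() { if (h != "" && nm != "1" && !(h in skip)) print h }
    BEGIN { n = split(skiplist, a, " "); for (i = 1; i <= n; i++) skip[a[i]] = 1 }
    /^[ \t]*#/ { next }                            # commented-out options do not count
    /^\[/      { emit_block(); h = ""; nm = ""; next }
    {
      k = $1; val = $2
      gsub(/[ \t\r]/, "", k); gsub(/[ \t\r]/, "", val)
      if (k == "hostname")  h  = val
      if (k == "no_master") nm = val
    }
    END { emit_block() }
  ' "$_ps_conf"
}
```

With 192.168.98.11 as master and 192.168.98.10 down, `promotable_slaves /etc/masterha/conf/cls_all.cnf 192.168.98.11 192.168.98.10` prints nothing for the configuration above, which matches the failed start.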

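A node goes down while the manager is running

If a slave goes down while masterha_manager is running, masterha_manager does not appear to notice: the process does not exit and the log shows no error.

masterha_check_status still reports a normal state: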

#masterha_check_status --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11
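
The status shown here reflects only the master ping, so a monitoring script should read PING_OK as "the master answers", not "the whole topology is healthy". A minimal sketch for pulling the status name out of that line; `ping_status` is a hypothetical helper, and the `(code:NAME)` shape is taken from the single output line above:

```shell
# ping_status: read masterha_check_status output on stdin and print
# the status name found inside "(code:NAME)", for example PING_OK.
ping_status() {
  sed -n 's/.*(\([0-9][0-9]*\):\([A-Z_][A-Z_0-9]*\)).*/\2/p'
}
```

To catch a dead slave as well, such a script would still need to run masterha_check_repl, which does list dead servers.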

But a manual switchover fails:

#/usr/local/bin/masterha_master_switch --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --master_state=alive --new_master_host=192.168.98.12 --new_master_port=3306 --orig_master_is_new_slave --interactive=0
Fri Feb 28 15:33:34 2020 - [info] MHA::MasterRotate version 0.58.
Fri Feb 28 15:33:34 2020 - [info] Starting online master switch..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:35 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln94] Switching master should not be started if one or more servers is down.
Fri Feb 28 15:33:35 2020 - [info] Dead Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

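The "Dead Servers:" section lists the servers that are down.

If the master 192.168.98.11 goes down before 192.168.98.10 is repaired, and 192.168.98.12 has no_master set, automatic failover fails because no server is available to become the new master.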
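
The "Dead Servers:" entries can be extracted from a log with a short awk filter. This is a sketch that assumes the log line layout shown in this article (timestamp, `[info]`, then `host(ip:port)`); `dead_servers` is a hypothetical helper name.

```shell
# dead_servers: read an MHA log on stdin and print each host listed under
# a "Dead Servers:" heading once.
dead_servers() {
  awk '
    /\[info\] Dead Servers:/ { in_dead = 1; next }
    # Entries look like "... - [info]   host(ip:port)"; keep the host part.
    in_dead && $NF ~ /^[^()]+\([^()]+:[0-9]+\)$/ {
      host = $NF; sub(/\(.*/, "", host); print host; next
    }
    { in_dead = 0 }   # any other line ends the section
  ' | sort -u
}
```

For example, `dead_servers < /masterha/cls_all/manager.log` reads the manager_log path configured in cls_all.cnf.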

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
no_master=1

Shut down 192.168.98.11:

Fri Feb 28 15:35:38 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:38 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 15:35:38 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 15:35:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 15:35:40 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 15:35:41 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:41 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 15:35:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:44 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:47 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 15:35:47 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 15:35:47 2020 - [warning] SSH is reachable.
Fri Feb 28 15:35:47 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 15:35:47 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:48 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:48 2020 - [info] Dead Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:48 2020 - [info]     GTID ON
Fri Feb 28 15:35:48 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:48 2020 - [info] Checking slave configurations..
Fri Feb 28 15:35:48 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:35:48 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:35:48 2020 - [info] Master is down!
Fri Feb 28 15:35:48 2020 - [info] Terminating monitoring script.
Fri Feb 28 15:35:48 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 15:35:48 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 15:35:48 2020 - [info] Starting master failover.
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:48 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:49 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:49 2020 - [info] Dead Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 15:35:49 2020 - [info]  ok.
Fri Feb 28 15:35:49 2020 - [info] Alive Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:49 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:49 2020 - [info]     GTID ON
Fri Feb 28 15:35:49 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.

The key messages are:

Not candidate for the new Master (no_master is set)

Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings

The VIP is still on the original master, 192.168.98.11:

root@localhost 14:40:38 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:35:04 [(none)]> shutdown;
Query OK, 0 rows affected (0.00 sec)

root@localhost 15:35:37 [(none)]> 2020-02-28T07:35:50.083534Z mysqld_safe mysqld from pid file /data/mysql_3306/run/mysql.pid ended

root@localhost 15:36:40 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever

192.168.98.12 is still a slave and does not hold the VIP:

root@localhost 15:35:32 [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 192.168.98.11
                  Master_User: repler
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 2496
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 1354
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2496
              Relay_Log_Space: 1561
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'repler@192.168.98.11:3306' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 98113306
                  Master_UUID: 68703597-592c-11ea-88b3-000c2998280b
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 200228 15:35:45
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
            Executed_Gtid_Set: 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,
68703597-592c-11ea-88b3-000c2998280b:1-4
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

root@localhost 15:36:32 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:36:37 [(none)]>
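
The manual `ip a` inspection above can be turned into a scripted check. A minimal sketch, assuming the `inet ADDRESS/PREFIX` line format shown in the output; `has_vip` is a hypothetical helper name:

```shell
# has_vip VIP: read "ip a" output on stdin and succeed if VIP is
# configured as an IPv4 address on any interface.
has_vip() {
  # -F matches the dots literally; the trailing "/" keeps
  # 192.168.98.10 from matching 192.168.98.100.
  grep -qF "inet $1/"
}
```

For example, `ssh root@192.168.98.12 ip a | has_vip 192.168.98.100` would fail in the situation above, because the VIP never left 192.168.98.11.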

If a candidate master exists, that is, 192.168.98.12 does not have no_master=1, automatic failover works:

Fri Feb 28 16:16:27 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:27 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 16:16:27 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 16:16:28 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 16:16:28 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 16:16:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:30 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 16:16:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:33 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:36 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 16:16:36 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 16:16:36 2020 - [warning] SSH is reachable.
Fri Feb 28 16:16:36 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 16:16:36 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:37 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:37 2020 - [info] Dead Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:37 2020 - [info]     GTID ON
Fri Feb 28 16:16:37 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Checking slave configurations..
Fri Feb 28 16:16:37 2020 - [info] Checking replication filtering settings..
Fri Feb 28 16:16:37 2020 - [info]  Replication filtering check ok.
Fri Feb 28 16:16:37 2020 - [info] Master is down!
Fri Feb 28 16:16:37 2020 - [info] Terminating monitoring script.
Fri Feb 28 16:16:37 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 16:16:37 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 16:16:37 2020 - [info] Starting master failover.
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:37 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:38 2020 - [info] Dead Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 16:16:38 2020 - [info]  ok.
Fri Feb 28 16:16:38 2020 - [info] Alive Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:38 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:38 2020 - [info]     GTID ON
Fri Feb 28 16:16:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Starting GTID based failover.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Feb 28 16:16:38 2020 - [info] Executing master IP deactivation script:
Fri Feb 28 16:16:38 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --command=stopssh --ssh_user=root  
Disabling the VIP on old master: 192.168.98.11 
Fri Feb 28 16:16:39 2020 - [info]  done.
Fri Feb 28 16:16:39 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Feb 28 16:16:39 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Oldest slaves:
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Searching new master from slaves..
Fri Feb 28 16:16:39 2020 - [info]  Candidate masters from the configuration file:
Fri Feb 28 16:16:39 2020 - [info]  Non-candidate masters:
Fri Feb 28 16:16:39 2020 - [info] New master is 192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:39 2020 - [info] Starting master failover..
Fri Feb 28 16:16:39 2020 - [info] 
From:
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)

To:
192.168.98.12(192.168.98.12:3306) (new master)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info]  Waiting all logs to be applied.. 
Fri Feb 28 16:16:39 2020 - [info]   done.
Fri Feb 28 16:16:39 2020 - [info] Getting new master's binlog name and position..
Fri Feb 28 16:16:39 2020 - [info]  mysql-bin.000001:2496
Fri Feb 28 16:16:39 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.98.12', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Feb 28 16:16:39 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000001, 2496, 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,
68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Executing master IP activate script:
Fri Feb 28 16:16:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=start --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --new_master_host=192.168.98.12 --new_master_ip=192.168.98.12 --new_master_port=3306 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 192.168.98.100 on the new master - 192.168.98.12 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Feb 28 16:16:39 2020 - [info]  OK.
Fri Feb 28 16:16:39 2020 - [info] ** Finished master recovery successfully.
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] All new slave servers recovered successfully.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 5: New master cleanup phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Resetting slave info on the new master..
Fri Feb 28 16:16:39 2020 - [info]  192.168.98.12: Resetting slave info succeeded.
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2045] Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] 

----- Failover Report -----

cls_all: MySQL Master failover 192.168.98.11(192.168.98.11:3306) to 192.168.98.12(192.168.98.12:3306)

Master 192.168.98.11(192.168.98.11:3306) is down!

Check MHA Manager logs at localhost.localdomain:/masterha/cls_all/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.98.11(192.168.98.11:3306)
Selected 192.168.98.12(192.168.98.12:3306) as a new master.
192.168.98.12(192.168.98.12:3306): OK: Applying all logs succeeded.
192.168.98.12(192.168.98.12:3306): OK: Activated master IP address.
192.168.98.12(192.168.98.12:3306): Resetting slave info succeeded.
192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] Sending mail..
sh: /etc/masterha/scripts/send_report: No such file or directory
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2089] Failed to send mail with return code 127:0

However, because 192.168.98.10 was unreachable, the failover report states that recovery on the slaves partially failed:

192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
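
These report lines can be picked out mechanically when a failover has to be triaged quickly. The following is a minimal Python sketch and not part of MHA: the function names `unrecovered_slaves` and `change_master_statement` are hypothetical, and the regular expressions assume the exact log wording shown above (MHA 0.58, GTID mode). It lists the slaves that the report marks as ERROR and extracts the CHANGE MASTER statement that the manager logged in Phase 3.3, which is presumably what would be run on 192.168.98.10 once it is reachable again. The log masks the password as xxx, so the real replication password would have to be substituted.

```python
import re

# Assumed report format, taken from the log above:
#   192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
FAILED_SLAVE_RE = re.compile(r"^([^\s(]+)\(([^\s:]+):(\d+)\): ERROR: (.*)$")

# Assumed hint format, taken from the Phase 3.3 log line above.
CHANGE_MASTER_RE = re.compile(r"Statement should be: (CHANGE MASTER TO .*?;)")


def unrecovered_slaves(log_text):
    """Return (host, port, reason) for every slave the report marks as ERROR."""
    result = []
    for line in log_text.splitlines():
        match = FAILED_SLAVE_RE.match(line.strip())
        if match:
            result.append((match.group(2), int(match.group(3)), match.group(4)))
    return result


def change_master_statement(log_text):
    """Return the CHANGE MASTER statement logged by the manager, or None."""
    match = CHANGE_MASTER_RE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical usage: the path comes from the "Check MHA Manager logs at" line.
    with open("/masterha/cls_all/manager.log") as handle:
        text = handle.read()
    for host, port, reason in unrecovered_slaves(text):
        print(f"{host}:{port} was not recovered: {reason}")
    print(change_master_statement(text))
```

The sketch only reads the log. Actually re-pointing 192.168.98.10 at the new master remains a manual step.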

Nevertheless, the failover itself succeeded, and the VIP has moved to 192.168.98.12:

root@localhost 16:16:16 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 16:27:37 [(none)]> show slave status\G
Empty set (0.00 sec)

root@localhost 16:27:43 [(none)]> show global variables like '%read_only%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | OFF   |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
| tx_read_only          | OFF   |
+-----------------------+-------+
5 rows in set (0.00 sec)
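
The checks above (the VIP 192.168.98.100 is bound on 192.168.98.12, and the read_only flags are OFF) can also be scripted. The following is a minimal Python sketch that works on captured output only. The function names are hypothetical, and the parsing assumes the `ip a` layout and the mysql client table layout shown above.

```python
import re


def bound_ipv4_addresses(ip_a_output):
    """Return every IPv4 address found on an 'inet' line of `ip a` output."""
    return re.findall(
        r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", ip_a_output, flags=re.MULTILINE
    )


def parse_variables(table_text):
    """Turn the ASCII table printed by SHOW GLOBAL VARIABLES into a dict."""
    variables = {}
    for line in table_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # skip borders, prompts and the row-count line
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if len(cells) == 2 and cells[0] != "Variable_name":
            variables[cells[0]] = cells[1]
    return variables


def is_writable_master(ip_a_output, table_text, vip):
    """True if the VIP is bound locally and the server accepts writes."""
    variables = parse_variables(table_text)
    return (
        vip in bound_ipv4_addresses(ip_a_output)
        and variables.get("read_only") == "OFF"
        # super_read_only is missing on some versions; treat absence as OFF.
        and variables.get("super_read_only", "OFF") == "OFF"
    )
```

Called with the two outputs captured on 192.168.98.12 and `vip="192.168.98.100"`, `is_writable_master` should return True for the state shown above.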