MHA Failover Testing (Part 1)

TL;DR

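| Test case | ping_type=CONNECT | ping_type=INSERT |
|---|---|---|
| master: too many connections | no failover | no failover |
| master hang | no failover | failover triggered, succeeds |
| only the manager cannot reach the master | no failover | no failover |
| manager cannot reach the master, and cannot SSH to slave1 | no failover | no failover |
| manager cannot reach the master, and cannot SSH to slave1 or slave2 | no failover | no failover |
| manager cannot reach the master; slave1 (reached via SSH) cannot reach it either | no failover | no failover |
| manager cannot reach the master; slave1 and slave2 (reached via SSH) both cannot reach it | failover triggered, succeeds | failover triggered, succeeds (only after the persistent connection drops) |
| slave1 went down before the master went down | failover triggered, but fails | failover triggered, but fails |
| master down; slave1 io_thread stopped beforehand | failover triggered, succeeds | failover triggered, succeeds |
| master down; slave1 io_thread in error beforehand | failover triggered, succeeds | failover triggered, succeeds |
| master down; slave1 sql_thread stopped beforehand | failover triggered, succeeds | failover triggered, succeeds |
| master down; slave1 sql_thread in error beforehand | failover triggered, but fails | failover triggered, but fails |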

Environment

master:  172.16.120.10 centos-1, master + ProxySQL
slave1:  172.16.120.11 centos-2, slave + ProxySQL
slave2:  172.16.120.12 centos-3, slave + ProxySQL
manager: 172.16.120.13 centos-4, MHA manager

MHA configuration

#cat /etc/masterha/conf/masterha_default.cnf 
[server default]
# MySQL user and password (do not put quotes around the password here)
user=mha
password=xxxx

#replication_user
repl_user=repler
repl_password=xxxx

#checking master every 3 seconds
ping_interval=3

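# CONNECT opens a new short-lived connection for every check;
# the default (SELECT) and INSERT reuse a persistent connection
ping_type=INSERT
#ping_type=CONNECT
# both types are tested below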

#ssh user
ssh_user=root

# script that sends the notification email
report_script=/etc/masterha/scripts/send_report

# working directory on each node
remote_workdir=/masterha/


#cat /etc/masterha/conf/cls_new.cnf
[server default]
#workdir on the management server
manager_workdir=/masterha/cls_new/
manager_log=/masterha/cls_new/manager.log

#workdir on the node for mysql server
master_binlog_dir=/data/mysql_3358/data/

# script called to move the VIP during automatic failover
master_ip_failover_script=/etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128

# script called to move the VIP during a manual (online) switchover
master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change_vip --vip=172.16.120.128

# secondary check of master availability from the other hosts
secondary_check_script=masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12

[server1]
hostname=172.16.120.10
port=3358
candidate_master=1

[server2]
hostname=172.16.120.11
port=3358
candidate_master=1

[server3]
hostname=172.16.120.12
port=3358
candidate_master=1
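The manager here reads two files, the global defaults and the per-application file, and both contain a [server default] section. A minimal sketch of reading them as layers with Python's configparser follows, assuming (as we understand MHA's behavior) that a value in the application file overrides the same key in the global file. The file contents are abbreviated copies of the ones above; `load_layered` and `effective_server` are hypothetical helpers, not part of MHA.

```python
import configparser

# Abbreviated copies of the two files above (passwords left out).
GLOBAL_CNF = """
[server default]
user=mha
repl_user=repler
ping_interval=3
ping_type=INSERT
#ping_type=CONNECT
ssh_user=root
remote_workdir=/masterha/
"""

APP_CNF = """
[server default]
manager_workdir=/masterha/cls_new/
master_binlog_dir=/data/mysql_3358/data/
secondary_check_script=masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12

[server1]
hostname=172.16.120.10
port=3358
candidate_master=1

[server2]
hostname=172.16.120.11
port=3358
candidate_master=1

[server3]
hostname=172.16.120.12
port=3358
candidate_master=1
"""


def load_layered(global_text: str, app_text: str) -> configparser.ConfigParser:
    """Read the global file, then the app file.

    A key present in both keeps the value read last, so the app file wins
    (assumed to match MHA's precedence).
    """
    # interpolation=None: treat '%' literally instead of as interpolation syntax
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(global_text, source="masterha_default.cnf")
    cp.read_string(app_text, source="cls_new.cnf")
    return cp


def effective_server(cp: configparser.ConfigParser, name: str) -> dict:
    """Settings for one [serverN]: [server default] overlaid by that section."""
    merged = dict(cp["server default"])
    merged.update(cp[name])
    return merged


if __name__ == "__main__":
    srv = effective_server(load_layered(GLOBAL_CNF, APP_CNF), "server1")
    print(srv["hostname"], srv["ping_type"], srv["ping_interval"])
    # prints: 172.16.120.10 INSERT 3
```

Reading the files in this order keeps shared settings (users, ping_type, ping_interval) in one place while each application file only adds its own hosts and paths.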

[Test case] master: too many connections

ping_type=CONNECT

root@localhost 11:43:29 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time   | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |      7 |                                                               | NULL             |         0 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |      4 |                                                               | NULL             |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |      2 |                                                               | NULL             |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |     14 |                                                               | NULL             |         0 |             0 |
| 1256 | repler   | 172.16.120.11:59594 | NULL               | Binlog Dump GTID | 952922 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1257 | repler   | 172.16.120.12:56540 | NULL               | Binlog Dump GTID | 952902 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |      2 |                                                               | NULL             |         1 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |    120 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |     58 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     17 |                                                               | NULL             |         1 |             0 |
| 1943 | root     | localhost           | dbms_monitor       | Query            |      0 | starting                                                      | show processlist |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 11:43:30 [dbms_monitor]> show global variables like '%max_connec%';
+-----------------------+---------+
| Variable_name         | Value   |
+-----------------------+---------+
| extra_max_connections | 1       |
| max_connect_errors    | 1000000 |
| max_connections       | 1024    |
+-----------------------+---------+
3 rows in set (0.01 sec)

root@localhost 11:49:34 [dbms_monitor]> set global max_connections=5;
Query OK, 0 rows affected (0.01 sec)

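Conclusion: no failover. The first failed ping (1040) triggers the secondary check, which still reaches the master from 172.16.120.11; every later 1040 is logged as "not a MySQL crash" and health checking continues.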

Fri Oct  9 11:42:57 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 11:42:57 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 11:42:57 2020 - [info]  OK.
Fri Oct  9 11:42:57 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 11:42:57 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 11:42:57 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 11:42:57 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 11:42:57 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 11:49:51 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Too many connections at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
1040 (Too many connections)
Fri Oct  9 11:49:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 11:49:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 11:49:51 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct  9 11:49:52 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 11:49:54 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:49:54 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:49:57 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:49:57 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:50:00 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:50:00 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
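The log above shows how the secondary check results are combined: the check stopped at 172.16.120.11 because the master was still reachable from there, so "Failover should not happen". In the INSERT run below, both slaves report the master as not writable and the verdict flips to "Failover should start". The sketch models that aggregation; it is inferred from these logs and the TL;DR table, not taken from masterha_secondary_check. Treating an unreachable monitoring host as inconclusive is an assumption, chosen to match the TL;DR row where neither slave could be reached over SSH and no failover happened.

```python
from enum import Enum
from typing import Iterable, Tuple


class Probe(Enum):
    MASTER_REACHABLE = 1     # e.g. "Master is reachable from <host>!"
    MASTER_UNREACHABLE = 2   # e.g. "Monitoring server <host> is reachable, Master is not writable ..."
    MONITOR_UNREACHABLE = 3  # the monitoring host itself could not be reached over SSH


def secondary_verdict(results: Iterable[Tuple[str, Probe]]) -> str:
    """Aggregate per-host secondary-check results (simplified model)."""
    checked = 0
    for host, probe in results:
        checked += 1
        if probe is Probe.MASTER_REACHABLE:
            # One host that still sees the master vetoes failover;
            # later hosts are not consulted.
            return f"no failover: master reachable from {host}"
        if probe is Probe.MONITOR_UNREACHABLE:
            # Assumption: a host we cannot reach cannot confirm the
            # master is down, so the whole check is inconclusive.
            return f"no failover: monitoring server {host} unreachable"
    if checked == 0:
        return "no failover: no monitoring servers configured"
    # Every host was reachable and none could reach the master.
    return "failover should start"


if __name__ == "__main__":
    print(secondary_verdict([("172.16.120.11", Probe.MASTER_REACHABLE),
                             ("172.16.120.12", Probe.MASTER_UNREACHABLE)]))
    # prints: no failover: master reachable from 172.16.120.11
```

Requiring every monitoring host to agree is what makes the check conservative: a single host that can still see the master is enough to keep the manager from acting on what may be a manager-side network problem.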

ping_type=INSERT

root@localhost 11:55:13 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time   | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |      1 |                                                               | NULL             |         0 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |     18 |                                                               | NULL             |         1 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |     16 |                                                               | NULL             |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |      8 |                                                               | NULL             |         0 |             0 |
| 1256 | repler   | 172.16.120.11:59594 | NULL               | Binlog Dump GTID | 953626 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1257 | repler   | 172.16.120.12:56540 | NULL               | Binlog Dump GTID | 953606 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |      6 |                                                               | NULL             |         1 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |    103 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |     41 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |      1 |                                                               | NULL             |         1 |             0 |
| 1943 | root     | localhost           | dbms_monitor       | Query            |      0 | starting                                                      | show processlist |         0 |             0 |
| 2160 | mha      | 172.16.120.13:34660 | NULL               | Sleep            |      2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 11:55:14 [dbms_monitor]> show global variables like '%max_connec%';
+-----------------------+---------+
| Variable_name         | Value   |
+-----------------------+---------+
| extra_max_connections | 1       |
| max_connect_errors    | 1000000 |
| max_connections       | 1024    |
+-----------------------+---------+
3 rows in set (0.04 sec)

root@localhost 11:55:19 [dbms_monitor]> set global max_connections=5;
Query OK, 0 rows affected (0.00 sec)

root@localhost 11:55:25 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time   | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |      6 |                                                               | NULL             |         0 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |      3 |                                                               | NULL             |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |     31 |                                                               | NULL             |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |      3 |                                                               | NULL             |         1 |             0 |
| 1256 | repler   | 172.16.120.11:59594 | NULL               | Binlog Dump GTID | 953641 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1257 | repler   | 172.16.120.12:56540 | NULL               | Binlog Dump GTID | 953621 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |      0 |                                                               | NULL             |         0 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |    118 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |     56 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |      6 |                                                               | NULL             |         1 |             0 |
| 1943 | root     | localhost           | dbms_monitor       | Query            |      0 | starting                                                      | show processlist |         0 |             0 |
| 2160 | mha      | 172.16.120.13:34660 | NULL               | Sleep            |      2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

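ping_type=INSERT keeps a persistent connection, so it does not notice "Too many connections": MHA's existing session (Id 2160 from 172.16.120.13) is still open after max_connections is lowered.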
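Why the two ping types react differently can be shown with a toy model: lowering max_connections refuses new sessions but leaves existing ones alone. `ToyServer`, `ping_connect` and `PersistentPinger` are hypothetical stand-ins, not a MySQL client; the error numbers 1040 and 2006 are the ones that appear in the logs. The model ignores details such as the extra connection MySQL reserves for administrative users.

```python
class ToyServer:
    """Toy stand-in for mysqld's connection limit; not a MySQL client."""

    def __init__(self, max_connections: int):
        self.max_connections = max_connections
        self.open_ids = set()
        self._next_id = 1

    def connect(self) -> int:
        # Only NEW sessions are checked against the limit (error 1040);
        # lowering max_connections does not close existing sessions.
        if len(self.open_ids) >= self.max_connections:
            raise ConnectionRefusedError(1040, "Too many connections")
        cid = self._next_id
        self._next_id += 1
        self.open_ids.add(cid)
        return cid

    def close(self, cid: int) -> None:
        self.open_ids.discard(cid)

    kill = close  # in this toy, KILL <id> ends a session like a disconnect

    def query(self, cid: int) -> int:
        if cid not in self.open_ids:
            raise ConnectionError(2006, "MySQL server has gone away")
        return 1


def ping_connect(server: ToyServer) -> int:
    """CONNECT-style ping: a fresh session for every check."""
    cid = server.connect()
    try:
        return server.query(cid)
    finally:
        server.close(cid)


class PersistentPinger:
    """INSERT-style ping: one session reused across checks."""

    def __init__(self, server: ToyServer):
        self.server = server
        self.cid = server.connect()

    def ping(self) -> int:
        return self.server.query(self.cid)


if __name__ == "__main__":
    server = ToyServer(max_connections=1024)
    for _ in range(11):                 # the 11 sessions in the first processlist
        server.connect()
    pinger = PersistentPinger(server)   # MHA's INSERT-mode session
    server.max_connections = 5          # set global max_connections=5
    print(pinger.ping())                # 1: the persistent session still works
    try:
        ping_connect(server)
    except ConnectionRefusedError as e:
        print(e.errno, e.strerror)      # 1040 Too many connections
    server.kill(pinger.cid)             # like: kill 2160
    try:
        pinger.ping()
    except ConnectionError as e:
        print(e.errno, e.strerror)      # 2006 MySQL server has gone away
```

The sequence mirrors the INSERT run: the persistent ping keeps succeeding until its session is killed, then fails with 2006, and any fresh connection is refused with 1040.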

To force a reconnect, kill the MHA connection manually:

root@localhost 11:55:29 [dbms_monitor]> kill 2160;
Query OK, 0 rows affected (0.01 sec)

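Conclusion: no failover. Once its session is killed, the INSERT ping fails with 2006 and the secondary check finds the master not writable from either slave ("Failover should start"), but the manager's reconnect then gets 1040, which is not treated as a MySQL crash, so health checking continues.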

Fri Oct  9 11:54:48 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 11:54:48 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 11:54:48 2020 - [info]  OK.
Fri Oct  9 11:54:48 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 11:54:48 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 11:54:48 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 11:54:48 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 11:54:48 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 11:56:42 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 11:56:42 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 11:56:42 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 11:56:43 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
ERROR 1040 (HY000): Too many connections
Monitoring server 172.16.120.11 is reachable, Master is not writable from 172.16.120.11. OK.
ERROR 1040 (HY000): Too many connections
Monitoring server 172.16.120.12 is reachable, Master is not writable from 172.16.120.12. OK.
Fri Oct  9 11:56:43 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct  9 11:56:45 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:56:45 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:56:48 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:56:48 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:56:51 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:56:51 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:56:54 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:56:54 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:56:57 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct  9 11:56:57 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct  9 11:57:00 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
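Read together, the two too-many-connections runs suggest that failover needs two things at once: the secondary check must report the master down from every other host, and the manager's own reconnect must fail with an error that does not mean "alive". The sketch below encodes that reading; it is not MHA's code. `ALIVE_ERROR_CODES` holds only 1040 because that is the only code these logs confirm, treating every other code as a crash is a simplification, and timing and any retry logic are left out.

```python
from typing import Optional

# Error codes this model treats as "server is alive, just refusing us".
# Only 1040 is confirmed by the logs above; MHA may treat further codes the
# same way, and this sketch deliberately does not guess them.
ALIVE_ERROR_CODES = {1040}


def next_action(master_down_per_secondary: bool, reconnect_error: Optional[int]) -> str:
    """Simplified model of the manager's decision after a failed ping.

    master_down_per_secondary: verdict of the secondary check.
    reconnect_error: MySQL error code from the manager's next connect
        attempt, or None if that attempt succeeded.
    """
    if not master_down_per_secondary:
        return "continue health check"               # "Failover should not happen."
    if reconnect_error is None:
        return "continue health check"               # the master answered again
    if reconnect_error in ALIVE_ERROR_CODES:
        return "continue health check: not a MySQL crash"
    return "start failover"


if __name__ == "__main__":
    # INSERT run: secondary check says down, reconnect gets 1040
    print(next_action(True, 1040))   # continue health check: not a MySQL crash
```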

[Test case] master hang

ping_type=CONNECT

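A master hang is hard to reproduce directly, so it is simulated indirectly. First, change the SELECT 1 executed by ping_select (MHA/HealthCheck.pm) into a query against an InnoDB table: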

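sub ping_select($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    # original health-check query, which never touches InnoDB:
    #$sth = $dbh->prepare("SELECT 1 As Value");
    # modified for this test: read an InnoDB table so the ping is
    # subject to innodb_thread_concurrency
    $sth = $dbh->prepare("SELECT 1 As Value from infra.chk_masterha limit 1");
    # ... rest of ping_select unchanged ...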

Then set innodb_thread_concurrency to 1:

root@localhost 12:25:34 [dbms_monitor]> set global innodb_thread_concurrency=1;
Query OK, 0 rows affected (0.00 sec)

Next, run a long query against the same InnoDB table by hand, so that MHA's SELECT is blocked behind it:

root@localhost 12:25:45 [dbms_monitor]> select sleep(600) from infra.chk_masterha limit 1;
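The trick relies on innodb_thread_concurrency capping how many threads may run inside InnoDB at once. A toy model with a one-slot semaphore shows the intended effect: while the long query holds the only slot, a ping that needs InnoDB cannot get in before its timeout. This is an analogy only; InnoDB's real admission control (queueing, innodb_concurrency_tickets) is more involved, and `simulate` is a hypothetical helper.

```python
import threading
import time


def simulate(hold_seconds: float = 1.0, ping_timeout: float = 0.1):
    """Return (ping_ok_while_blocked, ping_ok_afterwards)."""
    slots = threading.BoundedSemaphore(1)  # innodb_thread_concurrency=1
    started = threading.Event()

    def long_query():
        # stands in for: select sleep(600) from infra.chk_masterha limit 1
        with slots:
            started.set()
            time.sleep(hold_seconds)

    def ping() -> bool:
        # stands in for MHA's modified SELECT on the same table
        got = slots.acquire(timeout=ping_timeout)
        if got:
            slots.release()
        return got

    worker = threading.Thread(target=long_query)
    worker.start()
    started.wait()       # make sure the long query holds the slot first
    during = ping()      # times out: the only slot is taken
    worker.join()
    after = ping()       # slot free again
    return during, after


if __name__ == "__main__":
    print(simulate())  # (False, True)
```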



root@localhost 12:29:09 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time   | State                                                         | Info                                              | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |     16 |                                                               | NULL                                              |         1 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |      4 |                                                               | NULL                                              |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |      1 |                                                               | NULL                                              |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |      4 |                                                               | NULL                                              |         1 |             0 |
| 1256 | repler   | 172.16.120.11:59594 | NULL               | Binlog Dump GTID | 955662 | Master has sent all binlog to slave; waiting for more updates | NULL                                              |         0 |             0 |
| 1257 | repler   | 172.16.120.12:56540 | NULL               | Binlog Dump GTID | 955642 | Master has sent all binlog to slave; waiting for more updates | NULL                                              |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |     11 |                                                               | NULL                                              |         1 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |     96 |                                                               | NULL                                              |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |     34 |                                                               | NULL                                              |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |      6 |                                                               | NULL                                              |         0 |             0 |
| 1943 | root     | localhost           | dbms_monitor       | Query            |     21 | User sleep                                                    | select sleep(600) from infra.chk_masterha limit 1 |         0 |             0 |
| 2260 | root     | localhost           | dbms_monitor       | Query            |      0 | starting                                                      | show processlist                                  |         0 |             0 |
| 2303 | mha      | 172.16.120.13:34982 | NULL               | Query            |     20 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2305 | mha      | 172.16.120.13:34988 | NULL               | Query            |     17 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2308 | mha      | 172.16.120.13:34994 | NULL               | Query            |     14 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2310 | mha      | 172.16.120.13:34998 | NULL               | Query            |     11 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2312 | mha      | 172.16.120.13:35002 | NULL               | Query            |      8 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2314 | mha      | 172.16.120.13:35006 | NULL               | Query            |      5 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
| 2317 | mha      | 172.16.120.13:35010 | NULL               | Query            |      2 | Sending data                                                  | SELECT 1 As Value from infra.chk_masterha limit 1 |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
19 rows in set (0.00 sec)
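The processlist fits simple arithmetic: with ping_interval=3 the manager starts a new CONNECT ping every 3 seconds and none of them finishes, so after the sleep query has run for 21 seconds there are 21 / 3 = 7 blocked MHA sessions from 172.16.120.13, aged 20, 17, 14, 11, 8, 5 and 2 seconds. The sketch reproduces those ages, assuming the first blocked ping began 1 second into the hang (the `lag` parameter, fitted to this output). It also gives a rough, hypothetical estimate of how long stuck pings alone would take to use up max_connections=1024 if nothing closed them.

```python
def blocked_ping_ages(hang_seconds: int, ping_interval: int, lag: int = 1) -> list:
    """Expected Time values of MHA ping sessions stuck behind the hang.

    Assumes one new ping every ping_interval seconds, the first starting
    `lag` seconds into the hang, and none of them ever completing.
    """
    return list(range(hang_seconds - lag, 0, -ping_interval))


def seconds_until_exhausted(max_connections: int, in_use: int, ping_interval: int) -> int:
    """Rough time for stuck pings alone to fill the remaining connection slots."""
    return max(0, max_connections - in_use) * ping_interval


if __name__ == "__main__":
    print(blocked_ping_ages(21, 3))              # [20, 17, 14, 11, 8, 5, 2]
    print(seconds_until_exhausted(1024, 19, 3))  # 3015 (about 50 minutes)
```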

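Conclusion: no failover, but the MHA manager may exit with an error. Each CONNECT ping times out and its child process is killed, yet the secondary check still reaches the master from 172.16.120.11: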

Fri Oct  9 12:28:44 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 12:28:44 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 12:28:44 2020 - [info]  OK.
Fri Oct  9 12:28:44 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 12:28:44 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 12:28:44 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 12:28:44 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 12:28:44 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 12:28:53 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:28:53 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 12:28:53 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 12:28:53 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 12:28:53 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct  9 12:28:53 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 12:28:56 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:28:56 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 12:28:59 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:28:59 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

...

Fri Oct  9 12:30:47 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:30:47 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
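The log shows the CONNECT ping running in a child process that the manager kills once it exceeds its timeout ("Got timeout on MySQL Ping(CONNECT) child process and killed it!"). A minimal Python sketch of that pattern uses `subprocess.run` with a timeout, which kills and reaps the child before raising `TimeoutExpired`. The child here only sleeps in place of a real connect-and-query, and `ping_in_child` is a hypothetical name. Killing the client does not necessarily stop a query already queued inside mysqld; the growing list of blocked MHA sessions in the processlist above is consistent with that.

```python
import subprocess
import sys


def ping_in_child(hang_seconds: float, timeout: float) -> str:
    """Run a stand-in 'ping' in a child process and kill it on timeout."""
    # The child just sleeps; a real ping would connect and run the query.
    cmd = [sys.executable, "-c", f"import time; time.sleep({hang_seconds})"]
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and waited for the child here
        return "timeout: child killed"
    except subprocess.CalledProcessError:
        return "ping failed"
    return "ping ok"


if __name__ == "__main__":
    print(ping_in_child(0, 10))    # ping ok
    print(ping_in_child(5, 0.5))   # timeout: child killed
```

Isolating the ping in a child process lets the monitor keep its own schedule even when the database never answers, at the cost of leaving the server-side work behind.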

After the select sleep(600) from infra.chk_masterha limit 1 query was interrupted with Ctrl+C, the MHA manager exited with an error:

Fri Oct  9 12:30:49 2020 - [warning] Got error when monitoring master:  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 489.
Fri Oct  9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln491] Target master's advisory lock is already held by someone. Please check whether you monitor the same master from multiple monitoring processes.
Fri Oct  9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln511] Error happened on health checking.  at /usr/local/bin/masterha_manager line 50.
Fri Oct  9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Oct  9 12:30:49 2020 - [info] Got exit code 1 (Not master dead).
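The exit message refers to an advisory lock on the master. MySQL's advisory locks are the named locks taken with GET_LOCK() and freed with RELEASE_LOCK() or when the owning session ends; we assume that is the mechanism meant here, but the log shows neither the lock name nor which session held it. The toy model below (a hypothetical class, not MHA code) only illustrates the semantics: while one session holds the lock, another session's GET_LOCK returns 0, which a monitor would read as "someone else is already monitoring this master".

```python
class AdvisoryLocks:
    """Toy model of MySQL GET_LOCK()/RELEASE_LOCK() named-lock semantics."""

    def __init__(self):
        self.owner = {}  # lock name -> session currently holding it

    def get_lock(self, session, name):
        """1 if the session now holds the lock, 0 if another session has it.

        Real GET_LOCK(name, timeout) waits up to `timeout` before returning 0,
        and MySQL 5.7+ counts repeated acquisitions by the same session;
        both details are omitted here.
        """
        holder = self.owner.get(name)
        if holder is None or holder == session:
            self.owner[name] = session
            return 1
        return 0

    def release_lock(self, session, name):
        """1 if released, 0 if held by another session, None if no such lock."""
        holder = self.owner.get(name)
        if holder is None:
            return None
        if holder != session:
            return 0
        del self.owner[name]
        return 1

    def end_session(self, session):
        """Closing a session frees every lock it held."""
        for name in [n for n, s in self.owner.items() if s == session]:
            del self.owner[name]


if __name__ == "__main__":
    locks = AdvisoryLocks()
    # "mha_monitor" is a hypothetical lock name
    print(locks.get_lock("old-monitor-session", "mha_monitor"))  # 1
    print(locks.get_lock("new-monitor-session", "mha_monitor"))  # 0: already held
    locks.end_session("old-monitor-session")
    print(locks.get_lock("new-monitor-session", "mha_monitor"))  # 1
```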

ping_type=INSERT

As before, the master hang is simulated indirectly, by setting innodb_thread_concurrency to 1:

root@localhost 12:25:34 [dbms_monitor]> set global innodb_thread_concurrency=1;
Query OK, 0 rows affected (0.00 sec)

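Then run a long query against the InnoDB table by hand so that MHA's INSERT is blocked. The blocked INSERTs come both from the manager (172.16.120.13) and from the secondary-check hosts (172.16.120.11 and 172.16.120.12):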

root@localhost 12:35:21 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time   | State                                                         | Info                                                                                                 | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |      3 |                                                               | NULL                                                                                                 |         1 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |      1 |                                                               | NULL                                                                                                 |         1 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |      8 |                                                               | NULL                                                                                                 |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |      1 |                                                               | NULL                                                                                                 |         0 |             0 |
| 1256 | repler   | 172.16.120.11:59594 | NULL               | Binlog Dump GTID | 956039 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |         0 |             0 |
| 1257 | repler   | 172.16.120.12:56540 | NULL               | Binlog Dump GTID | 956019 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |     28 |                                                               | NULL                                                                                                 |         1 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |    113 |                                                               | NULL                                                                                                 |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |     51 |                                                               | NULL                                                                                                 |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     13 |                                                               | NULL                                                                                                 |         0 |             0 |
| 1943 | root     | localhost           | dbms_monitor       | Query            |     15 | User sleep                                                    | select sleep(600) from infra.chk_masterha limit 1                                                    |         0 |             0 |
| 2260 | root     | localhost           | dbms_monitor       | Query            |      0 | starting                                                      | show processlist                                                                                     |         0 |             0 |
| 2395 | mha      | 172.16.120.13:35206 | NULL               | Query            |     13 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
| 2398 | mha      | 172.16.120.13:35208 | NULL               | Query            |     11 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
| 2400 | mha      | 172.16.120.11:32908 | NULL               | Query            |     10 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
| 2401 | mha      | 172.16.120.13:35216 | NULL               | Query            |      8 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
| 2403 | mha      | 172.16.120.12:58066 | NULL               | Query            |      7 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
| 2404 | mha      | 172.16.120.13:35222 | NULL               | Query            |      5 | update                                                        | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
18 rows in set (0.00 sec)

Conclusion: failover is triggered.

Fri Oct  9 12:35:00 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 12:35:01 2020 - [info] GTID failover mode = 1
Fri Oct  9 12:35:01 2020 - [info] Dead Servers:
Fri Oct  9 12:35:01 2020 - [info] Alive Servers:
Fri Oct  9 12:35:01 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:01 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 12:35:01 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 12:35:01 2020 - [info] Alive Slaves:
Fri Oct  9 12:35:01 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:01 2020 - [info]     GTID ON
Fri Oct  9 12:35:01 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:01 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:01 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:01 2020 - [info]     GTID ON
Fri Oct  9 12:35:01 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:01 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:01 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:01 2020 - [info] Checking slave configurations..
Fri Oct  9 12:35:01 2020 - [info] Checking replication filtering settings..
Fri Oct  9 12:35:01 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 12:35:01 2020 - [info]  Replication filtering check ok.
Fri Oct  9 12:35:01 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 12:35:01 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 12:35:01 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 12:35:01 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 12:35:01 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 12:35:01 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 12:35:01 2020 - [info]  OK.
Fri Oct  9 12:35:01 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 12:35:01 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 12:35:01 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 12:35:01 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 12:35:01 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 12:35:16 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:35:16 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 12:35:16 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 12:35:17 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 12:35:19 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:35:19 2020 - [warning] Connection failed 2 time(s)..
Monitoring server 172.16.120.11 is reachable, Master is not writable from 172.16.120.11. OK.
Fri Oct  9 12:35:22 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:35:22 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not writable from 172.16.120.12. OK.
Fri Oct  9 12:35:23 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct  9 12:35:25 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct  9 12:35:25 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 12:35:25 2020 - [warning] Master is not reachable from health checker!
Fri Oct  9 12:35:25 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct  9 12:35:25 2020 - [warning] SSH is reachable.
Fri Oct  9 12:35:25 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct  9 12:35:25 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct  9 12:35:25 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 12:35:25 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 12:35:27 2020 - [info] GTID failover mode = 1
Fri Oct  9 12:35:27 2020 - [info] Dead Servers:
Fri Oct  9 12:35:27 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:27 2020 - [info] Alive Servers:
Fri Oct  9 12:35:27 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 12:35:27 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 12:35:27 2020 - [info] Alive Slaves:
Fri Oct  9 12:35:27 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:27 2020 - [info]     GTID ON
Fri Oct  9 12:35:27 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:27 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:27 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:27 2020 - [info]     GTID ON
Fri Oct  9 12:35:27 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:27 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:27 2020 - [info] Checking slave configurations..
Fri Oct  9 12:35:27 2020 - [info] Checking replication filtering settings..
Fri Oct  9 12:35:27 2020 - [info]  Replication filtering check ok.
Fri Oct  9 12:35:27 2020 - [info] Master is down!
Fri Oct  9 12:35:27 2020 - [info] Terminating monitoring script.
Fri Oct  9 12:35:27 2020 - [info] Got exit code 20 (Master dead).
Fri Oct  9 12:35:27 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct  9 12:35:27 2020 - [info] Starting master failover.
Fri Oct  9 12:35:27 2020 - [info] 
Fri Oct  9 12:35:27 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct  9 12:35:27 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] GTID failover mode = 1
Fri Oct  9 12:35:28 2020 - [info] Dead Servers:
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info] Alive Servers:
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 12:35:28 2020 - [info] Alive Slaves:
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info] Starting GTID based failover.
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct  9 12:35:28 2020 - [info] Executing master IP deactivation script:
Fri Oct  9 12:35:28 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root  
Disabling the VIP on old master: 172.16.120.10 
start down vipRTNETLINK answers: Cannot assign requested address
Fake!!! 原主库 rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1 
Fri Oct  9 12:35:28 2020 - [info]  done.
Fri Oct  9 12:35:28 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct  9 12:35:28 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:827239
Fri Oct  9 12:35:28 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:8822-10390
Fri Oct  9 12:35:28 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:827239
Fri Oct  9 12:35:28 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:8822-10390
Fri Oct  9 12:35:28 2020 - [info] Oldest slaves:
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] Searching new master from slaves..
Fri Oct  9 12:35:28 2020 - [info]  Candidate masters from the configuration file:
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 12:35:28 2020 - [info]     GTID ON
Fri Oct  9 12:35:28 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 12:35:28 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 12:35:28 2020 - [info]  Non-candidate masters:
Fri Oct  9 12:35:28 2020 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct  9 12:35:28 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct  9 12:35:28 2020 - [info] Starting master failover..
Fri Oct  9 12:35:28 2020 - [info] 
From:
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
 +--172.16.120.12(172.16.120.12:3358)
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info]  Waiting all logs to be applied.. 
Fri Oct  9 12:35:28 2020 - [info]   done.
Fri Oct  9 12:35:28 2020 - [info] Getting new master's binlog name and position..
Fri Oct  9 12:35:28 2020 - [info]  mysql-bin.000008:811243
Fri Oct  9 12:35:28 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct  9 12:35:28 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 811243, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-10390,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct  9 12:35:28 2020 - [info] Executing master IP activate script:
Fri Oct  9 12:35:28 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11 
Fake!!! 新主库 rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct  9 12:35:28 2020 - [info]  OK.
Fri Oct  9 12:35:28 2020 - [info] ** Finished master recovery successfully.
Fri Oct  9 12:35:28 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct  9 12:35:28 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 44798. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009123527.log if it takes time..
Fri Oct  9 12:35:29 2020 - [info] 
Fri Oct  9 12:35:29 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct  9 12:35:29 2020 - [info] 
Fri Oct  9 12:35:28 2020 - [info]  Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct  9 12:35:28 2020 - [info]  Executed CHANGE MASTER.
Fri Oct  9 12:35:28 2020 - [info]  Slave started.
Fri Oct  9 12:35:28 2020 - [info]  gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-10390,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct  9 12:35:29 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct  9 12:35:29 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct  9 12:35:29 2020 - [info] All new slave servers recovered successfully.
Fri Oct  9 12:35:29 2020 - [info] 
Fri Oct  9 12:35:29 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct  9 12:35:29 2020 - [info] 
Fri Oct  9 12:35:29 2020 - [info] Resetting slave info on the new master..
Fri Oct  9 12:35:29 2020 - [info]  172.16.120.11: Resetting slave info succeeded.
Fri Oct  9 12:35:29 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 12:35:29 2020 - [info] 

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 12:35:29 2020 - [info] Sending mail..
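In the log above the INSERT ping never returns. MHA therefore reports "Got timeout on MySQL Ping(INSERT) child process and killed it!" and counts the check as a failed connection. The idea of a timeout-guarded probe can be sketched as follows.

This is a simplified illustration, not MHA's code. MHA forks a child process and kills it, whereas this sketch uses a daemon thread. Python cannot kill a thread, so a hung probe is simply abandoned. `probe_with_timeout` and `simulated_hung_insert` are hypothetical names, and no database is involved.

```python
import threading
import time
from typing import Callable


def probe_with_timeout(probe: Callable[[], object], timeout_s: float) -> bool:
    """Return True only if `probe` finishes without raising within timeout_s seconds."""
    outcome = {"ok": False}

    def runner() -> None:
        try:
            probe()
            outcome["ok"] = True
        except Exception:
            pass  # any error counts as a failed ping

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # Still blocked (like a hung INSERT): report failure and abandon the thread.
        return False
    return outcome["ok"]


def simulated_hung_insert() -> None:
    """Hypothetical stand-in for an INSERT that blocks longer than the timeout."""
    time.sleep(0.5)
```

A caller would count each `False` as one "Connection failed N time(s)" step before running the secondary check.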

None of the following errors trigger a failover, not even a manual failover with --master_state=dead specified:

our @ALIVE_ERROR_CODES = (
  1040,    # ER_CON_COUNT_ERROR                  -- Too many connections
  1042,    # ER_BAD_HOST_ERROR                   -- Can't get hostname for your address
  1043,    # ER_HANDSHAKE_ERROR                  -- Bad handshake
  1044,    # ER_DBACCESS_DENIED_ERROR            -- Access denied for user '%s'@'%s' to database '%s'
  1045,    # ER_ACCESS_DENIED_ERROR              -- Access denied for user '%s'@'%s' (using password: %s)
  1129,    # ER_HOST_IS_BLOCKED                  -- Host '%s' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'
  1130,    # ER_HOST_NOT_PRIVILEGED              -- Host '%s' is not allowed to connect to this MySQL server
  1203,    # ER_TOO_MANY_USER_CONNECTIONS        -- User %s already has more than 'max_user_connections' active connections
  1226,    # ER_USER_LIMIT_REACHED               -- User '%s' has exceeded the '%s' resource (current value: %ld)
  1251,    # ER_NOT_SUPPORTED_AUTH_MODE          -- Client does not support authentication protocol requested by server; consider upgrading MySQL client
  1275,    # ER_SERVER_IS_IN_SECURE_AUTH_MODE    -- Server is running in --secure-auth mode, but '%s'@'%s' has a password in the old format; please change the password to the new format
);
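Each error in this list is returned by mysqld itself, so the server is evidently running, and MHA treats the master as alive. A client-side error such as 2003 (Can't connect to MySQL server) means no server answered, so it counts as a failed check instead.

Below is a minimal sketch of that classification. `classify_connect_error` is a hypothetical helper, not part of MHA. The codes are copied from the list above.

```python
# Error codes copied from @ALIVE_ERROR_CODES above: the server answered, so it is up.
ALIVE_ERROR_CODES = frozenset(
    {1040, 1042, 1043, 1044, 1045, 1129, 1130, 1203, 1226, 1251, 1275}
)


def classify_connect_error(errno: int) -> str:
    """Map a MySQL connect error code to 'alive' or 'unreachable'."""
    if errno in ALIVE_ERROR_CODES:
        # mysqld rejected the connection, which proves it is running: no failover.
        return "alive"
    # e.g. 2003: nothing answered, so this is a genuine failed health check.
    return "unreachable"
```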

For details, see the article "MHA - why doesn't too many connections trigger a failover?".

[Test case] Network failure between master and MHA manager (1)

Manager <-- unreachable --> Master
Manager <-- OK --> S1 <-- OK --> master
Manager <-- OK --> S2 <-- OK --> master

ping_type=CONNECT

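#!/bin/bash
# Run on the master (172.16.120.10): block new TCP connections from the manager (172.16.120.13)
IPTABLES="/sbin/iptables"
# Flush the existing rules (chain policies are not reset)
$IPTABLES -F
# Keep ICMP (ping) open from every host
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
# Allow all TCP traffic from the master itself and from slave1/slave2
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
# Drop new TCP connections (SYN packets) from everyone else, i.e. the manager;
# already-established connections do not match --syn and keep working
$IPTABLES -A INPUT -p tcp --syn -j DROP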
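These rules explain why the manager loses MySQL and SSH access while the slaves keep theirs. The rule list can be modeled as a first-match evaluation of the INPUT chain.

This is a simplified model, not iptables itself, and it rests on two assumptions:
- The default INPUT policy is ACCEPT, since the script only flushes rules and never sets -P.
- `--syn` is modeled as "SYN set, ACK clear", ignoring RST/FIN.

`Packet`, `Rule` and `verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Packet:
    proto: str          # "tcp" or "icmp"
    src: str            # source IP
    syn: bool = False   # SYN flag
    ack: bool = False   # ACK flag (RST/FIN are not modeled)


@dataclass(frozen=True)
class Rule:
    target: str                  # "ACCEPT" or "DROP"
    proto: str
    src: Optional[str] = None    # None means any source
    syn_only: bool = False       # models iptables --syn

    def matches(self, p: Packet) -> bool:
        if self.proto != p.proto:
            return False
        if self.src is not None and self.src != p.src:
            return False
        if self.syn_only and not (p.syn and not p.ack):
            return False  # not a connection-opening packet
        return True


# INPUT chain on the master, in the same order as the script above.
RULES = [
    Rule("ACCEPT", "icmp"),
    Rule("ACCEPT", "tcp", "172.16.120.10"),
    Rule("ACCEPT", "tcp", "172.16.120.11"),
    Rule("ACCEPT", "tcp", "172.16.120.12"),
    Rule("DROP", "tcp", syn_only=True),
]


def verdict(p: Packet, rules: Sequence[Rule] = RULES, policy: str = "ACCEPT") -> str:
    """First matching rule wins; otherwise the chain policy (assumed ACCEPT) applies."""
    for rule in rules:
        if rule.matches(p):
            return rule.target
    return policy
```

In this model a SYN from the manager is dropped. A packet on a connection that already exists (ACK set) falls through to the policy and is accepted, which is what the ping_type=INSERT test below relies on.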

Conclusion: no failover.

Fri Oct  9 15:29:50 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 15:29:51 2020 - [info] GTID failover mode = 1
Fri Oct  9 15:29:51 2020 - [info] Dead Servers:
Fri Oct  9 15:29:51 2020 - [info] Alive Servers:
Fri Oct  9 15:29:51 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:29:51 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 15:29:51 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 15:29:51 2020 - [info] Alive Slaves:
Fri Oct  9 15:29:51 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:29:51 2020 - [info]     GTID ON
Fri Oct  9 15:29:51 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:29:51 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:29:51 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:29:51 2020 - [info]     GTID ON
Fri Oct  9 15:29:51 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:29:51 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:29:51 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:29:51 2020 - [info] Checking slave configurations..
Fri Oct  9 15:29:51 2020 - [info] Checking replication filtering settings..
Fri Oct  9 15:29:51 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 15:29:51 2020 - [info]  Replication filtering check ok.
Fri Oct  9 15:29:51 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 15:29:51 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 15:29:51 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 15:29:51 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 15:29:51 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 15:29:51 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 15:29:51 2020 - [info]  OK.
Fri Oct  9 15:29:51 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 15:29:51 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 15:29:51 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 15:29:51 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 15:29:51 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:32:56 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:32:56 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:32:56 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct  9 15:32:56 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:33:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:00 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:33:01 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct  9 15:33:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:03 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:33:06 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:06 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:33:06 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:33:09 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:09 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:33:09 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:33:09 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct  9 15:33:09 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:33:12 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:12 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:33:14 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct  9 15:33:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:15 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:33:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:18 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:33:18 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:33:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:21 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:33:21 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:33:21 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct  9 15:33:21 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:33:24 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:24 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:33:26 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct  9 15:33:27 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:27 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:33:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:30 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:33:30 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:33:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:33:33 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:33:33 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:33:33 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct  9 15:33:34 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
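The log shows the decision rule. Whenever the manager's own check fails, it runs masterha_secondary_check. As long as the master is reachable from at least one other monitoring server, failover does not start and the failure counter restarts at 1.

Failover proceeds only in one situation: the manager's check has failed repeatedly (4 times in these logs) and the master is unreachable from every secondary server. That is what happened in the INSERT test earlier ("Master is not reachable from all other monitoring servers. Failover should start.").

Below is a sketch of that decision. The threshold comes from the logs, not from MHA's source, and `should_start_failover` is hypothetical.

```python
from typing import Mapping


def should_start_failover(
    consecutive_failures: int,
    secondary_reachable: Mapping[str, bool],
    max_failures: int = 4,  # assumption: threshold observed in the logs above
) -> bool:
    """secondary_reachable maps each secondary server to 'can it reach the master?'."""
    if not secondary_reachable:
        raise ValueError("this sketch assumes at least one secondary monitoring server")
    if consecutive_failures < max_failures:
        return False
    # Any secondary server that can still reach the master vetoes the failover.
    return not any(secondary_reachable.values())
```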

ping_type=INSERT

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

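At this point the manager keeps working normally. ping_type=INSERT uses a persistent connection that was established before the rules were applied, and --syn -j DROP only drops packets that open new TCP connections.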
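The two ping types react differently to a rule that drops only new connections, and a toy model shows why:
- A CONNECT-style check opens a fresh connection every time, so it fails as soon as SYN packets are dropped.
- An INSERT-style check reuses its existing connection and only fails once that connection is gone, for example after it is killed.

`Network`, `ConnectPinger`, `PersistentPinger` and `simulate` are hypothetical, and no real MySQL connection is made.

```python
from typing import Dict, Optional, Set, Tuple


class Network:
    """Hypothetical model of the master's firewall as seen from the manager."""

    def __init__(self) -> None:
        self.block_new = False               # True once the SYN-dropping rule is active
        self.open_conns: Set[int] = set()    # ids of established connections
        self._next_id = 1

    def connect(self) -> int:
        if self.block_new:
            raise ConnectionError("SYN dropped: connect timed out")
        conn_id = self._next_id
        self._next_id += 1
        self.open_conns.add(conn_id)
        return conn_id

    def close(self, conn_id: int) -> None:
        # Models both a client disconnect and a server-side KILL.
        self.open_conns.discard(conn_id)

    def query(self, conn_id: int) -> None:
        if conn_id not in self.open_conns:
            raise ConnectionError("connection was closed or killed")


class ConnectPinger:
    """Like ping_type=CONNECT: a new connection for every check."""

    def __init__(self, net: Network) -> None:
        self.net = net

    def ping(self) -> bool:
        try:
            self.net.close(self.net.connect())
            return True
        except ConnectionError:
            return False


class PersistentPinger:
    """Like ping_type=INSERT as observed here: reuse one connection, reconnect if it breaks."""

    def __init__(self, net: Network) -> None:
        self.net = net
        self.conn_id: Optional[int] = None

    def ping(self) -> bool:
        try:
            if self.conn_id is None:
                self.conn_id = self.net.connect()
            self.net.query(self.conn_id)
            return True
        except ConnectionError:
            self.conn_id = None  # the next check will try to reconnect
            return False


def simulate() -> Dict[str, Tuple[bool, bool]]:
    """(CONNECT result, INSERT result) before the rule, after it, and after KILL."""
    net = Network()
    connect_ping, insert_ping = ConnectPinger(net), PersistentPinger(net)
    results = {"before": (connect_ping.ping(), insert_ping.ping())}
    net.block_new = True                      # iptables rule applied
    results["blocked"] = (connect_ping.ping(), insert_ping.ping())
    net.close(insert_ping.conn_id)            # KILL the persistent connection
    results["after_kill"] = (connect_ping.ping(), insert_ping.ping())
    return results
```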

Next, kill the manager's connection (Id 2836, from 172.16.120.13):

root@localhost 15:39:31 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time  | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |     4 |                                                               | NULL             |         0 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |    22 |                                                               | NULL             |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |     9 |                                                               | NULL             |         1 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |     2 |                                                               | NULL             |         1 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |    19 |                                                               | NULL             |         0 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |   111 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |    49 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     4 |                                                               | NULL             |         1 |             0 |
| 2409 | repler   | 172.16.120.11:32918 | NULL               | Binlog Dump GTID | 10898 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2411 | repler   | 172.16.120.12:58084 | NULL               | Binlog Dump GTID | 10873 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2627 | root     | localhost           | dbms_monitor       | Query            |     0 | starting                                                      | show processlist |         0 |             0 |
| 2836 | mha      | 172.16.120.13:35810 | NULL               | Sleep            |     2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 15:39:39 [dbms_monitor]> kill 2836;
Query OK, 0 rows affected (0.00 sec)
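Here the connection Id (2836) was looked up by hand. The following minimal sketch does the lookup from batch output instead. It assumes the mysql client can log in without prompting, for example through ~/.my.cnf. conn_ids_from is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the Id of every processlist row owned by
# user $1 and coming from host $2 (any client port).
# Reads the tab-separated output of: mysql -N -B -e 'SHOW PROCESSLIST'
# where the first three columns are Id, User and Host ("ip:port").
# Matching on "ip:" avoids 172.16.120.1 also matching 172.16.120.13.
conn_ids_from() {
    awk -F '\t' -v user="$1" -v host="$2" '
        $2 == user && index($3, host ":") == 1 { print $1 }'
}

# Example (manager at 172.16.120.13), run on the master:
#   mysql -N -B -e 'SHOW PROCESSLIST' | conn_ids_from mha 172.16.120.13 |
#       while read -r id; do mysql -e "KILL $id"; done
```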

Conclusion: no failover. The secondary check still reaches the master from 172.16.120.11.

Fri Oct  9 15:37:54 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 15:37:55 2020 - [info] GTID failover mode = 1
Fri Oct  9 15:37:55 2020 - [info] Dead Servers:
Fri Oct  9 15:37:55 2020 - [info] Alive Servers:
Fri Oct  9 15:37:55 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:37:55 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 15:37:55 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 15:37:55 2020 - [info] Alive Slaves:
Fri Oct  9 15:37:55 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:37:55 2020 - [info]     GTID ON
Fri Oct  9 15:37:55 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:37:55 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:37:55 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:37:55 2020 - [info]     GTID ON
Fri Oct  9 15:37:55 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:37:55 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:37:55 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:37:55 2020 - [info] Checking slave configurations..
Fri Oct  9 15:37:55 2020 - [info] Checking replication filtering settings..
Fri Oct  9 15:37:55 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 15:37:55 2020 - [info]  Replication filtering check ok.
Fri Oct  9 15:37:55 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 15:37:55 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 15:37:55 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 15:37:55 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 15:37:55 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 15:37:55 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 15:37:55 2020 - [info]  OK.
Fri Oct  9 15:37:55 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 15:37:55 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 15:37:55 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 15:37:55 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 15:37:55 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:39:46 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 15:39:46 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:39:46 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Master is reachable from 172.16.120.11!
Fri Oct  9 15:39:47 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:39:51 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct  9 15:39:52 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:39:52 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:39:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:39:55 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:39:58 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:39:58 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:39:58 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:40:01 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:40:01 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:40:01 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 15:40:01 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct  9 15:40:01 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:40:04 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:40:04 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:40:06 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct  9 15:40:07 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:40:07 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:40:10 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:40:10 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:40:10 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:40:13 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:40:13 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:40:13 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:40:13 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Master is reachable from 172.16.120.11!
Fri Oct  9 15:40:13 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 15:40:14 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:40:14 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
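Runs like this produce long, repetitive logs. The following minimal sketch counts the outcomes that matter in manager.log, which is /masterha/cls_new/manager.log per manager_log in cls_new.cnf. The helper name is hypothetical and not an MHA tool. The grep patterns are copied from the messages in the logs above:

```shell
#!/bin/sh
# Hypothetical helper: count the health-check events seen in this section's
# logs. The patterns are copied from MHA 0.58 manager.log messages.
summarize_manager_log() {
    log=$1
    # Each "Connection failed N time(s)" line is one failed reconnect attempt.
    printf 'failed_connects=%s\n' "$(grep -c 'Connection failed [0-9]* time(s)' "$log")"
    # One line per round in which the secondary check vetoed failover.
    printf 'secondary_vetoes=%s\n' "$(grep -c 'Secondary network check script returned errors' "$log")"
    printf 'master_reachable=%s\n' "$(grep -c 'Master is reachable from at least one' "$log")"
    printf 'monitor_unreachable=%s\n' "$(grep -c 'At least one of monitoring servers is not reachable' "$log")"
    # Includes the first successful ping right after the manager starts.
    printf 'ping_ok=%s\n' "$(grep -c 'succeeded, waiting until MySQL' "$log")"
}

# Example: summarize_manager_log /masterha/cls_new/manager.log
```

Applied to just the excerpt above, it reports failed_connects=8, secondary_vetoes=2, master_reachable=3, monitor_unreachable=0 and ping_ok=2. The initial "Got error on MySQL insert ping" line is not counted as a failed connect.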

[Test case] Network failure between master and MHA manager (2)

Manager <-- blocked --> Master
Manager <-- blocked --> S1 <-- OK --> master
Manager <-- OK --> S2 <-- OK --> master

ping_type=CONNECT

slave-1

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager can no longer connect to slave-1: ping still works, but ssh hangs.

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.349 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.651 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.349/0.500/0.651/0.151 ms

[root@centos-4 15:48:55 /usr/local/share/perl5/MHA]
#ssh centos-2
^C
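The manual check above (ping answers, ssh hangs until ^C) can be scripted so that it does not block. This is a sketch. It assumes iputils ping, where -W is the reply timeout in seconds, and OpenSSH's BatchMode and ConnectTimeout options. check_host and the PING_CMD/SSH_CMD overrides are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: report ICMP and SSH reachability of one host, the two
# checks done by hand above. ConnectTimeout bounds the wait when the SYN is
# silently dropped; BatchMode stops ssh from prompting for a password.
# Any ssh failure (timeout, refused, auth error) is reported as "fail".
# PING_CMD and SSH_CMD may be overridden, e.g. to test without a network.
check_host() {
    host=$1
    if ${PING_CMD:-ping -c 2 -W 1} "$host" >/dev/null 2>&1; then icmp=ok; else icmp=fail; fi
    if ${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=3} "$host" true >/dev/null 2>&1; then
        ssh=ok
    else
        ssh=fail
    fi
    echo "$host icmp=$icmp ssh=$ssh"
}

# Example, run on the manager:
#   for h in centos-2 centos-3; do check_host "$h"; done
# With the slave-1 rules in place, centos-2 would be expected to show icmp=ok ssh=fail.
```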

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

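Conclusion: no failover. The secondary check cannot SSH to monitoring server 172.16.120.11, so MHA treats the outage as a network problem.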

Fri Oct  9 15:48:03 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 15:48:05 2020 - [info] GTID failover mode = 1
Fri Oct  9 15:48:05 2020 - [info] Dead Servers:
Fri Oct  9 15:48:05 2020 - [info] Alive Servers:
Fri Oct  9 15:48:05 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:48:05 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 15:48:05 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 15:48:05 2020 - [info] Alive Slaves:
Fri Oct  9 15:48:05 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:48:05 2020 - [info]     GTID ON
Fri Oct  9 15:48:05 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:48:05 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:48:05 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:48:05 2020 - [info]     GTID ON
Fri Oct  9 15:48:05 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:48:05 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:48:05 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:48:05 2020 - [info] Checking slave configurations..
Fri Oct  9 15:48:05 2020 - [info] Checking replication filtering settings..
Fri Oct  9 15:48:05 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 15:48:05 2020 - [info]  Replication filtering check ok.
Fri Oct  9 15:48:05 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 15:48:05 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 15:48:05 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 15:48:05 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 15:48:05 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 15:48:05 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 15:48:05 2020 - [info]  OK.
Fri Oct  9 15:48:05 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 15:48:05 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 15:48:05 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 15:48:05 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 15:48:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:50:40 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:40 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:50:40 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:50:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:44 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:50:45 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 15:50:45 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 15:50:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:47 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:50:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:50 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:50:50 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:50:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:53 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:50:53 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:50:53 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:50:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:56 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:50:58 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 15:50:58 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 15:50:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:50:59 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:51:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:51:02 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:51:02 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:51:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:51:05 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:51:05 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 15:51:05 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:51:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:51:05 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 15:51:09 2020 - [warning] Got timeout on Secondary Check child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
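The verdicts so far can be compared. The secondary check vetoes failover when any monitoring server can still reach the master. It also vetoes failover when the manager cannot SSH to a monitoring server at all. Failover is only allowed when every monitoring server is reachable and none of them can reach the master, which is the case marked as failing over in the TL;DR table. The following is a simplified model of that rule. It is inferred from these log messages rather than taken from MHA's Perl source, and the state names and function are hypothetical:

```shell
#!/bin/sh
# Hypothetical model of the secondary-check verdicts seen in manager.log;
# NOT MHA's implementation. Each argument describes one monitoring server,
# in the order given to masterha_secondary_check -s:
#   ok-down     : the manager can SSH to it and it cannot reach the master
#   ok-up       : the manager can SSH to it and it CAN reach the master
#   unreachable : the manager cannot SSH to it at all
# Returns 0 (failover allowed) only if every server is "ok-down"; otherwise
# prints the matching log message for the first offending server, returns 1.
failover_allowed() {
    # This model requires at least one monitoring server.
    [ "$#" -gt 0 ] || return 1
    for state in "$@"; do
        case "$state" in
            ok-down) ;;
            ok-up)
                echo "Master is reachable from at least one of other monitoring servers. Failover should not happen."
                return 1 ;;
            unreachable)
                echo "At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen."
                return 1 ;;
            *)
                echo "unknown state: $state" >&2
                return 2 ;;
        esac
    done
    return 0
}

# Example: this case (slave-1 blocked, slave-2 can reach the master)
#   failover_allowed unreachable ok-up   # returns 1, "not reachable" message
```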

ping_type=INSERT

slave-1

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager can no longer connect to slave-1: ping still works, but ssh hangs.

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.349 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.651 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.349/0.500/0.651/0.151 ms

[root@centos-4 15:48:55 /usr/local/share/perl5/MHA]
#ssh centos-2
^C

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Because ping_type=INSERT uses a persistent connection, nothing abnormal happens yet.

Kill the manager's connection on the master:

root@localhost 15:39:45 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time  | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |     3 |                                                               | NULL             |         1 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |     1 |                                                               | NULL             |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |     8 |                                                               | NULL             |         1 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |    11 |                                                               | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |     8 |                                                               | NULL             |         0 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |    89 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |    26 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     3 |                                                               | NULL             |         0 |             0 |
| 2409 | repler   | 172.16.120.11:32918 | NULL               | Binlog Dump GTID | 11837 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2411 | repler   | 172.16.120.12:58084 | NULL               | Binlog Dump GTID | 11812 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2627 | root     | localhost           | dbms_monitor       | Query            |     0 | starting                                                      | show processlist |         0 |             0 |
| 2953 | mha      | 172.16.120.13:36174 | NULL               | Sleep            |     0 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 15:55:18 [dbms_monitor]> kill 2953;
Query OK, 0 rows affected (0.00 sec)

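Conclusion: no failover. The secondary check cannot SSH to monitoring server 172.16.120.11, so MHA treats the outage as a network problem.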

Fri Oct  9 15:52:43 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 15:52:44 2020 - [info] GTID failover mode = 1
Fri Oct  9 15:52:44 2020 - [info] Dead Servers:
Fri Oct  9 15:52:44 2020 - [info] Alive Servers:
Fri Oct  9 15:52:44 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:52:44 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 15:52:44 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 15:52:44 2020 - [info] Alive Slaves:
Fri Oct  9 15:52:44 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:52:44 2020 - [info]     GTID ON
Fri Oct  9 15:52:44 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:52:44 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:52:44 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 15:52:44 2020 - [info]     GTID ON
Fri Oct  9 15:52:44 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:52:44 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 15:52:44 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 15:52:44 2020 - [info] Checking slave configurations..
Fri Oct  9 15:52:44 2020 - [info] Checking replication filtering settings..
Fri Oct  9 15:52:44 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 15:52:44 2020 - [info]  Replication filtering check ok.
Fri Oct  9 15:52:44 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 15:52:44 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 15:52:45 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 15:52:45 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 15:52:45 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 15:52:45 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 15:52:45 2020 - [info]  OK.
Fri Oct  9 15:52:45 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 15:52:45 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 15:52:45 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 15:52:45 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 15:52:45 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:55:24 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 15:55:24 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 15:55:24 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:55:29 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 15:55:29 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 15:55:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:30 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:55:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:33 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:55:36 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:36 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:55:36 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:55:39 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:39 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:55:39 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 15:55:39 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:55:42 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:42 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:55:44 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 15:55:44 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 15:55:45 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:45 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:55:48 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:48 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:55:48 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:55:51 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:51 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:55:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 15:55:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:55:54 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:54 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 15:55:56 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 15:55:56 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 15:55:57 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:55:57 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 15:56:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:56:00 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 15:56:00 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 15:56:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 15:56:03 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 15:56:03 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 15:56:03 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 15:56:03 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 15:56:03 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct  9 15:56:03 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
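The timestamps also show how long one health-check round takes. The first failed ping is logged at 15:55:24. The veto ("Secondary network check script returned errors") is logged at 15:55:36, right after the fourth counted failure. With ping_interval=3, that round therefore lasts about 12 seconds, the same as in the first case (15:39:46 to 15:39:58). Below is a small awk-based helper for this kind of arithmetic. It is hypothetical and handles only HH:MM:SS timestamps on the same day:

```shell
#!/bin/sh
# Hypothetical helper: seconds from timestamp $1 to timestamp $2, both in
# HH:MM:SS on the same day (as in the manager.log lines above).
elapsed_seconds() {
    echo "$1 $2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        t1 = a[1] * 3600 + a[2] * 60 + a[3]
        t2 = b[1] * 3600 + b[2] * 60 + b[3]
        print t2 - t1
    }'
}

# Example: elapsed_seconds 15:55:24 15:55:36   # prints 12
```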

[Test case] Network failure between master and MHA manager (3)

Manager <-- blocked --> Master
Manager <-- blocked --> S1 <-- OK --> master
Manager <-- blocked --> S2 <-- OK --> master

ping_type=CONNECT

slave-1, slave-2

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP
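
At this point the manager can no longer connect to slave-1 or slave-2: ping still works, but ssh hangs.

#ping centos-2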
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.441 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.441/0.441/0.442/0.021 ms

[root@centos-4 16:44:27 /usr/local/share/perl5/MHA]
#ssh centos-2 
^C

[root@centos-4 16:44:30 /usr/local/share/perl5/MHA]
#ping centos-3
PING centos-3 (172.16.120.12) 56(84) bytes of data.
64 bytes from centos-3 (172.16.120.12): icmp_seq=1 ttl=64 time=0.335 ms
64 bytes from centos-3 (172.16.120.12): icmp_seq=2 ttl=64 time=0.575 ms
^C
--- centos-3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.335/0.455/0.575/0.120 ms

[root@centos-4 16:44:34 /usr/local/share/perl5/MHA]
#ssh centos-3
^C
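The manual checks above (ping succeeds, ssh hangs until ^C) can also be scripted. A minimal sketch, assuming bash with /dev/tcp support and coreutils timeout; the function name tcp_reachable is hypothetical and not part of MHA. A port behind the --syn DROP rule hangs until the timeout, while an open port connects immediately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MHA): test whether a TCP port accepts
# connections, using bash's /dev/tcp redirection bounded by coreutils `timeout`.
# Returns 0 if the connection is established. A port behind the
# `--syn -j DROP` rule hangs until the timeout (exit 124), and a closed
# port fails immediately; both return non-zero.
# Usage: tcp_reachable HOST PORT [TIMEOUT_SECONDS]
tcp_reachable() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example run from the manager (centos-4), checking ssh (22) and MySQL (3358):
#   for h in 172.16.120.10 172.16.120.11 172.16.120.12; do
#     for p in 22 3358; do
#       if tcp_reachable "$h" "$p"; then echo "$h:$p open"; else echo "$h:$p blocked"; fi
#     done
#   done
```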

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP
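Every rule set in these test cases follows the same pattern: flush the filter table, accept ICMP, accept TCP from a whitelist of source IPs, then drop any other new TCP connection attempt (--syn). Because only SYN packets are dropped, connections established before the rules were applied keep working, which is why a persistent ping_type=INSERT connection survives them. A minimal sketch that prints the rule set for a given whitelist; the function name is hypothetical, and it assumes the default INPUT policy is ACCEPT, as in the rules above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the whitelist rule set used in these test cases.
# Nothing is executed; review the output, then pipe it to a root shell.
IPTABLES="/sbin/iptables"

# print_whitelist_rules IP...  -> one iptables command per line
print_whitelist_rules() {
  # -F without a chain name flushes every chain of the filter table.
  echo "$IPTABLES -F"
  # ICMP stays open, so ping keeps working between all hosts.
  echo "$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT"
  local ip
  for ip in "$@"; do
    echo "$IPTABLES -A INPUT -p tcp -s $ip -j ACCEPT"
  done
  # Drop only new TCP connection attempts; established sessions survive.
  echo "$IPTABLES -A INPUT -p tcp --syn -j DROP"
}

# Example, on the master in case 4 (as root):
#   print_whitelist_rules 172.16.120.10 172.16.120.12 | sh
```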

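Conclusion: no failover. The secondary check cannot ssh to 172.16.120.11 (Monitoring server 172.16.120.11 is NOT reachable!), so it treats the problem as a network issue and vetoes the failover.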

Fri Oct  9 16:43:25 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 16:43:26 2020 - [info] GTID failover mode = 1
Fri Oct  9 16:43:26 2020 - [info] Dead Servers:
Fri Oct  9 16:43:26 2020 - [info] Alive Servers:
Fri Oct  9 16:43:26 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:43:26 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 16:43:26 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 16:43:26 2020 - [info] Alive Slaves:
Fri Oct  9 16:43:26 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:43:26 2020 - [info]     GTID ON
Fri Oct  9 16:43:26 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:43:26 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:43:26 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:43:26 2020 - [info]     GTID ON
Fri Oct  9 16:43:26 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:43:26 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:43:26 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:43:26 2020 - [info] Checking slave configurations..
Fri Oct  9 16:43:26 2020 - [info] Checking replication filtering settings..
Fri Oct  9 16:43:26 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 16:43:26 2020 - [info]  Replication filtering check ok.
Fri Oct  9 16:43:26 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 16:43:26 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 16:43:26 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 16:43:26 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 16:43:26 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 16:43:26 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 16:43:26 2020 - [info]  OK.
Fri Oct  9 16:43:26 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 16:43:26 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 16:43:26 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 16:43:26 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 16:43:26 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 16:45:55 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:45:55 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:45:55 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 16:45:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:45:59 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:46:00 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 16:46:00 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 16:46:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:46:02 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:46:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:46:05 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 16:46:05 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 16:46:08 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:46:08 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 16:46:08 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:46:08 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 16:46:11 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:46:11 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:46:13 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 16:46:13 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 16:46:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:46:14 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:46:15 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

ping_type=INSERT

slave-1,slave-2

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

On the manager (centos-4), ping still succeeds but ssh hangs until interrupted with ^C:

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.352 ms
^C
--- centos-2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.352/0.352/0.352/0.000 ms

[root@centos-4 17:52:38 ~]
#ping centos-3
PING centos-3 (172.16.120.12) 56(84) bytes of data.
64 bytes from centos-3 (172.16.120.12): icmp_seq=1 ttl=64 time=0.221 ms
^C
--- centos-3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.221/0.221/0.221/0.000 ms

[root@centos-4 17:52:41 ~]
#ssh centos-2 
^C

[root@centos-4 17:52:44 ~]
#ssh centos-3

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

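Because ping_type=INSERT keeps a persistent connection, nothing abnormal happens at first: the rules above only drop new TCP connection attempts (--syn), so the manager's already established connection to the master keeps working.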

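Kill the connection: on the master, kill the mha session coming from the manager (172.16.120.13), which forces the manager to reconnect.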

root@localhost 17:48:11 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time  | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |    14 |                                                               | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |     4 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |    32 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     4 |                                                               | NULL             |         1 |             0 |
| 2409 | repler   | 172.16.120.11:32918 | NULL               | Binlog Dump GTID | 18936 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2411 | repler   | 172.16.120.12:58084 | NULL               | Binlog Dump GTID | 18911 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 3192 | proxysql | 172.16.120.11:33786 | NULL               | Sleep            |     4 |                                                               | NULL             |         0 |             0 |
| 3238 | proxysql | 172.16.120.12:59006 | NULL               | Sleep            |     4 |                                                               | NULL             |         1 |             0 |
| 3245 | proxysql | 172.16.120.11:33888 | NULL               | Sleep            |    14 |                                                               | NULL             |         0 |             0 |
| 3262 | root     | localhost           | dbms_monitor       | Query            |     0 | starting                                                      | show processlist |         0 |             0 |
| 3268 | mha      | 172.16.120.13:36868 | NULL               | Sleep            |     2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 17:53:37 [dbms_monitor]> kill 3268;
Query OK, 0 rows affected (0.00 sec)
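Looking up the session Id by hand works, but the same step can be generated from information_schema.processlist. A minimal sketch, assuming the mysql client on the master can log in as a privileged user (for example via ~/.my.cnf); the function name manager_kill_sql is hypothetical. Note that it kills every session of that user from that host, not only the health-check one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MHA or MySQL): print a query that, run on
# the master, generates one KILL statement per session of USER coming from HOST.
# information_schema.PROCESSLIST reports HOST as "ip:port", hence the LIKE.
# Usage: manager_kill_sql USER HOST
manager_kill_sql() {
  local user="$1" host="$2"
  printf "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE user = '%s' AND host LIKE '%s:%%';\n" \
    "$user" "$host"
}

# Example on the master; -N drops the column header so only the generated
# KILL lines are piped into the second client:
#   mysql -N -e "$(manager_kill_sql mha 172.16.120.13)" | mysql
```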

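Conclusion: no failover. After the kill, the manager cannot reconnect, but the secondary check cannot ssh to 172.16.120.11 (Monitoring server 172.16.120.11 is NOT reachable!), so it vetoes the failover.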

Fri Oct  9 17:50:48 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 17:50:49 2020 - [info] GTID failover mode = 1
Fri Oct  9 17:50:49 2020 - [info] Dead Servers:
Fri Oct  9 17:50:49 2020 - [info] Alive Servers:
Fri Oct  9 17:50:49 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 17:50:49 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 17:50:49 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 17:50:49 2020 - [info] Alive Slaves:
Fri Oct  9 17:50:49 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 17:50:49 2020 - [info]     GTID ON
Fri Oct  9 17:50:49 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 17:50:49 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 17:50:49 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 17:50:49 2020 - [info]     GTID ON
Fri Oct  9 17:50:49 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 17:50:49 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 17:50:49 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 17:50:49 2020 - [info] Checking slave configurations..
Fri Oct  9 17:50:49 2020 - [info] Checking replication filtering settings..
Fri Oct  9 17:50:49 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 17:50:49 2020 - [info]  Replication filtering check ok.
Fri Oct  9 17:50:49 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 17:50:49 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 17:50:49 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 17:50:49 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 17:50:49 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 17:50:49 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 17:50:49 2020 - [info]  OK.
Fri Oct  9 17:50:49 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 17:50:49 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 17:50:49 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 17:50:49 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 17:50:49 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 17:53:43 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 17:53:43 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 17:53:43 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 17:53:48 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 17:53:48 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 17:53:49 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:53:49 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 17:53:52 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:53:52 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 17:53:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:53:55 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 17:53:55 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 17:53:58 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:53:58 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 17:53:58 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 17:53:58 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 17:54:01 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:01 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 17:54:03 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 17:54:03 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 17:54:04 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:04 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 17:54:07 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:07 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 17:54:07 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 17:54:10 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:10 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 17:54:10 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 17:54:10 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 17:54:13 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:13 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 17:54:15 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct  9 17:54:15 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct  9 17:54:16 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 17:54:16 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 17:54:16 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
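Each round in these logs follows the same sequence: ping error, up to four connection failures, a secondary network check, and a veto. A minimal sketch for summarizing those events in the manager_log configured above (/masterha/cls_new/manager.log); the function name is hypothetical and the patterns are copied from the log lines in this section, not from MHA's documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MHA): count the health-check events seen in
# a manager.log like the excerpts above. "Got error on MySQL connect" also
# matches the "connect ping" error that opens each ping_type=CONNECT round.
# Usage: summarize_manager_log /masterha/cls_new/manager.log
summarize_manager_log() {
  local log="$1"
  # grep -c prints 0 (and exits 1) when nothing matches, which is fine here.
  printf 'connect_errors=%s\n' "$(grep -c 'Got error on MySQL connect' "$log")"
  printf 'insert_ping_errors=%s\n' "$(grep -c 'Got error on MySQL insert ping' "$log")"
  printf 'secondary_checks=%s\n' "$(grep -c 'Executing secondary network check script' "$log")"
  printf 'failover_vetoes=%s\n' "$(grep -c 'Failover should not happen' "$log")"
}
```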

[Test case] Network failure between master and MHA manager, case 4

Manager <-- unreachable --> Master
Manager <-- reachable --> S1 <-- unreachable --> master
Manager <-- reachable --> S2 <-- reachable --> master

ping_type=CONNECT

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

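Conclusion: no failover. 172.16.120.11 is reachable but cannot reach the master, while 172.16.120.12 still can (Master is reachable from 172.16.120.12!), so the secondary check vetoes the failover.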

Fri Oct  9 16:05:55 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 16:05:56 2020 - [info] GTID failover mode = 1
Fri Oct  9 16:05:56 2020 - [info] Dead Servers:
Fri Oct  9 16:05:56 2020 - [info] Alive Servers:
Fri Oct  9 16:05:56 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:05:56 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 16:05:56 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 16:05:56 2020 - [info] Alive Slaves:
Fri Oct  9 16:05:56 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:05:56 2020 - [info]     GTID ON
Fri Oct  9 16:05:56 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:05:56 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:05:56 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:05:56 2020 - [info]     GTID ON
Fri Oct  9 16:05:56 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:05:56 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:05:56 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:05:56 2020 - [info] Checking slave configurations..
Fri Oct  9 16:05:56 2020 - [info] Checking replication filtering settings..
Fri Oct  9 16:05:56 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 16:05:56 2020 - [info]  Replication filtering check ok.
Fri Oct  9 16:05:56 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 16:05:56 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 16:05:56 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 16:05:56 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 16:05:56 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 16:05:56 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 16:05:56 2020 - [info]  OK.
Fri Oct  9 16:05:56 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 16:05:56 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 16:05:56 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 16:05:56 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 16:05:56 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 16:06:43 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:43 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:06:43 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 16:06:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:47 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:06:48 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct  9 16:06:48 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 16:06:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:50 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:06:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:53 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 16:06:53 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 16:06:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:56 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 16:06:56 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 16:06:56 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:06:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:06:59 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:07:01 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct  9 16:07:01 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 16:07:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:07:02 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:07:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:07:05 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 16:07:05 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 16:07:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
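Taken together, cases 3 and 4 suggest a simple rule for the secondary check: failover proceeds only when every monitoring server is reachable from the manager and none of them can reach the master. A minimal sketch of that rule, inferred from the log messages in these tests rather than from MHA's source; the function name, state names and return codes are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical model of the secondary-check verdict, inferred from the
# manager.log messages in these test cases (not taken from MHA's source).
# Each argument is HOST=STATE, where STATE is one of:
#   unreachable         the manager cannot ssh to the monitoring host
#   master_reachable    the monitoring host can still reach the master
#   master_unreachable  the monitoring host is up but cannot reach the master
# Hosts are checked in order and the first veto stops the check, as in the logs.
# Returns 0 if failover may start, 1 if it is vetoed, 2 on bad input.
secondary_check_verdict() {
  [ "$#" -gt 0 ] || { echo "no monitoring servers given" >&2; return 2; }
  local arg host state
  for arg in "$@"; do
    host="${arg%%=*}"
    state="${arg#*=}"
    case "$state" in
      unreachable)
        echo "Monitoring server $host is NOT reachable: failover should not happen"
        return 1 ;;
      master_reachable)
        echo "Master is reachable from $host: failover should not happen"
        return 1 ;;
      master_unreachable)
        echo "Master is not reachable from $host: OK" ;;
      *)
        echo "unknown state '$state' for $host" >&2
        return 2 ;;
    esac
  done
  echo "Master is unreachable from every monitoring server: failover may start"
  return 0
}

# Case 3: secondary_check_verdict 172.16.120.11=unreachable 172.16.120.12=unreachable
# Case 4: secondary_check_verdict 172.16.120.11=master_unreachable 172.16.120.12=master_reachable
```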

ping_type=INSERT

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

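Because ping_type=INSERT keeps a persistent connection, nothing abnormal happens at first: the rules above only drop new TCP connection attempts (--syn), so the manager's already established connection to the master keeps working.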

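Kill the connection: on the master, kill the mha session coming from the manager (172.16.120.13), which forces the manager to reconnect.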

root@localhost 15:55:23 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time  | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |    19 |                                                               | NULL             |         1 |             0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL               | Sleep            |     7 |                                                               | NULL             |         0 |             0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL               | Sleep            |     4 |                                                               | NULL             |         0 |             0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL               | Sleep            |    27 |                                                               | NULL             |         1 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |    14 |                                                               | NULL             |         1 |             0 |
| 1341 | mha      | 172.16.120.12:56698 | information_schema | Sleep            |   114 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |    51 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |     9 |                                                               | NULL             |         0 |             0 |
| 2409 | repler   | 172.16.120.11:32918 | NULL               | Binlog Dump GTID | 12703 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2411 | repler   | 172.16.120.12:58084 | NULL               | Binlog Dump GTID | 12678 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 2627 | root     | localhost           | dbms_monitor       | Query            |     0 | starting                                                      | show processlist |         0 |             0 |
| 3022 | mha      | 172.16.120.13:36466 | NULL               | Sleep            |     2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 16:09:44 [dbms_monitor]> kill 3022;
Query OK, 0 rows affected (0.00 sec)

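Conclusion: no failover. After the kill, the manager cannot reconnect; 172.16.120.11 cannot reach the master either, but 172.16.120.12 still can (Master is reachable from 172.16.120.12!), so the secondary check vetoes the failover.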

Fri Oct  9 16:08:29 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 16:08:30 2020 - [info] GTID failover mode = 1
Fri Oct  9 16:08:30 2020 - [info] Dead Servers:
Fri Oct  9 16:08:30 2020 - [info] Alive Servers:
Fri Oct  9 16:08:30 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:08:30 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 16:08:30 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 16:08:30 2020 - [info] Alive Slaves:
Fri Oct  9 16:08:30 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:08:30 2020 - [info]     GTID ON
Fri Oct  9 16:08:30 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:08:30 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:08:30 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 16:08:30 2020 - [info]     GTID ON
Fri Oct  9 16:08:30 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:08:30 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 16:08:30 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 16:08:30 2020 - [info] Checking slave configurations..
Fri Oct  9 16:08:30 2020 - [info] Checking replication filtering settings..
Fri Oct  9 16:08:30 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 16:08:30 2020 - [info]  Replication filtering check ok.
Fri Oct  9 16:08:30 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 16:08:30 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 16:08:30 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 16:08:30 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 16:08:30 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 16:08:30 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 16:08:30 2020 - [info]  OK.
Fri Oct  9 16:08:30 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 16:08:30 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 16:08:30 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 16:08:30 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 16:08:30 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 16:09:51 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 16:09:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 16:09:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:09:56 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct  9 16:09:57 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 16:09:57 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:09:57 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:10:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:00 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:10:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:03 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 16:10:03 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 16:10:06 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:06 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 16:10:06 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 16:10:06 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:10:09 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:09 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:10:11 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct  9 16:10:12 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct  9 16:10:12 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:12 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 16:10:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:15 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 16:10:15 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct  9 16:10:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:18 2020 - [warning] Connection failed 1 time(s)..
Fri Oct  9 16:10:18 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 16:10:18 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 16:10:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 16:10:21 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 16:10:21 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 16:10:21 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct  9 16:10:21 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
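The log above shows the manager's monitoring loop at work. It counts consecutive failed pings up to 4 and then consults the secondary check. Because the master was still reachable from 172.16.120.12, it reset the count and started over, and a later successful ping ended the episode. The sketch below is a simplified model of that behaviour, inferred only from these log messages and not taken from MHA's code. The function name `health_check` and its `ok`/`fail` and `failover`/`no_failover` inputs are hypothetical.

```shell
#!/bin/sh
# Simplified model (assumption inferred from manager.log, NOT MHA source code).
# The manager counts consecutive failed pings. When the count reaches 4
# ("Connection failed 4 time(s)"), it declares the master dead only if the
# secondary check said "failover". Otherwise it resets the count and keeps
# checking. Any successful ping also resets the count.
#   $1    : secondary check verdict (failover | no_failover)
#   $2... : ping results in order (ok | fail), one per ping_interval
health_check() {
    verdict=$1
    shift
    failures=0
    n=0
    for ping in "$@"; do
        n=$((n + 1))
        if [ "$ping" = ok ]; then
            failures=0                 # "Ping(...) succeeded": counting starts over
            continue
        fi
        failures=$((failures + 1))     # "Connection failed N time(s)"
        if [ "$failures" -ge 4 ]; then
            if [ "$verdict" = failover ]; then
                echo "master dead after ping $n"
                return 0
            fi
            failures=0                 # "Failover should not start so checking server status again"
        fi
    done
    echo "master considered alive"
}

# The case logged above: 4 failures, reset, 2 more failures, then a success
health_check no_failover fail fail fail fail fail fail ok
```

With `ping_interval=3`, four failed pings span about three intervals. That roughly matches the 10 to 12 seconds between the first error and the fourth failure in these logs (each CONNECT attempt also waits up to `mysql_connect_timeout=1`).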

[Test case] Network fault between master and MHA manager 5

Manager <-- blocked --> Master
Manager <-- OK --> S1 <-- blocked --> Master
Manager <-- OK --> S2 <-- blocked --> Master
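In this scenario the failover decision depends on `secondary_check_script`. The manager itself cannot reach the master, so it asks each slave listed with `-s` to try. The sketch below models the verdict as it shows up in the log messages of these tests: checking stops at the first slave that can still reach the master, and failover proceeds only if every slave is reachable and none of them can reach the master. This is an assumption-labelled simplification, not the code of `masterha_secondary_check`. The function `secondary_verdict` and the input words `down`/`up`/`nossh` are hypothetical. The `nossh` case reflects the TL;DR result that an unreachable slave blocks failover.

```shell
#!/bin/sh
# Simplified model (assumption, NOT masterha_secondary_check's source) of the
# verdict seen in manager.log. Each argument is what one monitoring slave saw:
#   down  - slave reachable via SSH, master NOT reachable from it
#   up    - master still reachable from that slave
#   nossh - the slave itself cannot be reached over SSH
secondary_verdict() {
    saw_error=0
    for seen in "$@"; do
        case "$seen" in
            down)  ;;                                 # "... Master is not reachable from X. OK."
            up)    echo "no_failover"; return 0 ;;   # "Master is reachable from X!" ends the check
            nossh) saw_error=1 ;;                     # monitoring server itself unreachable
            *)     echo "unknown input: $seen" >&2; return 2 ;;
        esac
    done
    if [ "$saw_error" -eq 1 ]; then
        echo "error"       # script reports errors: failover should not start
    else
        echo "failover"    # "Master is not reachable from all other monitoring servers"
    fi
}

secondary_verdict down up     # previous case: S1 cannot reach the master, S2 can
secondary_verdict down down   # this case: neither slave can reach the master
```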

ping_type=CONNECT

On the master:

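IPTABLES="/sbin/iptables"
# Flush existing filter rules (chain policies are left unchanged)
$IPTABLES -F
# Keep answering ICMP, so plain ping still works
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
# Accept TCP from the master's own address (e.g. the local ProxySQL)
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
# Drop every other new TCP connection (SYN); already-established
# connections such as replication threads are not affected
$IPTABLES -A INPUT -p tcp --syn -j DROP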
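These rules drop only packets that open a new TCP connection. The sketch below runs a packet through the same three rules to show which traffic the master still accepts. It assumes the INPUT chain policy is ACCEPT, because `iptables -F` flushes rules but does not reset policies. The function name `input_verdict` and its `syn`/`est` flag are hypothetical.

```shell
#!/bin/sh
# Rough simulation of the three INPUT rules above (assumption: chain policy ACCEPT).
#   $1 protocol : icmp | tcp
#   $2 source IP
#   $3 flags    : "syn" if the packet opens a new TCP connection, "est" otherwise
MASTER_IP=172.16.120.10

input_verdict() {
    proto=$1 src=$2 flags=$3
    if [ "$proto" = icmp ]; then
        echo ACCEPT; return            # rule 1: any ICMP
    fi
    if [ "$proto" = tcp ] && [ "$src" = "$MASTER_IP" ]; then
        echo ACCEPT; return            # rule 2: TCP from the master itself
    fi
    if [ "$proto" = tcp ] && [ "$flags" = syn ]; then
        echo DROP; return              # rule 3: new TCP connection from anywhere else
    fi
    echo ACCEPT                        # no rule matched: chain policy (assumed ACCEPT)
}

input_verdict tcp 172.16.120.13 syn   # manager opening a new connection (CONNECT ping, SSH)
input_verdict tcp 172.16.120.11 syn   # slave1 trying to reach the master during the secondary check
input_verdict tcp 172.16.120.13 est   # packets on a connection opened before the rules
input_verdict tcp 172.16.120.11 est   # slave1's existing replication connection
```

Under this model a ping_type=CONNECT check from the manager is always a new connection and is dropped. Packets on connections that existed before the rules were applied are still accepted. This explains why the processlist below still shows both `Binlog Dump GTID` threads.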

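Conclusion: failover is triggered and succeeds. With ping_type=CONNECT the manager opens a new connection for every ping, and the rules drop it. Both slaves also report that the master is unreachable from them.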
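The manager.log excerpts in these tests run to hundreds of lines. The sketch below extracts only the lines where MHA reports a decision. The function name `mha_verdicts` is hypothetical. The patterns are copied from messages that appear in this article's logs, so other MHA versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: print only the decision lines from an MHA manager log.
# Reads the files given as arguments, or standard input if there are none.
mha_verdicts() {
    grep -E 'Connection failed [0-9]+ time|reachable from|Failover should|Master is down|completed successfully' "$@"
}

# Usage (path taken from manager_log in cls_new.cnf):
#   mha_verdicts /masterha/cls_new/manager.log
```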

Fri Oct  9 18:21:13 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 18:21:14 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:21:14 2020 - [info] Dead Servers:
Fri Oct  9 18:21:14 2020 - [info] Alive Servers:
Fri Oct  9 18:21:14 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:21:14 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:21:14 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:21:14 2020 - [info] Alive Slaves:
Fri Oct  9 18:21:14 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:21:14 2020 - [info]     GTID ON
Fri Oct  9 18:21:14 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:21:14 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:21:14 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:21:14 2020 - [info]     GTID ON
Fri Oct  9 18:21:14 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:21:14 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:21:14 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:21:14 2020 - [info] Checking slave configurations..
Fri Oct  9 18:21:14 2020 - [info] Checking replication filtering settings..
Fri Oct  9 18:21:14 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 18:21:14 2020 - [info]  Replication filtering check ok.
Fri Oct  9 18:21:14 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 18:21:14 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 18:21:14 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 18:21:14 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 18:21:14 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 18:21:14 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 18:21:14 2020 - [info]  OK.
Fri Oct  9 18:21:14 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 18:21:14 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 18:21:14 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 18:21:14 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 18:21:14 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 18:22:07 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:22:07 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 18:22:07 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct  9 18:22:11 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:22:11 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 18:22:12 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Fri Oct  9 18:22:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:22:14 2020 - [warning] Connection failed 3 time(s)..
Fri Oct  9 18:22:17 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:22:17 2020 - [warning] Connection failed 4 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Fri Oct  9 18:22:18 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct  9 18:22:18 2020 - [warning] Master is not reachable from health checker!
Fri Oct  9 18:22:18 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct  9 18:22:18 2020 - [warning] SSH is NOT reachable.
Fri Oct  9 18:22:18 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct  9 18:22:18 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct  9 18:22:18 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 18:22:18 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 18:22:19 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:22:19 2020 - [info] Dead Servers:
Fri Oct  9 18:22:19 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:19 2020 - [info] Alive Servers:
Fri Oct  9 18:22:19 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:22:19 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:22:19 2020 - [info] Alive Slaves:
Fri Oct  9 18:22:19 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:19 2020 - [info]     GTID ON
Fri Oct  9 18:22:19 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:19 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:19 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:19 2020 - [info]     GTID ON
Fri Oct  9 18:22:19 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:19 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:19 2020 - [info] Checking slave configurations..
Fri Oct  9 18:22:19 2020 - [info] Checking replication filtering settings..
Fri Oct  9 18:22:19 2020 - [info]  Replication filtering check ok.
Fri Oct  9 18:22:19 2020 - [info] Master is down!
Fri Oct  9 18:22:19 2020 - [info] Terminating monitoring script.
Fri Oct  9 18:22:19 2020 - [info] Got exit code 20 (Master dead).
Fri Oct  9 18:22:19 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct  9 18:22:19 2020 - [info] Starting master failover.
Fri Oct  9 18:22:19 2020 - [info] 
Fri Oct  9 18:22:19 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct  9 18:22:19 2020 - [info] 
Fri Oct  9 18:22:20 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:22:20 2020 - [info] Dead Servers:
Fri Oct  9 18:22:20 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:20 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Oct  9 18:22:21 2020 - [info]  ok.
Fri Oct  9 18:22:21 2020 - [info] Alive Servers:
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:22:21 2020 - [info] Alive Slaves:
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info] Starting GTID based failover.
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct  9 18:22:21 2020 - [info] Executing master IP deactivation script:
Fri Oct  9 18:22:21 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stop 
Disabling the VIP on old master: 172.16.120.10 
Fake!!! old master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1 
Fri Oct  9 18:22:21 2020 - [info]  done.
Fri Oct  9 18:22:21 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct  9 18:22:21 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:3084318
Fri Oct  9 18:22:21 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:10391-19970
Fri Oct  9 18:22:21 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:3084318
Fri Oct  9 18:22:21 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:10391-19970
Fri Oct  9 18:22:21 2020 - [info] Oldest slaves:
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] Searching new master from slaves..
Fri Oct  9 18:22:21 2020 - [info]  Candidate masters from the configuration file:
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:22:21 2020 - [info]     GTID ON
Fri Oct  9 18:22:21 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:22:21 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:22:21 2020 - [info]  Non-candidate masters:
Fri Oct  9 18:22:21 2020 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct  9 18:22:21 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:22:21 2020 - [info] Starting master failover..
Fri Oct  9 18:22:21 2020 - [info] 
From:
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
 +--172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info]  Waiting all logs to be applied.. 
Fri Oct  9 18:22:21 2020 - [info]   done.
Fri Oct  9 18:22:21 2020 - [info] Getting new master's binlog name and position..
Fri Oct  9 18:22:21 2020 - [info]  mysql-bin.000008:3052407
Fri Oct  9 18:22:21 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct  9 18:22:21 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 3052407, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-19970,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct  9 18:22:21 2020 - [info] Executing master IP activate script:
Fri Oct  9 18:22:21 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11 
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct  9 18:22:21 2020 - [info]  OK.
Fri Oct  9 18:22:21 2020 - [info] ** Finished master recovery successfully.
Fri Oct  9 18:22:21 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct  9 18:22:21 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 68999. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009182219.log if it takes time..
Fri Oct  9 18:22:22 2020 - [info] 
Fri Oct  9 18:22:22 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct  9 18:22:22 2020 - [info] 
Fri Oct  9 18:22:21 2020 - [info]  Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct  9 18:22:21 2020 - [info]  Executed CHANGE MASTER.
Fri Oct  9 18:22:21 2020 - [info]  Slave started.
Fri Oct  9 18:22:21 2020 - [info]  gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-19970,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct  9 18:22:22 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct  9 18:22:22 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct  9 18:22:22 2020 - [info] All new slave servers recovered successfully.
Fri Oct  9 18:22:22 2020 - [info] 
Fri Oct  9 18:22:22 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct  9 18:22:22 2020 - [info] 
Fri Oct  9 18:22:22 2020 - [info] Resetting slave info on the new master..
Fri Oct  9 18:22:22 2020 - [info]  172.16.120.11: Resetting slave info succeeded.
Fri Oct  9 18:22:22 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 18:22:22 2020 - [info] 

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 18:22:22 2020 - [info] Sending mail..
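The timestamps above let you time detection and failover directly. The sketch below is a small helper that assumes all timestamps fall on the same day. The names `hms_to_sec` and `elapsed` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: convert HH:MM:SS to seconds since midnight and
# compute the difference between two same-day timestamps from manager.log.
hms_to_sec() {
    # awk treats "08" as decimal 8, avoiding shell octal surprises
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

elapsed() {
    echo $(( $(hms_to_sec "$2") - $(hms_to_sec "$1") ))
}

elapsed 18:22:07 18:22:18   # first failed ping -> "Master is not reachable from health checker!": 11
elapsed 18:22:18 18:22:22   # -> "completed successfully": 4
elapsed 18:22:07 18:22:22   # total: 15
```

The fault itself likely began somewhat before the first failed ping, by up to roughly one `ping_interval` (3 s). The downtime seen by applications in this run is therefore probably a little longer than 15 s.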

ping_type=INSERT

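With ping_type=INSERT the manager keeps a persistent connection to the master. The iptables rules drop only new connections (SYN packets), so this established connection keeps working and the manager sees no fault.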

Kill the MHA manager's connection on the master (Id 3364, from the manager host 172.16.120.13):

root@localhost 18:24:52 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id   | User     | Host                | db                 | Command          | Time | State                                                         | Info             | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL               | Sleep            |    1 |                                                               | NULL             |         0 |             0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL               | Sleep            |    2 |                                                               | NULL             |         0 |             0 |
| 1343 | mha      | 172.16.120.11:59758 | information_schema | Sleep            |   74 |                                                               | NULL             |         0 |             0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL               | Sleep            |    2 |                                                               | NULL             |         1 |             0 |
| 3192 | proxysql | 172.16.120.11:33786 | NULL               | Sleep            |    3 |                                                               | NULL             |         1 |             0 |
| 3238 | proxysql | 172.16.120.12:59006 | NULL               | Sleep            |    1 |                                                               | NULL             |         1 |             0 |
| 3245 | proxysql | 172.16.120.11:33888 | NULL               | Sleep            |    2 |                                                               | NULL             |         0 |             0 |
| 3262 | root     | localhost           | dbms_monitor       | Query            |    0 | starting                                                      | show processlist |         0 |             0 |
| 3357 | repler   | 172.16.120.11:34036 | NULL               | Binlog Dump GTID |  142 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 3359 | repler   | 172.16.120.12:59166 | NULL               | Binlog Dump GTID |  123 | Master has sent all binlog to slave; waiting for more updates | NULL             |         0 |             0 |
| 3364 | mha      | 172.16.120.13:37512 | NULL               | Sleep            |    2 |                                                               | NULL             |         0 |             0 |
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 18:26:25 [dbms_monitor]> kill 3364;
Query OK, 0 rows affected (0.01 sec)
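Note that the processlist contains two `mha` sessions: 1343 from 172.16.120.11 and 3364 from 172.16.120.13. Only 3364 comes from the manager host. The sketch below filters tab-separated processlist output by user and host instead of picking the Id by eye. The function name is hypothetical, and the piped `mysql ... | while read` line in the comment is an illustrative usage, not something run in this test.

```shell
#!/bin/sh
# Hypothetical helper: print the Id of every connection opened by user "mha"
# from the manager host 172.16.120.13. Input is tab-separated SHOW PROCESSLIST
# output (columns: Id, User, Host, ...), e.g. from `mysql -N -B -e 'show processlist'`.
mha_manager_conn_ids() {
    # Host looks like "172.16.120.13:37512"; require the "IP:" prefix so
    # that e.g. 172.16.120.130 is not matched.
    awk -F'\t' -v user=mha -v host=172.16.120.13 \
        '$2 == user && index($3, host ":") == 1 { print $1 }'
}

# Illustrative usage on the master:
#   mysql -N -B -e 'show processlist' | mha_manager_conn_ids |
#     while read -r id; do mysql -e "KILL $id"; done
```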

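Conclusion: failover happens only after the persistent connection is broken. Once it is killed, the manager must open a new connection, which the rules drop. As long as the persistent connection stays open, no failover occurs.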

Fri Oct  9 18:25:33 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct  9 18:25:34 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:25:34 2020 - [info] Dead Servers:
Fri Oct  9 18:25:34 2020 - [info] Alive Servers:
Fri Oct  9 18:25:34 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:25:34 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:25:34 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:25:34 2020 - [info] Alive Slaves:
Fri Oct  9 18:25:34 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:25:34 2020 - [info]     GTID ON
Fri Oct  9 18:25:34 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:25:34 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:25:34 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:25:34 2020 - [info]     GTID ON
Fri Oct  9 18:25:34 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:25:34 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:25:34 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:25:34 2020 - [info] Checking slave configurations..
Fri Oct  9 18:25:34 2020 - [info] Checking replication filtering settings..
Fri Oct  9 18:25:34 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Oct  9 18:25:34 2020 - [info]  Replication filtering check ok.
Fri Oct  9 18:25:34 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct  9 18:25:34 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct  9 18:25:35 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct  9 18:25:35 2020 - [info] 
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

Fri Oct  9 18:25:35 2020 - [info] Checking master_ip_failover_script status:
Fri Oct  9 18:25:35 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 
Fri Oct  9 18:25:35 2020 - [info]  OK.
Fri Oct  9 18:25:35 2020 - [warning] shutdown_script is not defined.
Fri Oct  9 18:25:35 2020 - [info] Set master ping interval 3 seconds.
Fri Oct  9 18:25:35 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct  9 18:25:35 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct  9 18:25:35 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct  9 18:26:44 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct  9 18:26:44 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12  --user=root  --master_host=172.16.120.10  --master_ip=172.16.120.10  --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct  9 18:26:44 2020 - [info] Executing SSH check script: exit 0
Fri Oct  9 18:26:49 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Fri Oct  9 18:26:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:26:50 2020 - [warning] Connection failed 2 time(s)..
Fri Oct  9 18:26:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:26:53 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Fri Oct  9 18:26:54 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct  9 18:26:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct  9 18:26:56 2020 - [warning] Connection failed 4 time(s)..
Fri Oct  9 18:26:56 2020 - [warning] Master is not reachable from health checker!
Fri Oct  9 18:26:56 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct  9 18:26:56 2020 - [warning] SSH is NOT reachable.
Fri Oct  9 18:26:56 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct  9 18:26:56 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct  9 18:26:56 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 18:26:56 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct  9 18:26:57 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:26:57 2020 - [info] Dead Servers:
Fri Oct  9 18:26:57 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:57 2020 - [info] Alive Servers:
Fri Oct  9 18:26:57 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:26:57 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:26:57 2020 - [info] Alive Slaves:
Fri Oct  9 18:26:57 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:57 2020 - [info]     GTID ON
Fri Oct  9 18:26:57 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:57 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:57 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:57 2020 - [info]     GTID ON
Fri Oct  9 18:26:57 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:57 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:57 2020 - [info] Checking slave configurations..
Fri Oct  9 18:26:57 2020 - [info] Checking replication filtering settings..
Fri Oct  9 18:26:57 2020 - [info]  Replication filtering check ok.
Fri Oct  9 18:26:57 2020 - [info] Master is down!
Fri Oct  9 18:26:57 2020 - [info] Terminating monitoring script.
Fri Oct  9 18:26:57 2020 - [info] Got exit code 20 (Master dead).
Fri Oct  9 18:26:57 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct  9 18:26:57 2020 - [info] Starting master failover.
Fri Oct  9 18:26:57 2020 - [info] 
Fri Oct  9 18:26:57 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct  9 18:26:57 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] GTID failover mode = 1
Fri Oct  9 18:26:58 2020 - [info] Dead Servers:
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info] Alive Servers:
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:26:58 2020 - [info] Alive Slaves:
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info] Starting GTID based failover.
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct  9 18:26:58 2020 - [info] Executing master IP deactivation script:
Fri Oct  9 18:26:58 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stop 
Disabling the VIP on old master: 172.16.120.10 
Fake!!! original master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1 
Fri Oct  9 18:26:58 2020 - [info]  done.
Fri Oct  9 18:26:58 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct  9 18:26:58 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:3101017
Fri Oct  9 18:26:58 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:19971-20041
Fri Oct  9 18:26:58 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:3101017
Fri Oct  9 18:26:58 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:19971-20041
Fri Oct  9 18:26:58 2020 - [info] Oldest slaves:
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] Searching new master from slaves..
Fri Oct  9 18:26:58 2020 - [info]  Candidate masters from the configuration file:
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.11(172.16.120.11:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info]   172.16.120.12(172.16.120.12:3358)  Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct  9 18:26:58 2020 - [info]     GTID ON
Fri Oct  9 18:26:58 2020 - [info]     Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct  9 18:26:58 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Fri Oct  9 18:26:58 2020 - [info]  Non-candidate masters:
Fri Oct  9 18:26:58 2020 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct  9 18:26:58 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct  9 18:26:58 2020 - [info] Starting master failover..
Fri Oct  9 18:26:58 2020 - [info] 
From:
172.16.120.10(172.16.120.10:3358) (current master)
 +--172.16.120.11(172.16.120.11:3358)
 +--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
 +--172.16.120.12(172.16.120.12:3358)
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info]  Waiting all logs to be applied.. 
Fri Oct  9 18:26:58 2020 - [info]   done.
Fri Oct  9 18:26:58 2020 - [info] Getting new master's binlog name and position..
Fri Oct  9 18:26:58 2020 - [info]  mysql-bin.000008:3068991
Fri Oct  9 18:26:58 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct  9 18:26:58 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 3068991, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20041,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct  9 18:26:58 2020 - [info] Executing master IP activate script:
Fri Oct  9 18:26:58 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11 
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct  9 18:26:58 2020 - [info]  OK.
Fri Oct  9 18:26:58 2020 - [info] ** Finished master recovery successfully.
Fri Oct  9 18:26:58 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct  9 18:26:58 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 69850. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009182657.log if it takes time..
Fri Oct  9 18:26:59 2020 - [info] 
Fri Oct  9 18:26:59 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct  9 18:26:59 2020 - [info] 
Fri Oct  9 18:26:58 2020 - [info]  Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct  9 18:26:58 2020 - [info]  Executed CHANGE MASTER.
Fri Oct  9 18:26:58 2020 - [info]  Slave started.
Fri Oct  9 18:26:58 2020 - [info]  gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20041,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct  9 18:26:59 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct  9 18:26:59 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct  9 18:26:59 2020 - [info] All new slave servers recovered successfully.
Fri Oct  9 18:26:59 2020 - [info] 
Fri Oct  9 18:26:59 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct  9 18:26:59 2020 - [info] 
Fri Oct  9 18:26:59 2020 - [info] Resetting slave info on the new master..
Fri Oct  9 18:26:59 2020 - [info]  172.16.120.11: Resetting slave info succeeded.
Fri Oct  9 18:26:59 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 18:26:59 2020 - [info] 

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct  9 18:26:59 2020 - [info] Sending mail..
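The first part of the log shows the rule behind the TL;DR table. The manager's own checks fail ("Connection failed 4 time(s)"), and `masterha_secondary_check` reports that the master is not reachable from 172.16.120.11 or from 172.16.120.12. Only after both of these does the log print "Master is not reachable from all other monitoring servers. Failover should start." The Python sketch below models that rule as a pure function. It is a simplified model and not MHA's code. `ProbeResult` and `should_start_failover` are hypothetical names. Treating an unreachable monitoring server as a veto is an assumption drawn from the "cannot ssh slave1" rows of the table.

```python
from __future__ import annotations

from enum import Enum


class ProbeResult(Enum):
    """Outcome of checking the master from one monitoring server (hypothetical model)."""
    HOST_UNREACHABLE = "ssh to the monitoring server failed"
    MASTER_REACHABLE = "master reachable from the monitoring server"
    MASTER_UNREACHABLE = "master not reachable from the monitoring server"


def should_start_failover(manager_reaches_master: bool,
                          probes: dict[str, ProbeResult]) -> bool:
    """Return True only when every independent path agrees the master is down."""
    if manager_reaches_master:
        # The manager's own health check still succeeds, so there is nothing to do.
        return False
    # A monitoring server that cannot be reached, or that can still see the
    # master, vetoes failover (assumption based on the TL;DR table).
    # With no monitoring servers, all() over an empty dict is True, so in this
    # model the manager's own verdict decides.
    return all(r is ProbeResult.MASTER_UNREACHABLE for r in probes.values())


if __name__ == "__main__":
    # Scenario from the log above: both monitoring servers report the master down.
    probes = {
        "172.16.120.11": ProbeResult.MASTER_UNREACHABLE,
        "172.16.120.12": ProbeResult.MASTER_UNREACHABLE,
    }
    print(should_start_failover(False, probes))  # True
```

Swapping either entry to `MASTER_REACHABLE` or `HOST_UNREACHABLE` returns False. This matches the rows of the table where only the manager, or only one slave, loses sight of the master.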
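Phases 3.1 and 3.3 of the log choose the new master. Both slaves have read up to mysql-bin.000009:3101017, both have candidate_master set, and 172.16.120.11 is selected. The Python sketch below reproduces only that outcome, under simplifying assumptions. It compares the (Master_Log_File, Read_Master_Log_Pos) coordinates and keeps the slaves at the latest coordinate. It then returns the first candidate_master among them in configuration order, falling back to the first latest slave. MHA's real selection also takes settings such as `no_master` and `check_repl_delay` into account. `Slave`, `binlog_coord` and `pick_new_master` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Slave:
    host: str
    master_log_file: str        # Master_Log_File, e.g. "mysql-bin.000009"
    read_master_log_pos: int    # Read_Master_Log_Pos
    candidate_master: bool = False


def binlog_coord(slave: Slave) -> tuple[int, int]:
    """Turn 'mysql-bin.000009' plus a position into a sortable (sequence, position) pair."""
    seq = int(slave.master_log_file.rsplit(".", 1)[1])
    return seq, slave.read_master_log_pos


def pick_new_master(slaves: list[Slave]) -> Slave:
    """Simplified model: prefer a candidate_master among the most up-to-date slaves."""
    if not slaves:
        raise ValueError("no alive slaves to promote")
    latest = max(binlog_coord(s) for s in slaves)
    latest_slaves = [s for s in slaves if binlog_coord(s) == latest]
    for s in latest_slaves:  # list order stands in for configuration order
        if s.candidate_master:
            return s
    return latest_slaves[0]


if __name__ == "__main__":
    # Values taken from the "Latest slaves" section of the log above.
    slaves = [
        Slave("172.16.120.11", "mysql-bin.000009", 3101017, candidate_master=True),
        Slave("172.16.120.12", "mysql-bin.000009", 3101017, candidate_master=True),
    ]
    print(pick_new_master(slaves).host)  # 172.16.120.11
```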