I. Environment
OS: Linux el6uek x86_64
MySQL: 5.7.26
MHA: 0.56
II. Topology
1 master, 2 slaves, 1 MHA Manager
192.168.65.2 test2 # master MHA Node
192.168.65.3 test3 # slave1 MHA Node
192.168.65.4 test4 # slave2 MHA Node
192.168.65.5 test5 # MHA Manager
III. Configure master-slave replication
GTID-based replication setup is not the focus of this article and is omitted here; see my other articles if you need it.
PS: on the slaves, be sure to run set global read_only=1 and set global relay_log_purge=0.
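The two SET GLOBAL statements only last until the next restart. A minimal my.cnf fragment for the slaves persists them (the option names are the standard MySQL server options; adjust paths/sections to your own my.cnf layout):

```ini
# my.cnf on each slave (test3/test4)
[mysqld]
read_only       = 1   # reject writes from ordinary users on the slave
relay_log_purge = 0   # keep relay logs so MHA can use them during failover
```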
IV. Set up SSH trust
PS: any step below marked ALL must be executed on every server.
1. Edit the hosts file (ALL)
# vi /etc/hosts and add the following:
192.168.65.2 test2
192.168.65.3 test3
192.168.65.5 test5
192.168.65.4 test4
2. Set up SSH trust (ALL)
Run ssh-keygen -t rsa and press Enter three times:
[root@test2 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
ea:55:f3:80:99:ae:4d:40:ee:62:11:fd:f1:df:38:68 root@test2
The key's randomart image is:
+--[ RSA 2048]----+
| |
| . |
| . o . |
| + . * |
| . o S = |
| o + . * o |
| o o + E = . |
| . o = . . |
| o . |
+-----------------+
Then push the public key to every host:
[root@test2 ~]# ssh-copy-id 192.168.65.2
[root@test2 ~]# ssh-copy-id 192.168.65.3
[root@test2 ~]# ssh-copy-id 192.168.65.4
[root@test2 ~]# ssh-copy-id 192.168.65.5
3. Verify the trust (ALL)
[root@test2 ~]# ssh 192.168.65.2
[root@test2 ~]# ssh 192.168.65.3
[root@test2 ~]# ssh 192.168.65.4
[root@test2 ~]# ssh 192.168.65.5
If every host logs you straight in (showing the last-login banner) without asking for a password, the trust works.
V. Install MHA Manager and MHA Node
1. Install dependencies (ALL)
yum install -y perl-DBD-MySQL perl-Params-Validate
2. Install MHA Node (ALL)
[root@test5 mha_soft]# yum install -y mha4mysql-node-0.56-0.el6.noarch.rpm
3. Install MHA Manager (management node only: 192.168.65.5)
a. Install dependencies
[root@test5 mha_soft]# yum install -y \
    compat-db43-4.3.29-15.el6.x86_64.rpm \
    perl-Config-Tiny-2.12-7.1.el6.noarch.rpm \
    perl-Email-Date-Format-1.002-5.el6.noarch.rpm \
    perl-Log-Dispatch-2.27-1.el6.noarch.rpm \
    perl-Mail-Sender-0.8.16-3.el6.noarch.rpm \
    perl-Mail-Sendmail-0.79-12.el6.noarch.rpm \
    perl-MIME-Lite-3.027-2.el6.noarch.rpm \
    perl-MIME-Types-1.28-2.el6.noarch.rpm \
    perl-Parallel-ForkManager-0.7.9-1.el6.noarch.rpm
b. Install MHA Manager
[root@test5 mha_soft]# yum install -y mha4mysql-manager-0.56-0.el6.noarch.rpm
c. Verify the installation
[root@test5 mha_soft]# masterha_manager --version
VI. Create the MHA Manager configuration file
[root@test5 ]# mkdir /etc/masterha/app1/ -p
[root@test5 ]# cd /etc/masterha/app1/
[root@test5 app1]# vi app1.cnf
[server default]
user=root
password=123
manager_workdir=/etc/masterha/app1
manager_log=/etc/masterha/app1/manager.log
remote_workdir = /etc/masterha/app1
ssh_user=root
repl_user=rpl
repl_password=123
ping_interval=1
master_ip_failover_script=/etc/masterha/app1/master_ip_failover # run on master failover; not needed if you do not use a VIP
#shutdown_script=/etc/masterha/power_manager
#report_script=/etc/masterha/app1/send_report # run on master failover; optional
master_ip_online_change_script=/etc/masterha/app1/master_ip_online_change # run on manual master switchover; not needed if you do not use a VIP
[server1]
hostname=192.168.65.2
port=3306
master_binlog_dir=/data/mysql/data
check_repl_delay=0
candidate_master=1
[server2]
hostname=192.168.65.3
port=3306
master_binlog_dir=/data/mysql/data
candidate_master=1 # prefer this host as the new master. If a candidate lags behind (relay logs over 100MB), failover cannot succeed; check_repl_delay=0 tells MHA to ignore the delay.
check_repl_delay=0
[server3]
hostname=192.168.65.4
port=3308
master_binlog_dir=/data/mysql/logs
check_repl_delay=0
ignore_fail=1 # by default MHA becomes unusable when this node is down; with this parameter, a dead slave does not block MHA
no_master=1 # never promote this host to master
PS:
1. The comments above are only there to explain the parameters; the real configuration file must not contain any Chinese comments, or the checks below will fail.
2. The two script parameters (master_ip_failover_script and master_ip_online_change_script) are optional; if you do not have the scripts or do not need the feature, remove or comment them out.
3. For the scripts themselves, see my other article; if you do not configure them, comment them out, or the replication check in the next step will fail.
4. If you prefer not to edit the configuration file by hand, you can manage host entries with a command:
masterha_conf_host --command=add --conf=/etc/masterha/app1/app1.cnf --hostname=db101 --block=server100 --params="no_master=1;ignore_fail=1"
--block: the section name, i.e. the text inside the brackets, e.g. [server1]
Delete: masterha_conf_host --command=delete --conf=/etc/masterha/app1/app1.cnf --block=server100
Parameter reference:
--command=delete|add # add or delete a host entry in a configuration file
--block # explicitly names the section label of the host entry in the configuration file
--conf # path to the local configuration file
--hostname # hostname or IP of the entry to add
--params # parameter list, separated by semicolons, the whole list quoted
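For reference, the core of a master_ip_failover script is just bringing the VIP up or down on the right host over SSH. The sketch below is illustrative only, not the script used in this article (see the other article for that): the hypothetical vip_command function prints the command it would run instead of executing it, and the VIP/interface values (192.168.65.21 on eth0:1, matching the ifconfig output shown later) are assumptions to adapt.

```shell
#!/bin/bash
# Illustrative dry run: print the ssh command a master_ip_failover script
# would execute for each MHA command. Replace the echo with a real ssh
# invocation in an actual script. VIP and interface are assumed values.
VIP=192.168.65.21
DEV=eth0:1

vip_command() {               # $1 = MHA command, $2 = target host
    case "$1" in
        start)                # new master: bring the VIP up
            echo "ssh root@$2 /sbin/ifconfig $DEV $VIP netmask 255.255.255.0 up" ;;
        stop|stopssh)         # old master: take the VIP down
            echo "ssh root@$2 /sbin/ifconfig $DEV down" ;;
        status)               # health check: is the VIP present?
            echo "ssh root@$2 /sbin/ip addr show to $VIP" ;;
        *)
            echo "unknown command: $1" >&2; return 1 ;;
    esac
}

vip_command start 192.168.65.3
```

A real script must also parse MHA's long options (--command, --orig_master_host, --new_master_host, ...) and exit 0 on success, which is what masterha_check_repl tests below.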
VII. Health checks
1. SSH trust check
[root@test5 app1]# masterha_check_ssh --conf=/etc/masterha/app1/app1.cnf
Sat Jul 20 09:44:47 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 20 09:44:47 2019 - [info] Reading application default configuration from /etc/masterha/app1/app1.cnf..
Sat Jul 20 09:44:47 2019 - [info] Reading server configuration from /etc/masterha/app1/app1.cnf..
Sat Jul 20 09:44:47 2019 - [info] Starting SSH connection tests..
Sat Jul 20 09:44:48 2019 - [debug]
Sat Jul 20 09:44:47 2019 - [debug] Connecting via SSH from root@192.168.65.2(192.168.65.2:22) to
root@192.168.65.3(192.168.65.3:22)..
Sat Jul 20 09:44:47 2019 - [debug] ok.
Sat Jul 20 09:44:47 2019 - [debug] Connecting via SSH from root@192.168.65.2(192.168.65.2:22) to
root@192.168.65.4(192.168.65.4:22)..
Sat Jul 20 09:44:48 2019 - [debug] ok.
Sat Jul 20 09:44:48 2019 - [debug]
Sat Jul 20 09:44:47 2019 - [debug] Connecting via SSH from root@192.168.65.3(192.168.65.3:22) to
root@192.168.65.2(192.168.65.2:22)..
Sat Jul 20 09:44:48 2019 - [debug] ok.
Sat Jul 20 09:44:48 2019 - [debug] Connecting via SSH from root@192.168.65.3(192.168.65.3:22) to
root@192.168.65.4(192.168.65.4:22)..
Sat Jul 20 09:44:48 2019 - [debug] ok.
Sat Jul 20 09:44:49 2019 - [debug]
Sat Jul 20 09:44:48 2019 - [debug] Connecting via SSH from root@192.168.65.4(192.168.65.4:22) to
root@192.168.65.2(192.168.65.2:22)..
Sat Jul 20 09:44:48 2019 - [debug] ok.
Sat Jul 20 09:44:48 2019 - [debug] Connecting via SSH from root@192.168.65.4(192.168.65.4:22) to
root@192.168.65.3(192.168.65.3:22)..
Sat Jul 20 09:44:49 2019 - [debug] ok.
Sat Jul 20 09:44:49 2019 - [info] All SSH connection tests passed successfully.
Seeing "All SSH connection tests passed successfully" means the check passed.
2. Check the replication state
[root@test5 app1]# masterha_check_repl --conf=/etc/masterha/app1/app1.cnf
Sat Jul 20 10:02:30 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 20 10:02:30 2019 - [info] Reading application default configuration from /etc/masterha/app1/app1.cnf..
Sat Jul 20 10:02:30 2019 - [info] Reading server configuration from /etc/masterha/app1/app1.cnf..
Sat Jul 20 10:02:30 2019 - [info] MHA::MasterMonitor version 0.56.
Sat Jul 20 10:02:31 2019 - [info] GTID failover mode = 1
Sat Jul 20 10:02:31 2019 - [info] Dead Servers:
Sat Jul 20 10:02:31 2019 - [info] Alive Servers:
Sat Jul 20 10:02:31 2019 - [info] 192.168.65.2(192.168.65.2:3306)
Sat Jul 20 10:02:31 2019 - [info] 192.168.65.3(192.168.65.3:3306)
Sat Jul 20 10:02:31 2019 - [info] 192.168.65.4(192.168.65.4:3308)
Sat Jul 20 10:02:31 2019 - [info] Alive Slaves:
Sat Jul 20 10:02:31 2019 - [info] 192.168.65.3(192.168.65.3:3306) Version=5.7.26-log (oldest major version between
slaves) log-bin:enabled
Sat Jul 20 10:02:31 2019 - [info] GTID ON
Sat Jul 20 10:02:31 2019 - [info] Replicating from 192.168.65.2(192.168.65.2:3306)
Sat Jul 20 10:02:31 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 20 10:02:31 2019 - [info] 192.168.65.4(192.168.65.4:3308) Version=5.7.26-log (oldest major version between
slaves) log-bin:enabled
Sat Jul 20 10:02:31 2019 - [info] GTID ON
Sat Jul 20 10:02:31 2019 - [info] Replicating from 192.168.65.2(192.168.65.2:3306)
Sat Jul 20 10:02:31 2019 - [info] Not candidate for the new Master (no_master is set)
Sat Jul 20 10:02:31 2019 - [info] Current Alive Master: 192.168.65.2(192.168.65.2:3306)
Sat Jul 20 10:02:31 2019 - [info] Checking slave configurations..
Sat Jul 20 10:02:31 2019 - [info] read_only=1 is not set on slave 192.168.65.3(192.168.65.3:3306).
Sat Jul 20 10:02:31 2019 - [info] read_only=1 is not set on slave 192.168.65.4(192.168.65.4:3308).
Sat Jul 20 10:02:31 2019 - [info] Checking replication filtering settings..
Sat Jul 20 10:02:31 2019 - [info] binlog_do_db= , binlog_ignore_db=
Sat Jul 20 10:02:31 2019 - [info] Replication filtering check ok.
Sat Jul 20 10:02:31 2019 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Jul 20 10:02:31 2019 - [info] Checking SSH publickey authentication settings on the current master..
Sat Jul 20 10:02:31 2019 - [info] HealthCheck: SSH to 192.168.65.2 is reachable.
Sat Jul 20 10:02:31 2019 - [info]
192.168.65.2(192.168.65.2:3306) (current master)
+--192.168.65.3(192.168.65.3:3306)
+--192.168.65.4(192.168.65.4:3308)
Sat Jul 20 10:02:31 2019 - [info] Checking replication health on 192.168.65.3..
Sat Jul 20 10:02:31 2019 - [info] ok.
Sat Jul 20 10:02:31 2019 - [info] Checking replication health on 192.168.65.4..
Sat Jul 20 10:02:31 2019 - [info] ok.
Sat Jul 20 10:02:31 2019 - [info] Checking master_ip_failover_script status:
Sat Jul 20 10:02:31 2019 - [info] /etc/masterha/app1/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.65.2 --orig_master_ip=192.168.65.2 --orig_master_port=3306
Checking the Status of the script.. OK
Sat Jul 20 10:02:31 2019 - [info] OK.
Sat Jul 20 10:02:31 2019 - [warning] shutdown_script is not defined.
Sat Jul 20 10:02:31 2019 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
VIII. Start, stop, and status
1. Start
[root@test5 app1]# nohup masterha_manager --conf=/etc/masterha/app1/app1.cnf > /etc/masterha/app1/mha_manager.log 2>&1 &
PS: once the manager is running, the VIP you configured in master_ip_failover_script and master_ip_online_change_script comes up on the master:
[root@test2 mha]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:DB:FC:D1
inet addr:192.168.65.2 Bcast:192.168.65.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fedb:fcd1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1882124 errors:0 dropped:0 overruns:0 frame:0
TX packets:1869349 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:320631172 (305.7 MiB) TX bytes:320699318 (305.8 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:0C:29:DB:FC:D1
inet addr:192.168.65.21 Bcast:192.168.65.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:292173 errors:0 dropped:0 overruns:0 frame:0
TX packets:292173 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:64526655 (61.5 MiB) TX bytes:64526655 (61.5 MiB)
2. Status
[root@test5 app1]# masterha_check_status --conf=/etc/masterha/app1/app1.cnf
app1 (pid:14405) is running(0:PING_OK), master:192.168.65.2
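If you poll this from a monitoring script, note that masterha_check_status exits 0 only in the PING_OK state, so the exit code alone is usually the most robust signal. As a small convenience (mha_state is a hypothetical helper, not part of MHA), the "N:STATE" token can also be cut out of the message text:

```shell
# Extract the "N:STATE" token (e.g. 0:PING_OK, 2:NOT_RUNNING) from
# masterha_check_status output. Parsing the text is a convenience only;
# the command's own exit code (0 = PING_OK) is the reliable check.
mha_state() {
    sed -n 's/.*(\([0-9][0-9]*:[A-Z_]*\)).*/\1/p'
}

echo "app1 (pid:14405) is running(0:PING_OK), master:192.168.65.2" | mha_state
# → 0:PING_OK
```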
3. Stop
[root@test5 app1]# masterha_stop --conf=/etc/masterha/app1/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1/app1.cnf > /etc/masterha/app1/mha_manager.log 2>&1
[root@test5 app1]# masterha_check_status --conf=/etc/masterha/app1/app1.cnf
app1 is stopped(2:NOT_RUNNING).
IX. Failover tests
1. Automatic failover when the master crashes
a. Kill mysqld on the master
[root@test2 mha]# pkill mysqld
[root@test2 mha]# ps -ef |grep mysql
root 13471 13086 0 10:19 pts/2 00:00:00 grep mysql
b. Check the VIP
[root@test2 mha]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:DB:FC:D1
inet addr:192.168.65.2 Bcast:192.168.65.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fedb:fcd1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1883143 errors:0 dropped:0 overruns:0 frame:0
TX packets:1869929 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:320716057 (305.8 MiB) TX bytes:320778073 (305.9 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:292473 errors:0 dropped:0 overruns:0 frame:0
TX packets:292473 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:64558215 (61.5 MiB) TX bytes:64558215 (61.5 MiB)
[root@test3 mha]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:8F:0E:C5
inet addr:192.168.65.3 Bcast:192.168.65.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe8f:ec5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1821570 errors:0 dropped:0 overruns:0 frame:0
TX packets:1796331 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:301351935 (287.3 MiB) TX bytes:305148308 (291.0 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:0C:29:8F:0E:C5
inet addr:192.168.65.21 Bcast:192.168.65.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:251432 errors:0 dropped:0 overruns:0 frame:0
TX packets:251432 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:56029214 (53.4 MiB) TX bytes:56029214 (53.4 MiB)
You can see the VIP has floated to 192.168.65.3 (the MHA configuration file marks 192.168.65.3 as a candidate master, so the VIP moves there).
c. Check the MHA Manager log
[root@test5 etc]# tail -f masterha/app1/manager.log
----- Failover Report -----
app1: MySQL Master failover 192.168.65.2(192.168.65.2:3306) to 192.168.65.3(192.168.65.3:3306) succeeded
Master 192.168.65.2(192.168.65.2:3306) is down!
Check MHA Manager logs at test5:/etc/masterha/app1/manager.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.65.2(192.168.65.2:3306)
Selected 192.168.65.3(192.168.65.3:3306) as a new master.
192.168.65.3(192.168.65.3:3306): OK: Applying all logs succeeded.
192.168.65.3(192.168.65.3:3306): OK: Activated master IP address.
192.168.65.4(192.168.65.4:3308): OK: Slave started, replicating from 192.168.65.3(192.168.65.3:3306)
192.168.65.3(192.168.65.3:3306): Resetting slave info succeeded.
Master failover to 192.168.65.3(192.168.65.3:3306) completed successfully.
PS: this is only an excerpt of the log; at the end it shows that 192.168.65.3 has been successfully promoted to the new master.
d. Check the actual replication state in the database
mysql> select @@report_host;
+---------------+
| @@report_host |
+---------------+
| 192.168.65.4 |
+---------------+
1 row in set (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.65.3
Master_User: rpl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bin.000001
Read_Master_Log_Pos: 586
Relay_Log_File: relay.000002
Relay_Log_Pos: 787
Relay_Master_Log_File: bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
You can see that replication on 192.168.65.4 has automatically been repointed to the new master.
2. Rejoining a recovered machine to the cluster
# If the old master's Executed_Gtid_Set contains transaction numbers beyond what the new master has for the old master's UUID, then the new master has lost data. In that case you must extract the difference from the old master yourself and check with the application team whether it needs to be replayed on the new master (a painful process); after reconciling, wipe the old master's data and rebuild it as a slave. If the old master's GTID set equals the new master's for that UUID, congratulations: no data was lost.
PS: if you are using traditional (non-GTID) replication, then after the old master comes back you have to look in the switchover log for the "latest binary log file/position" record — the newest old-master binlog position any slave had when the master died — then use mysqlbinlog on the old master to extract everything after that position and manually decide whether it must be applied to the new master.
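The GTID comparison above can be automated with MySQL's GTID_SUBSET() function: GTID_SUBSET(old_set, new_set) returns 1 when every transaction in the old master's set is contained in the new master's. A sketch under this article's hosts/credentials (gtid_verdict is a hypothetical helper; the mysql calls are shown commented out so only the decision logic runs as-is):

```shell
# Decide whether the old master can be rebuilt as a slave directly.
# $1 is the result of SELECT GTID_SUBSET(old_set, new_set) on the new master.
gtid_verdict() {
    if [ "$1" = "1" ]; then
        echo "old master is a subset of the new master: no data lost"
    else
        echo "old master has extra transactions: extract and reconcile them first"
    fi
}

# On a live setup (hosts/credentials from this article):
# old=$(mysql -h192.168.65.2 -uroot -p123 -Nse "SELECT @@GLOBAL.gtid_executed")
# res=$(mysql -h192.168.65.3 -uroot -p123 -Nse "SELECT GTID_SUBSET('$old', @@GLOBAL.gtid_executed)")
# gtid_verdict "$res"
gtid_verdict 1
```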
a. On the MHA Manager node, find the CHANGE MASTER statement:
[root@test5 app1]# grep 'CHANGE MASTER TO' /etc/masterha/app1/manager.log |tail -1
Sat Jul 20 10:16:45 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.65.3', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='rpl', MASTER_PASSWORD='xxx';
b. Run this statement on the old master (replace xxx with the replication user's password):
mysql> CHANGE
-> MASTER TO MASTER_HOST='192.168.65.3', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='rpl',
-> MASTER_PASSWORD='123';
Query OK, 0 rows affected, 2 warnings (0.04 sec)
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.65.3
Master_User: rpl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bin.000001
Read_Master_Log_Pos: 586
Relay_Log_File: relay.000002
Relay_Log_Pos: 787
Relay_Master_Log_File: bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replication is re-established and points at the new master.
3. Manual switchover while the master is alive
a. Check MHA's running state
[root@test5 app1]# masterha_check_status --conf=/etc/masterha/app1/app1.cnf
app1 is stopped(2:NOT_RUNNING).
b. Delete the app1.failover.complete file
[root@test5 app1]# rm -rf app1.failover.complete
PS: prerequisites for a manual switch:
1. MHA Manager must be stopped;
2. the directory given by the manager_workdir parameter must not contain app1.failover.complete — this file marks a completed failover, and MHA Manager shuts itself down when the failover finishes.
c. Perform the online switchover
[root@test5 app1]# masterha_master_switch --conf=/etc/masterha/app1/app1.cnf --master_state=alive \
    --new_master_host=192.168.65.2 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=100
Sat Jul 20 10:50:24 2019 - [info] ok.
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] * Switching slaves in parallel..
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] -- Slave switch on host 192.168.65.4(192.168.65.4:3308) started, pid: 14930
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] Log messages from 192.168.65.4 ...
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] Waiting to execute all relay logs on 192.168.65.4(192.168.65.4:3308)..
Sat Jul 20 10:50:24 2019 - [info] master_pos_wait(bin.000001:586) completed on 192.168.65.4(192.168.65.4:3308). Executed 0
events.
Sat Jul 20 10:50:24 2019 - [info] done.
Sat Jul 20 10:50:24 2019 - [info] Resetting slave 192.168.65.4(192.168.65.4:3308) and starting replication from the new
master 192.168.65.2(192.168.65.2:3306)..
Sat Jul 20 10:50:24 2019 - [info] Executed CHANGE MASTER.
Sat Jul 20 10:50:24 2019 - [info] Slave started.
Sat Jul 20 10:50:24 2019 - [info] End of log messages from 192.168.65.4 ...
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] -- Slave switch on host 192.168.65.4(192.168.65.4:3308) succeeded.
Sat Jul 20 10:50:24 2019 - [info] Unlocking all tables on the orig master:
Sat Jul 20 10:50:24 2019 - [info] Executing UNLOCK TABLES..
Sat Jul 20 10:50:24 2019 - [info] ok.
Sat Jul 20 10:50:24 2019 - [info] Starting orig master as a new slave..
Sat Jul 20 10:50:24 2019 - [info] Resetting slave 192.168.65.3(192.168.65.3:3306) and starting replication from the new
master 192.168.65.2(192.168.65.2:3306)..
Sat Jul 20 10:50:24 2019 - [info] Executed CHANGE MASTER.
Sat Jul 20 10:50:24 2019 - [info] Slave started.
Sat Jul 20 10:50:24 2019 - [info] All new slave servers switched successfully.
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] * Phase 5: New master cleanup phase..
Sat Jul 20 10:50:24 2019 - [info]
Sat Jul 20 10:50:24 2019 - [info] 192.168.65.2: Resetting slave info succeeded.
Sat Jul 20 10:50:24 2019 - [info] Switching master to 192.168.65.2(192.168.65.2:3306) completed successfully.
This is only part of the output. During the switch you are asked to confirm twice with yes:
1. whether to run FLUSH NO_WRITE_TO_BINLOG TABLES on the current master before switching;
2. whether to start the switch.
You can see the master has switched back to 192.168.65.2:
[root@test2 mha]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:DB:FC:D1
inet addr:192.168.65.2 Bcast:192.168.65.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fedb:fcd1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1883723 errors:0 dropped:0 overruns:0 frame:0
TX packets:1870450 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:320768727 (305.9 MiB) TX bytes:320892475 (306.0 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:0C:29:DB:FC:D1
inet addr:192.168.65.21 Bcast:192.168.65.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:293046 errors:0 dropped:0 overruns:0 frame:0
TX packets:293046 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:64612191 (61.6 MiB) TX bytes:64612191 (61.6 MiB)
The VIP is back on 192.168.65.2 as well.
Common masterha_master_switch parameters:
--orig_master_is_new_slave: turn the original master into a slave of the new master after the switch. Without it, the old master does not rejoin as a slave. Using it requires repl_password in the configuration file, because the new master does not otherwise know the replication account's password.
--running_updates_limit=100: if the candidate master lags, the switch would normally fail; with this, the switch is allowed as long as the delay is within the given number of seconds (how long the switch takes still depends on the size of the relay logs during recovery).
--master_state=alive: tells MHA the current master is alive; the other value is dead, telling MHA the current master is down.
--new_master_host: which host becomes the new master.
--new_master_port: the new master's MySQL port.
4. Manual failover when the master is dead
masterha_master_switch --master_state=dead --conf=/etc/masterha/app1/app1.cnf \
    --dead_master_host=192.168.65.2 --dead_master_port=3306 \
    --new_master_host=192.168.65.3 --new_master_port=3306 \
    --ignore_last_failover --remove_dead_master_conf
# If you have followed this document to this point, the switch prompts twice; type yes and press Enter:
* Master 10.10.30.169(10.10.30.169:3306) is dead. Proceed? (yes/NO): yes # confirm the master is really dead
* Starting master switch from 10.10.30.169(10.10.30.169:3306) to 10.10.30.155(10.10.30.155:3306)? (yes/NO): yes # confirm the switch should start
# After a successful switch you should see something like:
Master failover to 10.10.30.155(10.10.30.155:3306) completed successfully.
# Common parameters
--remove_dead_master_conf: remove the dead master's entry from the configuration file after the switch. By default the configuration file is not modified.
--manager_log: log file location.
--ignore_last_failover: by default, if MHA detects two outages less than 8 hours apart, it refuses to fail over again — a guard against ping-pong failovers. After each failover MHA writes app1.failover.complete into the working directory (manager_workdir, /etc/masterha/app1 here); while that file exists, the next failover is blocked unless the file is deleted first. This flag makes MHA ignore the file. Also, with version 0.56 you may be able to do without this parameter and without deleting app1.failover.complete manually.
--dead_master_host: IP of the dead master.
--dead_master_port: MySQL port of the dead master.
# Check app1.cnf — the dead master's entry has been removed:
grep '10.10.30.169' /etc/masterha/app1/app1.cnf
# Conditions required for an online switch
* The IO and SQL threads must be running on all slaves.
* Seconds_Behind_Master on every slave must be below the value given by --running_updates_limit.
* On the master, no update visible in show processlist may have been running longer than --running_updates_limit seconds (in my view this should cover any statement that modifies data, not only UPDATE).
Other parameters:
--interactive=(0|1): set to 0 for a non-interactive switch (default 1). To use it you must already have determined whether the master is dead or online; it is mainly for other programs that invoke masterha_master_switch.
--skip_change_master: new in 0.56. By default the other slaves are repointed (CHANGE MASTER) to the new master during a switch; this flag skips the CHANGE MASTER and START SLAVE after slave recovery, so you can verify by hand on each slave that its data was recovered before repointing it.
--skip_disable_read_only: new in 0.56. By default the new master gets set global read_only=0 to make it writable; this flag skips that statement.
--wait_until_gtid_in_sync(0|1): new in 0.56, default 1: during a failover MHA waits for the other slaves to catch up to the new master's GTID position; 0 disables the waiting.
--ignore_binlog_server_error: new in 0.56; skip any binlog server errors during failover.
--skip_lock_all_tables: by default the old master runs FLUSH TABLES WITH READ LOCK during a switch to make sure writes have stopped, but that is an expensive statement. If you can verify manually that the master has no writes, this option skips it; stopping writes and killing all client connections is then left to your master_ip_online_change_script.