MySQL MGR Node Management and Failover

MySQL Group Replication (MGR) gives MySQL automatic primary failover and failure-recovery capabilities.

This article covers node management and failover:

1) The theory behind MGR node management and failover.

2) Hands-on tests: how a single-primary MGR cluster recovers after the master fails, and how to add and remove nodes in an MGR cluster.

1. Test environment

1.1 Environment layout

Role   | IP             | port | server-id
DB-1   | 192.110.103.41 | 3106 | 103413106
DB-2   | 192.110.103.42 | 3106 | 103423106
DB-3   | 192.110.103.43 | 3106 | 103433106

Note: port 3306 was already taken on these machines, so port 3106 is used instead.

1.2 Installing MGR

2. MGR overview

2.1 What is MGR

MySQL Group Replication (MGR) is a feature introduced in MySQL 5.7 that provides a highly available, scalable, and strongly consistent MySQL cluster service. An MGR group is made up of multiple instance nodes; a transaction can commit only after more than half of the nodes agree to it, and every node maintains a database state machine, which keeps transactions consistent across nodes.

2.2 Single-primary / multi-primary modes

MySQL group replication can run in two modes: single-primary and multi-primary.

Single-primary: one master node is automatically elected from the group's MySQL nodes; only the master can be written to, and the other nodes are automatically set read only. When the master fails, a new master is elected, made writable, and the remaining slaves point to it.

Multi-primary: every node in the group is writable, so there is no master/slave distinction; as long as not too many nodes fail at once, the group stays available.

MySQL group replication uses the Paxos distributed algorithm to coordinate the nodes. Because of this, it requires a majority of the group's members to be online in order to reach quorum and agree on a decision.

A majority means N/2+1, where N is the current total number of members. For example, a group of 5 members needs 3 of them to form a majority. The table below lists how many failures each group size can tolerate, and a quick quorum check follows it.

Group size | Majority | Tolerated failures
1          | 1        | 0
2          | 2        | 0
3          | 2        | 1
4          | 3        | 1
5          | 3        | 2
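As a quick check, the current group size and the number of ONLINE members can be read from performance_schema. A minimal sketch (the majority arithmetic is mine, not something the plugin exposes):

-- Group size counts every listed member, including ERROR/UNREACHABLE ones.
SELECT COUNT(*)                     AS group_size,
       FLOOR(COUNT(*) / 2) + 1      AS majority_needed,
       SUM(MEMBER_STATE = 'ONLINE') AS online_members
FROM performance_schema.replication_group_members;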

2.3 Configuration notes

[mysqld]
datadir=/data
socket=/data/mysql.sock
server-id=100                             # required
gtid_mode=on                              # required
enforce_gtid_consistency=on               # required
log-bin=/data/master-bin                  # required
binlog_format=row                         # required
binlog_checksum=none                      # required
master_info_repository=TABLE              # required
relay_log_info_repository=TABLE           # required
relay_log=/data/relay-log                 # required; if omitted, the default value is used
log_slave_updates=ON                      # required
sync-binlog=1                             # recommended
log-error=/data/error.log
pid-file=/data/mysqld.pid
transaction_write_set_extraction=XXHASH64 # required
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" # required
loose-group_replication_start_on_boot=off # recommended: OFF
loose-group_replication_member_weight = 40 # optional; supported since MySQL 5.7.20
loose-group_replication_local_address = "192.110.103.41:31061" # required, as is the next line
loose-group_replication_group_seeds = "192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = FALSE # = multi-primary
loose-group_replication_enforce_update_everywhere_checks=ON # = multi-primary

A quick walk through these options:

1) Group replication is based on GTIDs, so gtid_mode and enforce_gtid_consistency must be enabled.

2) log_bin and binlog_format must be set so that information can be collected from the binary log and data consistency can be guaranteed.

3) Because of a design limitation in how replication events are checksummed, group replication cannot verify them, so binlog_checksum=none is required.

4) Group replication writes the master and relay log metadata into the mysql.slave_master_info and mysql.slave_relay_log_info tables.

5) Every node keeps a full copy of the data (a shared-nothing model), so log_slave_updates must be enabled on all nodes; that way a new node can pick any member as its donor for asynchronous recovery.

6) sync_binlog=1 flushes the binlog to disk on every transaction commit, so no log is lost on a crash.

7) The last lines are the group replication plugin's options. The loose- prefix means MySQL will still start normally even if the group replication plugin has not been loaded; the prefix is optional.

8) transaction_write_set_extraction=XXHASH64 hashes write sets with the XXHASH64 algorithm. A write set uniquely identifies the rows modified by a transaction and is later used to detect whether concurrent transactions modified the same rows. It is generated from primary keys, so every table used with group replication must have a primary key (see the sketch after this list).

9) group_replication_group_name names the replication group. It must be a valid UUID; on Linux, the uuidgen tool can generate one.
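Because write sets are derived from primary keys, DML on a table without a primary key is rejected on an MGR member. A minimal illustration (the table name is made up; the exact error text varies by version):

CREATE TABLE t_no_pk (c1 INT);   -- demo table with no primary key
INSERT INTO t_no_pk VALUES (1);
-- Fails with an error similar to:
-- ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.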

3. Managing group replication

3.1 Views and view changes

MGR manages membership on the basis of group views (views for short). A view is the set of member states over a period of time during which no member joins or leaves; whenever a member does join or leave, the view changes. MGR uses a view ID (View ID) to track view changes and order views in time.

select * from performance_schema.replication_group_member_stats\G

*************************** 1. row ***************************

CHANNEL_NAME: group_replication_applier

VIEW_ID: 15718289704352993:39

MEMBER_ID: 509810ee-f3d7-11e9-a7d5-a0369fac2de4

COUNT_TRANSACTIONS_IN_QUEUE: 0

COUNT_TRANSACTIONS_CHECKED: 10

COUNT_CONFLICTS_DETECTED: 0

COUNT_TRANSACTIONS_ROWS_VALIDATING: 0

TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-60:1000003-1000006:2000003-2000006

LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:58


A group's ViewID consists of a timestamp prefix plus a sequence number.

Prefix: generated when the group is bootstrapped, as the timestamp at that moment; it does not change for the lifetime of the group.

Sequence number: the first view at bootstrap has sequence number 1, and it is incremented by one every time a member joins or leaves.

# viewid: 15718289704352993:39
# note: %H is the 24-hour clock and %i is minutes ('%h:%m:%s' would print the 12-hour clock and the month)
select from_unixtime(FLOOR(15718289704352993/10000000),'%Y-%m-%d %H:%i:%s');
+----------------------------------------------------------------------+
| from_unixtime(FLOOR(15718289704352993/10000000),'%Y-%m-%d %H:%i:%s') |
+----------------------------------------------------------------------+
| 2019-10-23 19:09:30                                                  |
+----------------------------------------------------------------------+

3.2 How a node joins

The process below follows the MGR diagram from NetEase Cloud (not reproduced here):

When a node asks to join a group, it first opens a TCP connection to one of the group's seed members configured in group_replication_group_seeds (Gcs_xcom_control::do_join()). The seed member checks its group_replication_ip_whitelist to decide whether the new node is allowed to join; by default MGR does not restrict joining IPs (a whitelist example follows). Once the connection is established, the new node sends its join request. On receiving it, the seed member broadcasts a view-change message to every node in the group, including the joining node, and each node then starts a view switch. Every node broadcasts a state-exchange message carrying its current state and information, then consumes the exchange messages broadcast by the others and merges the node information they carry into its local member list.
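If joins should be restricted, the whitelist can be set before group replication is started. A minimal sketch, assuming the 192.110.103.0/24 network used in this article:

-- Allow joins only from the local subnet and loopback (adjust to your network).
SET GLOBAL group_replication_ip_whitelist = '192.110.103.0/24,127.0.0.1/8';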

Completing the view switch is only the first step of joining the group. It merely means the member can receive messages that the group has agreed on through the Paxos protocol; it does not mean the member can be set ONLINE and serve traffic. The new member must still synchronize data and build a correct data version (recovery_module->start_recovery) before it can apply Paxos protocol messages and start serving normal user access. Its progress can be watched as shown below.
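A minimal sketch for watching a joining member catch up, using only performance_schema tables already shown in this article:

-- MEMBER_STATE moves from RECOVERING to ONLINE once the joiner has caught up.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;

-- Transactions still waiting to be certified/applied on this member.
SELECT MEMBER_ID, COUNT_TRANSACTIONS_IN_QUEUE
FROM performance_schema.replication_group_member_stats;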

3.3 Group replication management statements

Commonly used statements for managing group replication:

SELECT * FROM performance_schema.replication_group_members;

# find the current primary (master)
SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;

SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';

SHOW STATUS LIKE 'group_replication_primary_member';

start group_replication;

stop group_replication;

To stop group replication on one member, run the stop group_replication statement on that node.

Be careful, though: before running it, make sure the node is no longer serving MySQL traffic, otherwise new data may still be written to it (for example while the primary is being stopped) or clients may read stale data from it.

To restart the entire group safely, the best approach is to first stop the MySQL instances of all non-primary nodes (not just their group replication), then stop the primary's MySQL instance. Start in the reverse order: bring the primary up first, bootstrap the group on it and start its group replication, then join the remaining nodes to the group (a sketch of this sequence follows).
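A minimal sketch of that bootstrap sequence, using only statements already shown in this article:

-- On the first (bootstrapping) node only:
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

-- On each remaining node, once the first node is ONLINE:
START GROUP_REPLICATION;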

In group replication, a member can leave the group in two ways: voluntarily or involuntarily.

Voluntary leave:

Running the stop group_replication; statement.

1) Executing this statement means the node leaves voluntarily. It triggers an automatic view reconfiguration, and the view change is replicated to all members; the node only leaves once a majority of members agree on the new view.

2) A voluntary leave does not cost any quorum votes, so no matter how many nodes leave voluntarily, the group will never block for failing to reach a majority.

3) Example: in a 5-node group, after node A leaves voluntarily the group size becomes 4; the group behaves as if node A had never existed.

Involuntary leave:

Every other way of leaving is involuntary: a node crashes, loses its network connection, and so on.

1) When a node leaves involuntarily, the failure-detection mechanism notices it and reports it to the group, which triggers an automatic view reconfiguration; a majority of members must agree on the new view.

2) The group size does not change. For example, a 5-node group is still size 5 after an involuntary departure; the departed members are simply no longer marked ONLINE.

3) Involuntary departures do cost quorum votes, so if too many nodes leave involuntarily, the remaining nodes cannot reach a majority and the group blocks.

4) Example: in a 5-node group, after node A leaves involuntarily the group size is still 5, but A is marked unreachable (or another non-ONLINE state) in the new view. If 2 more nodes then leave involuntarily, only 2 ONLINE nodes remain, a majority can no longer be reached, and the group blocks (see the unblocking sketch after this list).
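A blocked group can be unblocked by forcing a new membership made up of the members that are still reachable. This is a last-resort operation; the sketch below assumes the addresses used in this article and that the excluded members really are down:

-- Run on one of the reachable members of the blocked group.
SET GLOBAL group_replication_force_members = '192.110.103.41:31061,192.110.103.43:31061';
-- Clear it once the group is working again.
SET GLOBAL group_replication_force_members = '';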

4. MGR failure recovery

First, a quick look at the replication channels MGR uses. While the group is running normally, i.e. all members are online, nodes exchange data through the Paxos protocol; remote transactions, once certified, are written into the group_replication_applier relay log and replayed by the replication threads of the group_replication_applier channel.

When a node joins the group, a second channel is used: group_replication_recovery, a conventional asynchronous master-slave replication channel. Both channels can be inspected as shown below.
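A minimal sketch for listing the two channels on a member:

SELECT CHANNEL_NAME, SERVICE_STATE
FROM performance_schema.replication_applier_status
WHERE CHANNEL_NAME IN ('group_replication_applier','group_replication_recovery');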

The join process can be split into two phases: recovering the data of previous views, and executing the transactions cached during the current view. The first phase can be further divided into local recovery and global recovery.

4.1 Recovering previous-view data

1) Local recovery

Step 1: initialize failure recovery, including the recovery thread and the group membership information.

Step 2: start the group_replication_applier channel; if the node is not brand new, replay whatever is left in its relay log.

2) Global recovery

Step 3: once the local relay log has been replayed, state transfer begins. Through the group_replication_recovery channel, the node pulls from another online member the data it is missing from previous views. Unlike a traditional master-slave setup, MGR only needs an account and password configured for this channel, in the form: CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery'. The channel runs GTID-based with MASTER_AUTO_POSITION. The recovery thread picks an online member at random as the donor and calls the request_dump function in rpl_slave.cc to establish replication with it, passing along the recovering node's gtid_executed. The donor walks its binlog files in reverse order, using each file's leading Previous-GTIDs to find the first transaction not contained in that gtid_executed, and starts sending data from there.

4.2 Executing cached transactions and finishing recovery

3) Executing cached transactions

Step 4: wake up the certification-queue processing thread that was blocked in step 2, so the node gradually catches up with the rest of the group and can then be set online (ONLINE).

Step 5: once the condition defined by group_replication_recovery_complete_at is met, send a Recovery_message::RECOVERY_END_MESSAGE to notify every member of the group to mark this node as online (see the sketch after these steps).

Step 6: if at any point during recovery an error occurs or recovery cannot continue, the node is first put into ERROR state; once its replication threads have exited, the node sends a leave message and exits the group rather than staying in the group in ERROR state.

Step 7: whether or not an error occurred, the recovery-related parameters are reset and the recovery thread is destroyed.
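group_replication_recovery_complete_at decides when a joining member counts as recovered; a minimal sketch of its two supported values (set before START GROUP_REPLICATION):

-- Consider recovery complete once cached transactions are certified (queued) ...
SET GLOBAL group_replication_recovery_complete_at = 'TRANSACTIONS_CERTIFIED';
-- ... or only once they have also been applied (the default).
SET GLOBAL group_replication_recovery_complete_at = 'TRANSACTIONS_APPLIED';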

5. MGR node management (adding/removing nodes)

Everything above is MGR theory; if you are short on time you can skip it and follow the hands-on steps from this section on.

5.1 Removing a node from the group

Removing a node is simple: just run the stop group_replication statement on it.

select * from performance_schema.replication_group_member_stats\G

select * from performance_schema.replication_group_members;

-- stop
stop group_replication;

select * from performance_schema.replication_group_members;

5.2 Adding a node to the MGR cluster

5.2.1 A former MGR node that was briefly down rejoins

If an existing MGR node was stopped for a short while and the service is back, it is usually enough to run start group_replication, as follows:

show global variables like '%seed%';

show global variables like '%group_repli%';

-- start group replication
set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
START GROUP_REPLICATION;

select * from performance_schema.replication_group_members;

5.2.2 Adding a brand-new node

1) Install the MySQL software

Install the MySQL software; see the official documentation for the detailed steps.

# create an empty DB instance directory structure
mkdir -p /data1/mysql_3106/{data,etc,logs,tmp}

vim /data1/mysql_3106/etc/my.cnf
# Group Replication parameters
plugin_load_add='group_replication.so'
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "192.110.103.42:31061"
loose-group_replication_group_seeds = "192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = TRUE # FALSE = multi-primary
loose-group_replication_enforce_update_everywhere_checks= FALSE # ON = multi-primary

2) Back up the master's data

# grant privileges for the backup account (if it does not exist yet)
#set GLOBAL super_read_only=0;
#grant all on *.* to backup@'127.0.0.1' identified by 'backup';

# take the backup
backup_path=/data1/mysqlbackup/`date +%Y%m%d`/mysql_3106
mkdir -p $backup_path
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf -H 127.0.0.1 -P 3106 --user=backup --password=backup --parallel=4 --throttle=4 --no-timestamp --slave-info $backup_path/ >> $backup_path/mybackup.log 2>&1 &

# apply the logs
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf --apply-log $backup_path
cp /data1/mysql_3106/etc/my.cnf $backup_path/

3) Transfer the backup

scp -pr $backup_path 192.110.103.42:/data1/mysql_3106_backup

Or transfer with nc. On the receiving (slave) side:

nc -l 1234 | tar xzvf -   # extracts into the current directory

# on the sending side
cd $backup_path/
tar czvf - mysql_3106 | nc 192.110.103.42 1234

4) Restore the backup on the slave

# create an empty DB instance directory structure (if not done already)
#mkdir -p /data1/mysql_3106/{data,etc,logs,tmp}

Edit the configuration file:

cp /data1/mysql_3106_backup/my.cnf /data1/mysql_3106/etc/my.cnf

vim /data1/mysql_3106/etc/my.cnf   # adjust server_id etc.
server-id=103423106
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "192.110.103.42:31061"
loose-group_replication_group_seeds = "192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = TRUE # FALSE = multi-primary
loose-group_replication_enforce_update_everywhere_checks= FALSE # ON = multi-primary
report_host=192.110.103.42
report_port=3106

Restore the backup data:

# innobackupex [--defaults-file=MY.CNF] --copy-back [--defaults-group=GROUP-NAME] BACKUP-DIR
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf --copy-back /data1/mysql_3106_backup/

chown -R mysql:mysql /data1/mysql_3106

5) Start MySQL

# start MySQL
su - mysql -c "/usr/local/mysql/bin/mysqld_safe --defaults-file=/data1/mysql_3106/etc/my.cnf &"

6) Install the MGR plugin

If the group replication plugin is not installed yet, install it; otherwise skip this step.

# install the MGR plugin
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SHOW PLUGINS;

7) Start MGR on the new node and join the cluster

The core steps for joining the MGR cluster are just the following, run on the new node:

# on the new node
cat xtrabackup_info   # read the GTID value
binlog_pos = filename 'mysql-bin.xx', position 'xx', GTID of the last change 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-65:1000003-1000006:2000003-2000006'

RESET SLAVE ALL;
RESET MASTER;

set global gtid_purged='aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-65:1000003-1000006:2000003-2000006';
START GROUP_REPLICATION;

select * from performance_schema.replication_group_members;

The detailed steps for joining the new node to the MGR cluster follow.

1. Create the replication account on the new slave (optional)

If the new slave was built from an xtrabackup backup of the MGR cluster, the replication account already exists and this step can be skipped.

SET SQL_LOG_BIN=0;

CREATE USER repl@'192.%.%.%' IDENTIFIED WITH 'mysql_native_password' BY 'repl';

GRANT SUPER,REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO repl@'192.%.%.%';

FLUSH PRIVILEGES;

SET SQL_LOG_BIN=1;


2. Find the last GTID executed on the master at backup time

xtrabackup backs up the data set but not the binlogs, so you must first find the last transaction IDs contained in the backup and explicitly tell the slave to skip them; otherwise the slave would fetch those binlogs from the master again and re-execute them, applying the same data twice.

The value can be read from the xtrabackup_info file in the backup directory. If the backup was not taken with xtrabackup, you can use the master's show global variables like "gtid_executed" output from just before the backup; that is the set of transactions the master has already executed — note that it is not the gtid_purged value. On the slave, gtid_purged can only be set while its own gtid_executed is empty, which is why the steps below run RESET MASTER first (see the sketch below).
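A minimal sketch of what happens if gtid_executed is not empty when you try to set gtid_purged (the GTID string is just an example value):

-- With a non-empty gtid_executed this fails with an error similar to:
-- ERROR 1840 (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
SET GLOBAL gtid_purged = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54';

RESET MASTER;   -- clears gtid_executed and gtid_purged; after this the SET succeeds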

View xtrabackup_info:

cat xtrabackup_info

uuid = b30fde6b-223e-11ea-bbbe-a0369fac2de4

tool_command = --defaults-file=/data1/mysql_3106/etc/my.cnf -H 127.0.0.1 -P 3106 --user=backup --password=... --parallel=4 --throttle=4 --no-timestamp --slave-info /data1/mysqlbackup/20191219/mysql_3106/

ibbackup_version = 2.4.8

server_version = 5.7.23-log

binlog_pos = filename 'mysql-bin.000008', position '1313', GTID of the last change 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54:1000003-1000006:2000003-2000006'


3. Run CHANGE MASTER on the new slave

On the new slave, run reset master and then change master to point at the primary.

-- run on the new slave
show global variables like '%gtid%';
show master status;

RESET SLAVE ALL;
RESET MASTER; -- empties the gtid_executed and gtid_purged variables

show global variables like '%gtid%';

+---------------------------------------------------+---------+

| Variable_name | Value |

+---------------------------------------------------+---------+

| binlog_gtid_simple_recovery | ON |

| enforce_gtid_consistency | ON |

| group_replication_allow_local_disjoint_gtids_join | OFF |

| group_replication_gtid_assignment_block_size | 1000000 |

| gtid_executed | |

| gtid_executed_compression_period | 1000 |

| gtid_mode | ON |

| gtid_owned | |

| gtid_purged | |

| session_track_gtids | OFF |

+---------------------------------------------------+---------+

10 rows in set (0.01 sec)


Set gtid_purged to the GTID value recorded in xtrabackup_info — this step is critical, otherwise recovery will fail — then run CHANGE MASTER.

-- set gtid_purged to the value from xtrabackup_info; this step is critical
set global gtid_purged='aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54:1000003-1000006:2000003-2000006';

show global variables like '%gtid%';

-- change master
CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='repl' FOR CHANNEL 'group_replication_recovery';

4. Start group replication (single-primary mode)

-- start group replication:

-- check the current state
select * from performance_schema.replication_group_members;
select * from performance_schema.replication_group_member_stats\G
show global variables like 'group_replication%';
show global variables like 'group_replication_single%';

-- configure single-primary mode
# on the first node, the bootstrap flag (group_replication_bootstrap_group=ON) is enabled and group replication is started right away to create the group
SET GLOBAL group_replication_bootstrap_group=OFF; -- ON on the first node, OFF on all others
set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
set global group_replication_allow_local_disjoint_gtids_join=ON;

-- start the group
START GROUP_REPLICATION;

-- check the status
select * from performance_schema.replication_group_members;

Note: do not modify any data on a node before it joins the cluster, otherwise you will hit the error below.

START GROUP_REPLICATION;
[ERROR] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: c3c274ff-c63e-11e7-b339-00163e0c0288:1-4 Group transactions: 2e6bfa69-0439-41c9-add7-795a9acfd499:1-10,
c5898a84-c63e-11e7-bc8b-00163e0af475:1-4'

# workaround:
set global group_replication_allow_local_disjoint_gtids_join=ON; # allow joining despite local transactions not present in the group

5.2.3 Rejoining a node whose data is inconsistent or too far behind

If a new node is added, or a recovering node was down so long that the master's binlogs have already been purged, it cannot recover simply by running START GROUP_REPLICATION. You have to take a backup from the MGR cluster yourself; MGR has no SST/IST concept — catching up is done entirely through GTIDs and binlogs.

Note: for a node whose replication broke because of inconsistent data, the repair procedure can be exactly the same as the brand-new-node procedure in the previous section.

1. Symptoms

192.110.103.42:3106 was down for too long and the needed binlogs in the MGR cluster had been purged, so its state stayed RECOVERING or ERROR and it could not join MGR.

Note: a node that was down asks the surviving cluster whether its missing binlog range can still be served. If it can, the logs are transferred normally and the node catches up; if the logs it needs no longer exist, the node cannot rejoin the cluster. A way to check this gap is sketched below.
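A rough way to check whether a donor can still serve the joiner over binlog, assuming you can read gtid_purged on the donor and gtid_executed on the joiner (the two values below are placeholders):

SET @donor_purged    = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-100';  -- donor's gtid_purged
SET @joiner_executed = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54';   -- joiner's gtid_executed
-- GTIDs the joiner still needs but the donor has already purged:
SELECT GTID_SUBTRACT(@donor_purged, @joiner_executed) AS purged_but_needed;
-- A non-empty result means binlog-based recovery is impossible and the node must be rebuilt from a backup.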

root@192.110.103.42 : (none) > show variables like 'group_replication_group_seeds';

+-------------------------------+-------------------------------------------------------------+

| Variable_name | Value |

+-------------------------------+-------------------------------------------------------------+

| group_replication_group_seeds | 192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061 |

+-------------------------------+-------------------------------------------------------------+

1 row in set (0.00 sec)

root@192.110.103.42 : (none) > start group_replication;

Query OK, 0 rows affected (3.35 sec)

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | RECOVERING |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

3 rows in set (0.00 sec)

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ERROR |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

1 row in set (0.00 sec)


Error log:

The log reports: master has purged binary logs containing GTIDs that the slave requires.

2019-10-24T08:15:11.015032Z 198 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2019-10-24T08:15:11.015060Z 198 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2019-10-24T08:15:11.015076Z 198 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2019-10-24T08:15:11.015131Z 196 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-10-24T08:15:11.015164Z 199 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2019-10-24T08:15:11.015530Z 199 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'FIRST' at position 0
2019-10-24T08:15:11.017239Z 196 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-24T08:15:11.019165Z 196 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'


2. Repair

In this case you could rebuild the slave with xtrabackup, but this section uses mysqldump instead:

1) Install a new DB instance

If the slave can still start normally, this step can be skipped; it is enough to clear its local GTID information and import the mysqldump data.

a) Install a brand-new MySQL instance (omitted), or

b) wipe the current slave's data and reuse it.

2) Build the slave with mysqldump

Take a full data backup with mysqldump:

/usr/local/mysql/bin/mysqldump -h 127.0.0.1 -P3106 --all-databases --default-character-set=utf8 -R -q --triggers --master-data=2 --single-transaction > mysql3106_online.sql ;


Transfer the dump to the slave:

scp -pr mysql3106_online.sql root@192.110.103.42:/data1/

Or transfer with nc:

# on the receiving (slave) side
nc -l 1234 | tar xzvf -   # extracts mysql3106_online.sql into the current directory

# on the sending side
tar czvf - mysql3106_online.sql | nc 192.110.103.42 1234

Clear the local GTID information: stop group replication, disable read-only, and reset the local GTIDs.

-- run on the slave
show variables like 'group_replication_group_seeds';
STOP GROUP_REPLICATION;
RESET SLAVE ALL;
set global super_read_only=0;

show master logs;
reset master;
show master logs;

show global variables like '%gtid%';

Import the backup data:

# run on the slave
mysql -h 127.0.0.1 -P3106 -uroot -pxxx --default-character-set=utf8 < mysql3106_online.sql

3) Install the MGR plugin

Install the MGR plugin (if not installed yet):

INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SHOW PLUGINS;

4) Start MGR and recover:

show global variables like '%gtid%';

set global group_replication_enforce_update_everywhere_checks=OFF;

set global group_replication_single_primary_mode=ON;

START GROUP_REPLICATION;


5) Result

Because the slave was freshly rebuilt, it can catch up with the group normally, so START GROUP_REPLICATION succeeds and the node joins MGR.

START GROUP_REPLICATION;

select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ONLINE |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+


6. Failover testing

6.1 Current state

The group is currently running in single-primary mode:

select * from performance_schema.global_variables where VARIABLE_NAME

in ('group_replication_single_primary_mode','group_replication_enforce_update_everywhere_checks');

+----------------------------------------------------+----------------+

| VARIABLE_NAME | VARIABLE_VALUE |

+----------------------------------------------------+----------------+

| group_replication_enforce_update_everywhere_checks | OFF |

| group_replication_single_primary_mode | ON |

+----------------------------------------------------+----------------+

2 rows in set (0.00 sec)

select * from performance_schema.global_variables where VARIABLE_NAME like '%read_only';

+-----------------------+----------------+

| VARIABLE_NAME | VARIABLE_VALUE |

+-----------------------+----------------+

| innodb_read_only | OFF |

| read_only | OFF |

| super_read_only | OFF |

+-----------------------+----------------+

> SELECT * FROM performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ONLINE |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+


Current master:

Only single-primary group replication needs to look up the primary node; the multi-primary model has no master/slave concept, so there is nothing to look up.

SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';

+----------------------------------+--------------------------------------+

| VARIABLE_NAME | VARIABLE_VALUE |

+----------------------------------+--------------------------------------+

| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 |

+----------------------------------+--------------------------------------+

# find the master

SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| VARIABLE_NAME | VARIABLE_VALUE | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

1 row in set (0.00 sec)


6.2 Primary failure

1) Graceful shutdown of the master

Stop the master with the kill command (a normal TERM signal).

# kill mysql on the master machine 192.110.103.41

$ps aux|grep 3106

mysql 122456 0.0 0.0 106252 1444 ? S Oct23 0:00 /bin/sh /usr/local/mysql-5.7.23/bin/mysqld_safe --defaults-file=/data1/mysql_3106/etc/my.cnf

mysql 123471 0.1 0.8 11575792 1069584 ? Sl Oct23 1:45 /usr/local/mysql-5.7.23/bin/mysqld --defaults-file=/data1/mysql_3106/etc/my.cnf --basedir=/usr/local/mysql-5.7.23 --datadir=/data1/mysql_3106/data --plugin-dir=/usr/local/mysql-5.7.23/lib/plugin --log-error=/data1/mysql_3106/logs/mysqld.err --open-files-limit=8192 --pid-file=/data1/mysql_3106/tmp/mysql.pid --socket=/data1/mysql_3106/tmp/mysql.sock --port=3106

$kill 122456 123471 ; tail -f /data1/mysql_3106/logs/mysqld.err


Logs:

Log on the old master:

$tail -f /data1/mysql_3106/logs/mysqld.err

2019-10-24T03:10:32.746843Z 0 [Warning] /usr/local/mysql-5.7.23/bin/mysqld: Forcing close of thread 31 user: 'root'

2019-10-24T03:10:32.746873Z 0 [Note] Plugin group_replication reported: 'Plugin 'group_replication' is stopping.'

2019-10-24T03:10:32.746901Z 0 [Note] Plugin group_replication reported: 'Going to wait for view modification'

2019-10-24T03:10:35.797258Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

2019-10-24T03:10:40.799923Z 0 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'

2019-10-24T03:10:40.799954Z 0 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'

2019-10-24T03:10:40.800110Z 7 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed

2019-10-24T03:10:40.800431Z 7 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 65

2019-10-24T03:10:40.800652Z 4 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'

2019-10-24T03:10:40.800787Z 0 [Note] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'

2019-10-24T03:10:40.800799Z 0 [Note] Event Scheduler: Purging the queue. 0 events

2019-10-24T03:10:40.801278Z 0 [Note] Binlog end

2019-10-24T03:10:40.802272Z 0 [Note] Shutting down plugin 'group_replication'

2019-10-24T03:10:40.802322Z 0 [Note] Plugin group_replication reported: 'All Group Replication server observers have been successfully unregistered'

...

2019-10-24T03:10:42.804477Z 0 [Note] Shutting down plugin 'binlog'

2019-10-24T03:10:42.805238Z 0 [Note] /usr/local/mysql-5.7.23/bin/mysqld: Shutdown complete

2019-10-24T03:10:42.814933Z mysqld_safe mysqld from pid file /data1/mysql_3106/tmp/mysql.pid ended


Log on the new master:

$tail -n 30 /data1/mysql_3106/logs/mysqld.err

2019-10-23T11:11:00.671705Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'

2019-10-23T11:11:00.671736Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 31061'

2019-10-23T11:11:05.400823Z 2 [Note] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.110.103.41:3106.'

2019-10-23T11:11:05.401138Z 20 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'

2019-10-23T11:11:05.401143Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106 on view 15718289704352993:2.'

2019-10-23T11:11:05.402757Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''.

2019-10-23T11:11:05.404717Z 20 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 509810ee-f3d7-11e9-a7d5-a0369fac2de4 at 192.110.103.41 port: 3106.'

2019-10-23T11:11:05.404998Z 22 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

2019-10-23T11:11:05.406423Z 22 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'repl@192.110.103.41:3106',replication started in log 'FIRST' at position 4

2019-10-23T11:11:05.442349Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './mysql-relay-bin-group_replication_recovery.000001' position: 4

2019-10-23T11:11:05.461483Z 20 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'

2019-10-23T11:11:05.461910Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'mysql-bin.000002' at position 934

2019-10-23T11:11:05.462119Z 22 [Note] Slave I/O thread killed while reading event for channel 'group_replication_recovery'

2019-10-23T11:11:05.462143Z 22 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'mysql-bin.000002', position 934

2019-10-23T11:11:05.523357Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.

2019-10-23T11:11:05.526137Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'

2019-10-23T11:15:33.426684Z 0 [Note] Plugin group_replication reported: 'Members joined the group: 192.110.103.43:3106'

2019-10-23T11:15:33.426832Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:3.'

2019-10-23T11:15:34.094942Z 0 [Note] Plugin group_replication reported: 'The member with address 192.110.103.43:3106 was declared online within the replication group'

2019-10-24T03:10:32.839967Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.41:3106'

2019-10-24T03:10:32.839985Z 0 [Note] Plugin group_replication reported: 'Primary server with address 192.110.103.41:3106 left the group. Electing new Primary.'

2019-10-24T03:10:32.840052Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.42:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'

2019-10-24T03:10:32.840086Z 41 [Note] Plugin group_replication reported: 'This server is working as primary member.'

2019-10-24T03:10:32.840107Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:4.'

2019-10-24T03:12:01.677869Z 4 [Note] Plugin group_replication reported: 'Primary had applied all relay logs, disabled conflict detection'


Automatic switchover:

As the output below shows, the cluster completed the automatic switchover cleanly.

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ONLINE |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

2 rows in set (0.00 sec)

root@192.110.103.42 : (none) > SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| VARIABLE_NAME | VARIABLE_VALUE | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_primary_member | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ONLINE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

1 row in set (0.00 sec)


The old master recovers and rejoins the cluster:

Once the old master is back up, run START GROUP_REPLICATION;

> select @@group_replication_bootstrap_group;

+-------------------------------------+

| @@group_replication_bootstrap_group |

+-------------------------------------+

| 0 |

+-------------------------------------+

> START GROUP_REPLICATION;

> select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 | 3106 | ONLINE |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+


2) Abnormal shutdown of the master

Kill the master node 192.110.103.42 with kill -9. The logs show an automatic primary election, with 192.110.103.41 becoming the new primary.

# run on the master node 192.110.103.42
$kill -9 mysql_pid

# log on machine 192.110.103.41

tail -f /data1/mysql_3106/logs/mysqld.err

2019-10-24T06:17:31.473849Z 0 [Warning] Plugin group_replication reported: 'Member with address 192.110.103.42:3106 has become unreachable.'

2019-10-24T06:17:32.479299Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.42:3106'

2019-10-24T06:17:32.479323Z 0 [Note] Plugin group_replication reported: 'Primary server with address 192.110.103.42:3106 left the group. Electing new Primary.'

2019-10-24T06:17:32.479395Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.41:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'

2019-10-24T06:17:32.479439Z 37 [Note] Plugin group_replication reported: 'This server is working as primary member.'

2019-10-24T06:17:32.479465Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.43:3106 on view 15718289704352993:6.'


root@192.110.103.41 : (none) > select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 | 3106 | ONLINE |

+---------------------------+--------------------------------------+---------------+-------------+--------------+

2 rows in set (0.00 sec)

root@192.110.103.41 : (none) > SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| VARIABLE_NAME | VARIABLE_VALUE | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 | 3106 | ONLINE |

+----------------------------------+--------------------------------------+---------------+-------------+--------------+

1 row in set (0.00 sec)

