MGR Troubleshooting

Problem 1

The MGR cluster fails with most members in OFFLINE state, e.g. 2 or 3 of the 3 nodes are OFFLINE, and the cluster is unavailable:

select * from performance_schema.replication_group_members;
+----------------------------+-----------+-------------+-------------+--------------+
| CHANNEL_NAME               | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+----------------------------+-----------+-------------+-------------+--------------+
| group_replication_applier  |           |             |        NULL | OFFLINE      |
+----------------------------+-----------+-------------+-------------+--------------+

Recovery steps:
1. Stop group replication on each of the three MySQL nodes in turn:  stop group_replication;

2. Then start group replication on each node, one after another.
# On the first node, bootstrap the group:
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;

# On the remaining two nodes, start group replication one by one:
CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='XXX' FOR CHANNEL 'group_replication_recovery';
start group_replication;

# Check group replication status
SELECT * FROM performance_schema.replication_group_members;

Problem 2

# An MGR node has dropped out of the cluster

SELECT * FROM performance_schema.replication_group_members;
SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb  WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
stop group_replication;
change master to master_user='repl', master_password='PASSXXX' for channel 'group_replication_recovery';  
start group_replication;
select * from performance_schema.replication_group_members;
If group_replication_unreachable_majority_timeout is set to N seconds, a node only switches to ERROR state and leaves the cluster after roughly N+5 seconds. To reduce the chance of nodes being dropped by transient network jitter, consider increasing this value. On MySQL >= 8.0.16, the group_replication_autorejoin_tries parameter can additionally be used to further improve MGR's fault tolerance and availability. A minimal example follows.
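A minimal sketch of tuning these two parameters (the values below are illustrative only; choose them to match your network):

# Tolerate up to 30 s without a majority before the node goes to ERROR (example value)
set global group_replication_unreachable_majority_timeout = 30;
# MySQL >= 8.0.16 only: let an expelled node try to rejoin automatically (example value)
set global group_replication_autorejoin_tries = 3;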

Problem 3

The error log contains a large number of messages like:  Multi-threaded slave: Coordinator has waited 231081 times hitting slave_pending_jobs_size_max; current event size = 8116

Fix: increase the slave_pending_jobs_size_max parameter.

show global variables like 'slave_pending_jobs_size_max';
set global slave_pending_jobs_size_max = 16777216 * 32;

What slave_pending_jobs_size_max does:

With multi-threaded replication, this is the maximum memory that events pending in the queue may occupy; the default is 16 MB. If memory allows, or replication lag is large, it can be increased. Note that this value must be larger than the master's max_allowed_packet.
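A minimal sketch of that check (the 128 MB target is an arbitrary example; the new value may only take effect after the applier is restarted):

# Confirm the master's max_allowed_packet first; slave_pending_jobs_size_max must stay larger
show global variables like 'max_allowed_packet';
show global variables like 'slave_pending_jobs_size_max';
set global slave_pending_jobs_size_max = 128 * 1024 * 1024;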

Problem 3.1

Problems that can occur when importing a large mysqldump file (>10 GB) into MGR:

1) The import may exceed the transaction size limit group_replication_transaction_size_limit (default ~143 MB), causing the write to fail.

2) The import may make the whole cluster unavailable, because MGR uses the same communication channel for data transfer and heartbeats; a large data transfer can delay heartbeats past their timeout.

Reference: 说MGR - 单独的通信机制 (Data & safety, CSDN blog)
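For the size limit in 1), one workaround for a one-off import is to raise (or disable) the limit temporarily; this is a sketch only, and the restore value assumes the 8.0 default of 150000000 bytes:

show global variables like 'group_replication_transaction_size_limit';
set global group_replication_transaction_size_limit = 0;           # 0 = no limit, for the duration of the import only
# ... run the import ...
set global group_replication_transaction_size_limit = 150000000;   # restore the default afterwards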

Problem 4
[ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
Fix: this is a network problem; once the network is stable again, simply rejoin the node to the group, as shown below.
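A minimal sketch of rejoining the expelled node once connectivity is restored:

# On the expelled node
stop group_replication;
start group_replication;
select * from performance_schema.replication_group_members;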

Problem 5
 [ERROR] Failed to open the relay log './localhost-relay-bin.000011' (relay_log_pos ).
 [ERROR] Could not find target log file mentioned in relay log info in the index file './work_NAT_1-relay-bin.index' during relay log initialization.
 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
 [ERROR] Failed to open the relay log './localhost-relay-bin-group_replication_recovery.000001' (relay_log_pos      ).
 [ERROR] Could not find target log file mentioned in relay log info in the index file './work_NAT_1-relay-bin-group_replication_recovery.index' during relay log initialization.
 [ERROR] Slave: Failed to initialize the master info structure for channel 'group_replication_recovery'; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
 [ERROR] Failed to create or recover replication info repositories.
 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 
 [ERROR] /usr/local/mysql/bin/mysqld: Slave failed to initialize relay log info structure from the repository
 [ERROR] Failed to start slave threads for channel ''
Fix: the replica can no longer locate its relay logs; run RESET SLAVE on the node and set the channel up again, as sketched below.
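A minimal sketch of the recovery, assuming the replication user is 'repl' (the password is a placeholder):

stop group_replication;
reset slave all;    # discards the broken relay-log metadata and all channel settings
change master to master_user='repl', master_password='XXX' for channel 'group_replication_recovery';
start group_replication;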

Problem 6

the master has purged binary logs containing GTIDs that the slave requires
Scenario: the replica has been out of the group for so long that it can no longer catch up from the binlogs that still exist.
Fix: rebuild the node's data (with mysqldump or xtrabackup), then rejoin it to the group; a sketch follows.
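A sketch of the mysqldump-based rebuild, assuming the dump was taken from a healthy member with --set-gtid-purged=ON; the file path and credentials are placeholders:

# On the node being rebuilt
stop group_replication;
reset master;        # clear the local GTID state before importing
# 'source' is a mysql client command; the dump restores gtid_purged itself
source /backup/full_dump.sql
change master to master_user='repl', master_password='XXX' for channel 'group_replication_recovery';
start group_replication;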

Recovery when the primary node crashes

Case 1: the primary's data is intact, and the other members still have all the binlogs written while it was down.
Restart the original primary and start group replication again:
START GROUP_REPLICATION;

Case 2: the primary's data is lost.
Restore the crashed primary from a backup of one of the surviving members, then restart group replication.

Problem 7
When starting MGR on another node and running the command start group_replication;,
the following error is reported:
[ERROR] Plugin group_replication reported: 'There was an error when trying to access the server with user: mysql.session. Make sure the user is present in the server and that mysql_upgrade was run after a server update.'
Fix: run mysql_upgrade against the instance:
/usr/local/mysql/bin/mysql_upgrade    -S /data/mysql3308/run/mysql.sock -udbadmin -p

Problem 8
When starting MGR on another node and running start group_replication;, the following error is reported:
[ERROR] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33081'
Fix: configure the IP whitelist in my.cnf:
loose-group_replication_ip_whitelist="10.xx.0.0/16,127.0.0.1/8"

2022-09-01T08::31.088437Z  [Warning] Plugin group_replication reported: '[GCS] Connection attempt from IP address 192.168.9.208 refused. Address is not in the IP whitelist.'
Fix: add the refused client's network to group_replication_ip_whitelist, as shown below.
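The whitelist can also be adjusted online, but group replication must be stopped on that node first, since the variable cannot be changed while it is running. The networks below are examples only:

stop group_replication;
set global group_replication_ip_whitelist = '10.xx.0.0/16,192.168.9.0/24,127.0.0.1/8';
start group_replication;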

Problem 9
START GROUP_REPLICATION;

ERROR  (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
# Check the error log
2022-09-01T07::30.233937Z  [Warning] Plugin group_replication reported: 'Group Replication requires slave-preserve-commit-order to be set to ON when using more than 1 applier threads.'
Fix:  set global slave_preserve_commit_order=1;
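To keep the setting across restarts, the usual MGR applier settings can go in my.cnf on every node; a sketch (the worker count is an example value):

slave_parallel_workers = 4
slave_parallel_type = LOGICAL_CLOCK
slave_preserve_commit_order = ON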

Problem 10
2022-09-01T08::44.087492Z  [Warning] Plugin group_replication reported: 'read failed'
2022-09-01T08::44.096171Z  [ERROR] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 24801'
2022-09-01T08::14.065775Z  [ERROR] Plugin group_replication reported: 'Timeout on wait for view after joining group'
Fix:
Set group_replication_group_seeds in my.cnf to contain only the IP addresses and internal communication ports of the other group members, excluding this node itself. If it lists the IP addresses of all group members (including the local node), this error appears; this differs slightly from the MGR deployment approach used when all members are on the same subnet. A sketch follows.
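A sketch of the resulting my.cnf fragment for one node; the addresses and port are placeholders (here the local node is 192.168.9.208 and the other two members are .209 and .210):

loose-group_replication_local_address = "192.168.9.208:33061"
# the seed list names the OTHER members only, not this node itself
loose-group_replication_group_seeds = "192.168.9.209:33061,192.168.9.210:33061"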

Problem 11

[ERROR] Plugin group_replication reported: '[GCS] Error on opening a connection to oceanbase07: on local port: .'
Fix:
The required port had not been opened on the firewall; opening it on the firewall resolves the error.

Problem 12

Analysis of MGR network jitter and of parameters such as group_replication_unreachable_majority_timeout:

https://www.jianshu.com/p/935793aa51b0

Problem 13
[Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[ERROR] Slave I/O for channel 'group_replication_recovery': Master command COM_REGISTER_SLAVE failed: Access denied for user 'rpl_user'@'%' (using password: YES) (Errno: 1045), Error_code: 1597
[ERROR] Slave I/O thread couldn't register on master
[Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position  
Fix: run the following on a group member:
CREATE USER 'rpl_user'@'%' IDENTIFIED BY 'rpl_pass';
GRANT REPLICATION SLAVE ON *.* TO 'rpl_user'@'%';

Problem 14
[ERROR] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: bb874065-c485-11e8-8b52-000c2934472e:1 > Group transactions: 3db33b36-0e51-409f-a61d-c99756e90155:1-11'
[ERROR] Plugin group_replication reported: 'The member contains transactions not present in the group. The member will now exit the group.'
[Note] Plugin group_replication reported: 'To force this member into the group you can use the group_replication_allow_local_disjoint_gtids_join option'
Fix: this member has transactions that the other group members do not have. Either force it into the group with  set global group_replication_allow_local_disjoint_gtids_join=ON;

or run reset master plus set gtid_purged and then start group replication again, as sketched below.
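A sketch of that alternative, using the group GTID set from the error message above as an example value:

stop group_replication;
reset master;        # wipes the extra local GTIDs (make sure they are genuinely disposable)
set global gtid_purged = '3db33b36-0e51-409f-a61d-c99756e90155:1-11';
start group_replication;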

Problem 15
[ERROR] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: bb874065-c485-11e8-8b52-000c2934472e:1 > Group transactions: 3db33b36-0e51-409f-a61d-c99756e90155:1-15'
[Warning] Plugin group_replication reported: 'The member contains transactions not present in the group. It is only allowed to join due to group_replication_allow_local_disjoint_gtids_join option'
[Note] Plugin group_replication reported: 'This server is working as secondary member with primary member address localhost.localdomaion:3306.'
Fix:
group_replication_allow_local_disjoint_gtids_join is already set to ON here. The way to resolve this is to run reset master and then re-create the recovery channel on the primary and the secondary:
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';

Problem 16
[ERROR] Slave I/O for channel 'group_replication_recovery': error connecting to master 'rpl_user@localhost.localdomaion:' - retry-time: 60  retries: 1, Error_code: 2005
[ERROR] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
[ERROR] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
[Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt /'
Fix: in the test environment all three hosts had been given the same hostname; giving each host a unique hostname resolved the problem.
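If renaming the hosts is not an option, an alternative (an assumption, not from the original text) is to give each instance a distinct, DNS-resolvable address via report_host in my.cnf; these variables are read-only, so a restart is required:

# my.cnf on each node, with a unique resolvable name per node (placeholder shown)
report_host = node1.example.com
report_port = 3306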

Problem 17

Before MySQL 5.7.22, MGR carried a risk of data inconsistency. This could happen when a member left the group briefly and rejoined before the group had noticed it was gone. The official documentation describes it as follows:

In this situation, the rejoining member forgets its previous state, but if other members send it messages that are intended for its pre-crash state, this can cause issues including possible data inconsistency.

To counter this possibility, from MySQL 5.7.22, servers are given a unique identifier when they join a group. This enables Group Replication to be aware of the situation where a new incarnation of the same server (with the same address but a new identifier) is trying to join the group while its old incarnation is still listed as a member. The new incarnation is blocked from joining the group until the old incarnation can be removed by a reconfiguration. If Group Replication is stopped and restarted on the server, the member becomes a new incarnation and cannot rejoin until the suspicion times out.


 
