常见主从复制报错处理案例

一、主从复制原理

对于MySQL的主从复制,首先我们需要知道以下几点:

1、开启主从复制的必要条件

1)主库开启log-bin,binlog会记录主库所有变更操作,该日志是主从复制的核心日志。
2)主/从库server-id不一致,server-id是数据库的唯一标识,若主从数据库server-id一致将无法创建主从复制关系

2、主从复制的大致流程

1)当两个数据库构建成为主从复制关系时,从库会启动IO线程和SQL线程;主库启动相关的dump线程。
2)主库发生的所有变更操作都会记录到binlog日志中
3)当主库发生变更时,主库的dump线程会通知从库的IO线程,IO线程根据具体的binlog文件以及位点信息将对应的日志写入到从库的relay log中;
4)从库的SQL线程解析应用relay log日志进行回放,以保证从库数据与主库一致

二、基本排查手段

1、复制同步信息

mysql> show slave status\G

重点关注参数:

Master_Log_File: IO线程读取到的binlog file文件
Read_Master_Log_Pos: IO线程读取到的binlog file文件的位点信息
Relay_Log_File: SQL线程当前正在应用的relay log文件
Relay_Log_Pos: SQL线程当前应用relay log文件的位点信息
Relay_Master_Log_File: SQL线程当前应用记录对应的主库binlog file文件
Exec_Master_Log_Pos: SQL线程当前应用记录对应主库binlog file文件的位点信息
Slave_IO_Running: IO线程状态
Slave_SQL_Running: SQL线程状态
Last_SQL_Errno: 主从复制中断报错编码
Last_SQL_Error: 主从复制报错具体信息
Seconds_Behind_Master: 一定程度上反应了主从复制延迟
Master_UUID: 主库的uuid
Retrieved_Gtid_Set: 当前IO线程读取到的gtid集
Executed_Gtid_Set: 当前SQL线程应用到的gtid集

2、数据库错误日志

mysql> show variables like 'log_error';
+---------------+------------------------------+
| Variable_name | Value                        |
+---------------+------------------------------+
| log_error     | /data/mysql57/logs/error.log |            //在数据库中查看error log路径
+---------------+------------------------------+
1 row in set (0.00 sec)

3、系统日志

# dmesg -T

三、常见报错案例

3.1 Last_Errno - 1032

1、具体报错

root@mysql57 17:17:  [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: binlog.000006
          Read_Master_Log_Pos: 230
               Relay_Log_File: relaylog.000002
                Relay_Log_Pos: 2024
        Relay_Master_Log_File: binlog.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 1905
              Relay_Log_Space: 3518
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table db1.t1; Unknown error 1032, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000005, end_log_pos 2126
             Master_Server_Id: 33061
                  Master_UUID: 1d24c492-83eb-11ea-86cd-000c2913f5b2
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:5-12
            Executed_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:1-11,
346cd00d-8ea9-41b7-8fea-67a6a483470f:1-24,
66c8918d-83eb-11ea-9d7d-000c29242ae2:1-2
                Auto_Position: 1
1 row in set (0.00 sec)

2、问题排查

从Last_SQL_Errno、Last_SQL_Error中我们可以看到,导致主从复制中断是因为无法执行一个UPDATE操作,所以我们首先需要根据主从复制以及报错信息的BINLOG FILE以及POSITION位点信息,找到对应的UPDATE具体语句,并且在数据库中排查具体什么原因导致该语句执行报错。

1)定位问题SQL

根据主从复制报错提供的信息,我们有两种方法去定位具体的慢SQL。方法一是根据Relay_Master_Log_File、Exec_Master_Log_Pos、Last_SQL_Error提供出来的信息,在master的日志目录下通过位点找到对应SQL;方法二是根据Relay_Log_File、Relay_Log_Pos在slave的日志目录下通过位点找到对应SQL。

# 通master寻找问题SQL
# /usr/local/mysql57/bin/mysqlbinlog -vv binlog.000005 --start-position=1905 --stop-position=2126
### UPDATE `db1`.`t1`
### WHERE
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2=NULL /* VARSTRING(80) meta=80 nullable=1 is_null=1 */
###   @3=NULL /* INT meta=0 nullable=1 is_null=1 */
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='cc' /* VARSTRING(80) meta=80 nullable=1 is_null=0 */
###   @3=23 /* INT meta=0 nullable=1 is_null=0 */

2)在复制报错的slave端查看具体表错的表数据,可以发现该update语句更新的记录在slave中根本不存在,从而导致update操作报错。slave若出现该类型相关的报错,一般都是由于主从数据不一致,导致主库binlog传输到从库,从库应用时无法匹配到对应记录而报错。

root@mysql57 23:00:  [db1]> select * from `db1`.`t1`;
+----+------+------+
| id | name | age  |
+----+------+------+
|  2 | aa   |   12 |
|  3 | bb   |   14 |
|  4 | aa   |   12 |
+----+------+------+
3 rows in set (0.00 sec)

3、问题处理

对于主从数据不一致而产生的复制中断,我们可以根据具体报错场景判断是否可以直接跳过报错还是跳过报错后仍然需要手动订正相关记录。本次复制报错中我们需要做的就是跳过报错,手动订正该记录,保证主从数据一致性。

mysql> stop slave;
mysql> set gtid_next = '1d24c492-83eb-11ea-86cd-000c2913f5b2:12';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';

# 重新开启主从同步并检查复制状态
mysql> start slave;
mysql> show slave status\G


# 订正数据
mysql> set sql_log_bin=0;
mysql> insert into t1 values(1,'cc',23);
mysql> set sql_log_bin=1;

3.2 Last_Errno - 1396

1、具体报错

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: mysql-bin.000055
          Read_Master_Log_Pos: 541873929
               Relay_Log_File: relaylog.000177
                Relay_Log_Pos: 719899849
        Relay_Master_Log_File: mysql-bin.000053
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 719899639
              Relay_Log_Space: 45383924486
               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation CREATE USER failed for 'brc_acrm'@'%'' on query. Default database: ''. Query: 'CREATE USER 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7''
                  Master_UUID: 42a444a1-c303-11e8-952b-005056901fca
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-177094
            Executed_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-172612,
d8ea4112-c303-11e8-952f-00505690f262:1-12
                Auto_Position: 1
1 row in set (0.00 sec)

2、问题排查

从Last_SQL_Errno以及Last_SQL_Error中可以看到具体执行报错操作是一个创建用户的相关动作,根据该报错我们推断去查看mysql.user表中是否可以排查到一些有用的线索。

从以下SQL结果中我们可以看到,主库binlog传输过来一个创建brc_acrm@%用户的操作,但是从库此时已经存在brc_acrm@%用户的信息,所以导致该create user操作执行报错,主从复制中断原因基本定位

mysql> select user,host from mysql.user where user='brc_acrm';
+----------+-----------+
| user     | host      |
+----------+-----------+
| brc_acrm | %         |
+----------+-----------+
1 rows in set (0.00 sec)

3、问题处理

1)对比主从数据库下该用户的授权情况,从授权来看目前主从库是完全一致的,猜测可能是开发同学通过超级账号手动在主/从库创建用户导致。

主:
mysql> show grants for 'brc_acrm'@'%';
+---------------------------------------------------------------------------------------------------------+
| Grants for brc_acrm@%                                                                                   |
+---------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' |
+---------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)

从:
mysql> show grants for 'brc_acrm'@'%';
+---------------------------------------------------------------------------------------------------------+
| Grants for brc_acrm@%                                                                                   |
+---------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' |
+---------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

2)针对以上情况,我们可以选择跳过当前GTID事务来恢复主从复制,具体操作如下:

# 关闭主从同步
mysql> stop slave;

# 跳过错误
mysql> set gtid_next = '42a444a1-c303-11e8-952b-005056901fca:172613';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';

# 重新开启主从同步并检查复制状态
mysql> start slave;
mysql> show slave status\G

3.3 Last_Errno - 1418

1、具体报错

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: mysql-bin.000068
          Read_Master_Log_Pos: 230
               Relay_Log_File: relay-log.000178
                Relay_Log_Pos: 16477419
        Relay_Master_Log_File: mysql-bin.000064
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 240368941
              Relay_Log_Space: 18473673            
               Last_SQL_Errno: 1418
               Last_SQL_Error: Error 'This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)' on query. Default database: 'electric'. Query: 'CREATE DEFINER=`ydan-user`@`%` FUNCTION `GetAccess`(id varchar(50)) RETURNS varchar(50) CHARSET utf8
BEGIN
	declare num int;
  declare result varchar(10);
  select count(*) into num from ou_userrole where userid=id and ROLEID like 'leader%'  ;
  if num>0 then set result='access';
	else set result ='no';
  end if;
  return result;
END'
             Master_Server_Id: 85857738
                  Master_UUID: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-1982
            Executed_Gtid_Set: d3bf96b2-7e24-11ea-8fb1-fa38f0896800:1-10708449,
d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-688
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2、问题排查

从Last_SQL_Errno、Last_SQL_Error中可以看到,本次主从同步中断具体报错的是一个函数创建的报错。由于master与slave都是开启了log-bin参数的,在该情况下master、slave中的函数对数据一致性存在一定的隐患,所以在不明确函数是否会影响数据一致性的情况下,MySQL默认会阻止这类操作。

mysql> show variables like 'log_bin_trust_function_creators';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| log_bin_trust_function_creators | OFF   |         //该参数默认为OFF,表示开启log-bin的slave阻止创建未明确指定类型的函数
+---------------------------------+-------+
1 row in set (0.01 sec)

# 函数创建指定类型有
DETERMINISTIC       //不确定是否会影响数据一致性,
NO SQL              //没有SQl语句,保证不影响数据一致性
READS SQL DATA      //只是读取数据,保证不影响数据一致性

3、问题处理

mysql> stop slave;
mysql> set sql_log_bin=0;
mysql> set global log_bin_trust_function_creators=TRUE;
mysql> 手动执行对应的存储过程
mysql> set global log_bin_trust_function_creators=0;  
mysql> set sql_log_bin=1;

3.4 Last_Errno - 1594

1、具体报错

1) MySQL 复制信息

mysql> show slave status\G
*************************** 1. row ***************************
              Master_Log_File: mysql-bin.000400
          Read_Master_Log_Pos: 363471322
               Relay_Log_File: relay-log.000236
                Relay_Log_Pos: 197156629
        Relay_Master_Log_File: mysql-bin.000397
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 197156408
              Relay_Log_Space: 1936343145
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 19670500
                  Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:398658257-399871744:399871746-496665236,
5e07d619-d6eb-11e9-beff-0242e77c7e67:145857044-146322581
            Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397,
5d4178f0-d6eb-11e9-bf91-0242f977238f:1-491760413,
5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146297997,
e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2)error log

2020-07-27T02:56:05.036042Z 2 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 100, event_type: 31
2020-07-27T02:56:05.036069Z 2 [ERROR] Error reading relay log event for channel '': slave SQL thread aborted because of I/O error
2020-07-27T02:56:05.036096Z 2 [ERROR] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (y
ou can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in
 the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error
_code: 1594
2020-07-27T02:56:05.036109Z 2 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000397' 
position 197156408.
2020-07-29T15:43:29.114678Z 74970 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using 
the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-07-29T15:43:29.686895Z 74970 [ERROR] mysqld: Binary logging not possible. Message: Either disk is full or file system is read only while rotating the binlog. Aborting the server.
15:43:29 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

2、问题排查

1)从主从复制信息中我们可以发现复制中断是因为SQL线程无法解析relay log,relay可能由于某些原因导致损坏,对于这种情况我们可以手动通过mysqlbinlog工具及主从复制中具体的报错位点信息对relay log进行解析,确认该报错情况。

2)继续查看error log,从error log中我们可以发现本次导致relay log损坏的原因可能是由于I/O

3)一般可能会导致relay log损坏的情况有:数据库重启、网络抖动、空间不足日志无法写入等

3、问题处理

对于relay log无法正常解析的报错,我们需要重新指定主从复制来进行恢复。若主从复制使用的是非GTID模式可以根据error log提示的位点信息重新change master,若使用的是GTID模式,那我们只需要指定MASTER_AUTO_POSITION=1即可,具体操作如下:

mysql> stop slave;
mysql> CHANGE MASTER TO
  MASTER_HOST='172.26.44.1',
  MASTER_USER='rds_repl',
  MASTER_PASSWORD='xxx',
  MASTER_PORT=3044,
  MASTER_AUTO_POSITION=1;
mysql> start slave;

3.5 Last_Errno - 1872

1、具体报错

1) MySQL 主从复制报错

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
              Master_Log_File: mysql-bin.000402
          Read_Master_Log_Pos: 311570312
               Relay_Log_File: relay-log.000124
                Relay_Log_Pos: 311570536
        Relay_Master_Log_File: mysql-bin.000402
             Slave_IO_Running: No
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 311570251
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository
                  Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:448797409-499521040,
5e07d619-d6eb-11e9-beff-0242e77c7e67:146080492-146337012
            Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397,
5d4178f0-d6eb-11e9-bf91-0242f977238f:1-499521040,
5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146337012,
d00323ab-8ce0-11e9-99bc-0242065ece34:1-32,
e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2)error log

2020-07-31T03:51:03.357467Z 0 [ERROR] Failed to open the relay log '/data/mysql/binlog/relay-log.000124' (relay_log_pos 311570536).
2020-07-31T03:51:03.357484Z 0 [ERROR] Could not find target log file mentioned in relay log info in the index file '/data/mysql/binlog/relay-log.index' during relay log initialization.
2020-07-31T03:51:03.409170Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2020-07-31T03:51:03.409211Z 0 [ERROR] Failed to create or recover replication info repositories.
2020-07-31T03:51:03.409247Z 0 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T03:51:03.409259Z 0 [ERROR] mysqld: Slave failed to initialize relay log info structure from the repository
2020-07-31T03:51:03.409267Z 0 [ERROR] Failed to start slave threads for channel ''
2020-07-31T04:55:43.905528Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:55:50.286848Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:57:22.006869Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872

3)系统日志

Jul 31 11:51:31 dbinstance3 dockerd: time="2020-07-31T11:51:31.757931233+08:00" level=info msg="Container 53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681 failed to exit within 10 seconds of signal 15 - using the force"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.110864629+08:00" level=info msg="shim reaped" id=53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681
Jul 31 11:51:34 dbinstance3 dockerd: time="2020-07-31T11:51:34.120638146+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.464111797+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681/shim.sock" debug=false pid=23184

2、问题排查

1)根据Last_SQL_Error以及errorlog日志,我们可以看到主从复制中断是由于无法打开relay-log.000124,并且在后续准备通过relay-log.index重新初始化relay log时初始化失败。

2)根据以上报错,我们进入到数据库的datadir下,检查确认relay-log.index文件,发现该文件的relog编号开始从1开始重新计数,relay-log.000124丢失

3)查看系统日志我们发现一个比较可以的docker失败退出的信息(数据库是通过docker容器启动),基于此报错我们通过docker ps查看该数据库实例的容器启动时间,发现启动时间与故障发生时间刚好吻合。但是由于该docker并没有输出相关日志,所以不方便我们继续排查什么原因导致relay log丢失。

3、问题处理

对于整个relay log丢失,relay log初始化失败的情况,同样我们需要重新指定主从复制来进行恢复。若非GTID模式需要指定具体的binlog文件以及位点信息;若GTID模式指定MASTER_AUTO_POSITION=1即可。

mysql> reset slave;
mysql> CHANGE MASTER TO
  MASTER_HOST='172.26.44.1',
  MASTER_USER='rds_repl',
  MASTER_PASSWORD='xxx',
  MASTER_PORT=3044,
  MASTER_AUTO_POSITION=1;
mysql> start slave;
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值