MySQL5.7基于GTID的主从复制之sql线程报错error_code1062错误代码处理过程

背景

在巡检过程中发现prometheus当前存在一条mysql sql thread down告警,如下图所示。从告警名称看,当前mysql从库已经存在sql线程宕机问题,需要立刻处置。具体处置流程见下方所示
在这里插入图片描述

错误现象: io线程为Yes sql线程为No,同时从库会停止复制进程,不再进行主从同步,只有等到用户自行来处理完成后才会继续组从同步。

MySQL错误代码含义

一些error code代表的错误如下:
1007: 数据库已存在,创建数据库失败
1008: 数据库不存在,删除数据库失败
1050: 数据表已存在,创建数据表失败
1051: 数据表不存在,删除数据表失败
1054: 字段不存在,或程序文件跟数据库有冲突
1060: 字段重复,导致无法插入
1061: 重复键名
1068: 定义了多个主键
1094: 位置线程ID
1146: 数据表缺失,请恢复数据库
1053: 复制过程中主服务器宕机
1062: 主键冲突 Duplicate entry ‘%s’ for key %d

问题排查过程

排查过程:

		1、登录数据库,执行  show slave status\G; 命令

		2、执行后,相关信息如下所示
		mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 192.163.xx.xx
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.002037
          Read_Master_Log_Pos: 473170850
               Relay_Log_File: mysqld-relay-bin.005393
                Relay_Log_Pos: 345276015
        Relay_Master_Log_File: mysql-bin.001811
             Slave_IO_Running: YES
            Slave_SQL_Running: No #正常应该为yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 345275842
              Relay_Log_Space: 124217681926
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. #这个就是错误信息
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2233306
                  Master_UUID: 4edc6608-394c-11eb-a4d2-fa163efd1323
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 240904 07:27:25
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 4edc6608-394c-11eb-a4d2-fa163efd1323:27928-202465884
            Executed_Gtid_Set: 4edc6608-394c-11eb-a4d2-fa163efd1323:1-193750566,
5b42d742-394c-11eb-991e-fa163e6fbaa6:1-77
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

ERROR: 
No query specified
3、对该报错进行翻译并查询报错含义
		Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any
		整体意思如下: 
			协调器已停止,因为工作器中出现错误。
			最近的失败是:工作器1在主日志mysql-bin.001811,end_log_pos345391371中执行事务
			'4edc6608-394c-11eb-a4d2-fa163efd1323: 193750567'失败。
			有关此失败或其他失败的更多详细信息,请参阅错误日志和/或performance_schemareplication_applier_status_by_worker表
4、根据报错信息,我们查看performance_schemareplication_applier_status_by_worker表信息,执行  SELECT * FROM performance_schema.replication_applier_status_by_worker\G;返回如下所示
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750567
    LAST_ERROR_NUMBER: 1062
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371; Could not execute Write_rows event on table yz_work.ucms_dict; Duplicate entry '135' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001811, end_log_pos 345391371
 LAST_ERROR_TIMESTAMP: 2024-09-04 07:27:25
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750565
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750536
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193743275
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)

ERROR: 
No query specified
5、根据select查出来的结果,我们发现了详细的报错信息
		<Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' 
		at master log mysql-bin.001811, end_log_pos 345391371; Could not execute Write_rows event
		 		on table yz_work.ucms_dict; Duplicate entry '135' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001811, end_log_pos
		 		 345391371>
		翻译为中文如下所示: 
			Worker 1在主日志mysql-bin.001811,end_log_pos345391371执行事务
		'4edc6608-394c-11eb-a4d2-fa163efd1323: 193750567'失败;
		无法在表yz_work上执行Write_rows事件ucms_dict;密钥'PRIMARY'重复条目'135',
		Error_code:1062;处理程序错误HA_ERR_FOUND_DUPP_KEY;事件的主日志mysql-bin.001811,end_log_pos345391371

从这条报错信息,我们找到了问题根因,是因为插入重复主键导致从库不工作了,即主键冲突错误。

问题处理过程

第一种解决思路

直接跳过错误执行语句
1、MySQL 传统的主从复制方式使用 master_log_files 和 master_log_pos 两个参数来确定复制位点。
当出现复制错误时,可以设置跳过出错的事务来恢复同步,MySQL 提供了 sql_slave_skip_counter 参数来实现此功能。使用方法如下:
root@(none) >stop slave;
Query OK, 0 rows affected (0.00 sec)

root@(none) >SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N;    #跳过N个事务
Query OK, 0 rows affected (0.00 sec)

root@(none) >start slave;
Query OK, 0 rows affected, 1 warning (0.03 sec)
2、但是因为我们的数据库是基于GTID来做的主从复制,因此上述第一种方法不适用,执行会报错误,如下所示
mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)

mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction
3、基于GTID主从复制1062错误的真正的解决方法如下所示
解决方法: 在发生错误的 GTID 事务号上插入一个空的事务,来实现跳过此复制错误的效果。
解决过程:
	mysql> stop slave;
	mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                 ...
               Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

	mysql> SET GTID_NEXT='4edc6608-394c-11eb-a4d2-fa163efd1323:193750567'; #此处的gtid取自于上述Last_Error中的报错信息
	Query OK, 0 rows affected (0.01 sec)

	mysql> begin; #开启事务
	Query OK, 0 rows affected (0.00 sec)

	mysql> commit; #提交空事务
	Query OK, 0 rows affected (0.01 sec)

	mysql> SET GTID_NEXT="AUTOMATIC";  #重新设置 GTID_NEXT 为自动获取
	Query OK, 0 rows affected (0.00 sec)

	mysql> start slave; #开启事务
	Query OK, 0 rows affected (0.01 sec)

第二种解决思路

找到错误执行语句,修复从库数据
解决思路:
	根据show slave status 中Last_Error中提示的master log和end_log_pos的位置查找该库上缺失的数据

解决方法

0、根据报错信息,我们查看performance_schemareplication_applier_status_by_worker表信息,执行  
SELECT * FROM performance_schema.replication_applier_status_by_worker\G; 
它会返回详细的错误信息,包含库名、表名等信息

1、在主库中查看表结构,确定主键,并确认是否存在对应信息
mysql >
	......
  PRIMARY KEY (`id`) #主键信息
) ENGINE=InnoDB AUTO_INCREMENT=136 DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='字典数据表' |

mysql> select * from ucms_dict where id=135;
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
| id  | dict_name    | dict_code     | dict_enrty   | dict_value | dict_sort | is_enable | created_by | created_time        | updated_by | updated_time        | delete_flag | remark |
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
| 135 | 问题性质     | problemNature | 企业动态     |          6 |         6 |         0 |            | 2024-07-16 18:49:11 |            | 2024-07-16 18:49:11 |           0 | NULL   |
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
1 row in set (0.00 sec)


2、在从库的binlog日志中解析last error报错的master log mysql-bin.001811, end_log_pos
[root@db-2 binlog]# mysqlbinlog --no-defaults -v --base64-output=decode-rows  --stop-position=345391371(取自于show slave status) /export/servers/data/my3306/binlog/mysql-bin.001811(取自于show slave status) | tail -500
# at 345375314
#240716  8:00:02 server id 2233306  end_log_pos 345383128 CRC32 0x225674ce 	Delete_rows: table id 2984
# at 345383128
#240716  8:00:02 server id 2233306  end_log_pos 345391282 CRC32 0x3746f8aa 	Delete_rows: table id 2984
# at 345391282
#240716  8:00:02 server id 2233306  end_log_pos 345399443 CRC32 0x7cb6e901 	Delete_rows: table id 2984
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

3、确定从库二进制日志中记录的插入数据和主库、从库里的数据是不是相同,如果相同可以删除从库的数据来解决,否则可以考虑跳过这个事务来解决,即使用第一种思路解决

4、执行删除操作
mysql> stop slave;
mysql> delete from jaat.test where id=135; #此处的id取决于SELECT * FROM performance_schema.replication_applier_status_by_worker\G中的Duplicate entry '135'
mysql> start slave;
mysql> show slave status\G;

修复结果演示

执行完修复操作后,再次查看从库状态
mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.161.3.223
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.002037
          Read_Master_Log_Pos: 495785235
               Relay_Log_File: mysqld-relay-bin.005396
                Relay_Log_Pos: 224060624
        Relay_Master_Log_File: mysql-bin.001812
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes #正常
  • 7
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值