2、重新复现、详细分析与验证
为了方便问题排查,我们需要按照复现过程重新来一次(再把master_auto_position从1切换为0),这一次,我们每一步昨晚都先检查一些信息,以便找出问题的起源以及找到从根本上杜绝这个问题发生的方法!!
需要检查的信息有:
stop slave;之前先show slave status查看一下复制的大概位置信息
查看并收集stop slave;时、change master to master_auto_position=0时、 start slave;时三个步骤的如下信息:
* 从库查看复制信息:show slave status;
* 从库查看slave_master_info表:select * from mysql.slave_master_info;
* 从库查看slave_relay_log_info表:select * from mysql.slave_relay_log_info;
* 从库查看slave_worker_info表:select * from mysql.slave_worker_info;
* 从库查看replication_applier_status_by_worker表:select * from performance_schema.replication_applier_status_by_worker;
* 从库查看replication_connection_status表:select * from performance_schema.replication_connection_status;
* 从库查看relay log是否被清理,解析relay log,查看所有relay log中当前的起始GTID
* 根据show slave
status输出信息中的Master_Log_File、Read_Master_Log_Pos、Relay_Master_Log_File、Exec_Master_Log_Pos、Retrieved_Gtid_Set、Executed_Gtid_Set值解析主库的binlog,找到对应这些值的事务相关event信息
* 从库解析binlog file,查看show slave status输出中Executed_Gtid_Set值、Executed_Gtid_Set值相比Retrieved_Gtid_Set值缺失的GTID在从库的binlog中是否能找到
为了方便大家理解下文中将要提到的几个步骤中的冗长信息,这里将这些信息整理为三个脑图(这里紧列出关键信息,为了简洁,GTID去掉了UUID部分,只留下事务ID),有兴趣的可以跟着下文向下看详细信息,兴趣不大的可以从这里直接跳转到"解决方法验证"小节
执行stop slave语句时主从的相关详细信息
执行change master to master_auto_position=0语句时主从的相关详细信息
执行start slave语句时主从的相关详细信息
2.1. 重新复现与详细分析
sysbench压力根据步骤需要,可能暂停或者重新执行
2.1.1. stop slave
sysbench执行对主库加压
从库上执行show slave status\G语句查看复制状态
admin@localhost : (none) 12:19:20> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
......
Master_Log_File: mysql-bin.000038
Read_Master_Log_Pos: 19058431
Relay_Log_File: mysql-relay-bin.000006
Relay_Log_Pos: 11609958
Relay_Master_Log_File: mysql-bin.000038
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
......
Exec_Master_Log_Pos: 11609745
......
Seconds_Behind_Master: 61
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
......
Slave_SQL_Running_State: Waiting for slave workers to process their queues
......
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
......
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-764951
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756013
Auto_Position: 1
......
1 row in set (0.00 sec)
待到从库复制有一定延迟时,执行stop slave语句
admin@localhost : (none) 12:21:36> stop slave;
Query OK, 0 rows affected (0.11 sec)
从库执行show slave status语句查看复制状态
admin@localhost : (none) 12:21:43> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State:
......
Master_Log_File: mysql-bin.000038
Read_Master_Log_Pos: 19989790
Relay_Log_File: mysql-relay-bin.000006
Relay_Log_Pos: 12148076
Relay_Master_Log_File: mysql-bin.000038
Slave_IO_Running: No
Slave_SQL_Running: No
......
Exec_Master_Log_Pos: 12147863
......
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
......
Slave_SQL_Running_State:
......
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
......
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
Auto_Position: 1
......
1 row in set (0.00 sec)
现在,把对主库的sysbench进程终止
从库查询slave_master_info表
admin@localhost : (none) 12:21:46> select * from mysql.slave_master_info\G;
*************************** 1. row ***************************
Number_of_lines: 25
Master_log_name: mysql-bin.000038
Master_log_pos: 19989790
Host: 10.10.20.14
User_name: qfsys
User_password: letsg0
Port: 3306
Connect_retry: 60
Enabled_ssl: 0
Ssl_ca:
Ssl_capath:
Ssl_cert:
Ssl_cipher:
Ssl_key:
Ssl_verify_server_cert: 0
Heartbeat: 5
Bind:
Ignored_server_ids: 0
Uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
Retry_count: 86400
Ssl_crl:
Ssl_crlpath:
Enabled_auto_position: 1
Channel_name:
Tls_version:
1 row in set (0.00 sec)
从库查询slave_relay_log_info表
admin@localhost : (none) 12:25:57> select * from mysql.slave_relay_log_info\G;
*************************** 1. row ***************************
Number_of_lines: 7
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
'# 与show slave status输出中的Relay_log_pos值相等'
Relay_log_pos: 12148076
Master_log_name: mysql-bin.000038
Master_log_pos: 12147863
Sql_delay: 0
Number_of_workers: 4
Id: 1
Channel_name:
1 row in set (0.00 sec)
从库查询slave_worker_info表
admin@localhost : (none) 12:26:23> select * from mysql.slave_worker_info\G;
*************************** 1. row ***************************
Id: 1
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
Relay_log_pos: 12146410
Master_log_name: mysql-bin.000038
Master_log_pos: 12146197
Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
Checkpoint_relay_log_pos: 12130583
Checkpoint_master_log_name: mysql-bin.000038
Checkpoint_master_log_pos: 12130370
Checkpoint_seqno: 18
Checkpoint_group_size: 64
Checkpoint_group_bitmap: HD
Channel_name:
*************************** 2. row ***************************
Id: 2
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
Relay_log_pos: 12144744
Master_log_name: mysql-bin.000038
Master_log_pos: 12144531
Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
Checkpoint_relay_log_pos: 12130583
Checkpoint_master_log_name: mysql-bin.000038
Checkpoint_master_log_pos: 12130370
Checkpoint_seqno: 16
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
从库查询replication_applier_status_by_worker表
admin@localhost : (none) 12:27:02> select * from performance_schema.replication_applier_status_by_worker\G;
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756653
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756651
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
CHANNEL_NAME:
WORKER_ID: 3
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756652
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
CHANNEL_NAME:
WORKER_ID: 4
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)
从库查询replication_connection_status表
admin@localhost : (none) 12:28:29> select * from performance_schema.replication_connection_status\G;
*************************** 1. row ***************************
CHANNEL_NAME:
GROUP_NAME:
SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
THREAD_ID: NULL
SERVICE_STATE: OFF
COUNT_RECEIVED_HEARTBEATS: 10030
LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 12:19:36
RECEIVED_TRANSACTION_SET: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
1 row in set (0.00 sec)
查看gtid_executed表与show master
status信息,从下面的结果中可以看到,gtid_executed表中记录的GTID事务号是741923,远小于show master
status的最大GTID事务号756655,说明这个表中记录GTID信息并不是实时的
admin@localhost : (none) 02:01:49> select * from mysql.gtid_executed\G;
*************************** 1. row ***************************
source_uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
interval_start: 1
interval_end: 741923
1 row in set (0.01 sec)
admin@localhost : (none) 02:06:54> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_I