现网一个产品报警主从延时,到备机上检查,果然延时(Seconds_Behind_Master)越来越严重,但是Relay_Log_pos值也是慢慢变大的,说明备机在跟主机数据,但跟得比较慢,特别是有System Lock。分析一下。
一、检查
mysql> show processlist;
+-------+-------------+--------------------+------+-------------+--------+-----------------------------------------------------------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+-------------+--------------------+------+-------------+--------+-----------------------------------------------------------------------------+------------------+
| 1 | system user | | NULL | Connect | 171318 | Waiting for master to send event | NULL |
| 2 | system user | | NULL | Connect | 312206 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL |
| 3 | system user | | NULL | Connect | 306 | Waiting for an event from Coordinator | NULL |
| 4 | system user | | NULL | Connect | 10587 | System Lock | NULL |
| 5 | system user | | NULL | Connect | 126 | Waiting for an event from Coordinator | NULL |
| 6 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 13 | repl | 192.168.17.59:61674 | NULL | Binlog Dump | 171281 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL |
| 74870 | root | localhost | NULL | Query | 0 | init | show processlist |
+-------+-------------+--------------------+------+-------------+--------+-----------------------------------------------------------------------------+------------------+
9 rows in set (0.00 sec)
mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin.000057
Read_Master_Log_Pos: 792576413
Relay_Log_File: relay-bin.000170
Relay_Log_Pos: 381505193
Relay_Master_Log_File: mysql-bin.000057
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 792576413
Relay_Log_Space: 561858267
Until_Condition: None
Seconds_Behind_Master: 10627
Master_UUID: 41690a50-296b-11e6-a0f1-525400fd6bba
Master_Info_File: /data/mysql/datanode1/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: System Lock
Master_Retry_Count: 86400
1 row in set (0.00 sec)
二、定位
[root@db02 datanode1]# mysqlbinlog --no-defaults -v relay-bin.000170 --start-position=381505193 --base64-output=decode-rows |more
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 381505193
#161124 14:29:29 server id 53 end_log_pos 796951500 CRC32 0x490aa291 Querythread_id=96281495exec_time=0error_code=0
SET TIMESTAMP=1479968969/*!*/;
SET @@session.pseudo_thread_id=96281495/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
# at 381507211
#161124 14:29:29 server id 53 end_log_pos 796953659 CRC32 0x9c5e6005 Update_rows: table id 11108 flags: STMT_END_F
### UPDATE `db_test`.`tbl_user_log`
### WHERE
### @1=228856
### @2='test_mobi'
### @3='2346150430'
### @4='2016-06-02 14:34:25'
### @5='2016-11-24 14:23:34'
### @6=NULL
### @7=0
### @8=0
### @9=2709
### @10='1.1.1'
### @11='client'
### @12=NULL
### @13='2016-06-24 21:45:22'
### @14=282
### @15='userinfo_sys'
### SET
### @1=228856
### @2='test_mobi'
### @3='2346150430'
### @4='2016-06-02 14:34:25'
### @5='2016-11-24 14:29:29'
### @6=NULL
### @7=0
### @8=0
### @9=2710
### @10='1.1.1'
### @11='client'
### @12=NULL
### @13='2016-06-24 21:45:22'
### @14=282
### @15='userinfo_sys'
# at 381507429
#161124 14:29:29 server id 53 end_log_pos 796953690 CRC32 0x7eec6df7 Xid = 2585781884
COMMIT/*!*/;
三、查看打开的表
mysql> show open tables;
Database Table In_use Name_locked
| db_test | tbl_1 | 0 | 0 |
| db_test | tbl_user_log | 1 | 0 |
| db_test | tbl_2 | 0 | 0 |
3 rows in set (0.00 sec)
四、分析排查
基本可以定位是这个大批量的update导致的备机延时。这个tbl_user_log表是建了Btree索引的,索引的数据也是比较均匀分散的,但是上面没有主键或者唯一索引。该表有300多万数据,通过观察,slave上的update在relay回放时是很慢的,索引并没起作用,类似全表扫描。遂在该表上建立主键,如果备机延时比较久了,可以通过忽略记录binglog的方式在主备上加个主键操作。
主机上:
set session sql_log_bin=0;
alter table tbl_user_log add primary key(id);
set session sql_log_bin=1;
备机上:
set session sql_log_bin=0;
alter table tbl_user_log add primary key(id);
set session sql_log_bin=1;
五、注意总结
通过对该表添加主键的方式后,备机过了几分钟就把落后几个小时的数据给追回来了。后续对上线的表要做好审计,mysql的innodb表一定得要有主键,设置sql_log_bin时不要加global参数。