背景
维护MySQL不可避免要经常使用stop slave语句,然而总有那么些意外,导致了hang,死的莫名奇妙。情况好点,也就等个好几十秒,情况差点直接就死锁了。对于金融级MySQL来说,几十秒的业务受损时间也是相当严重的。刚好,这几天又发生了一起hang,在同事的一次凌晨切换演习中,执行stop slave阻塞了60~80s,恰巧同事也在阻塞恢复前30s左右保留了现场的pstack。于是,又开启了心血来潮的探索。
堆栈分析
- debug日志分析
首先对测试环境中stop slave语句的debug日志进行采集分析,大致内容如下所示:
1223 T@16: | | | | | >stop_slave(THD)
1224 T@16: | | | | | | >Multisource_info::get_num_instances
1225 T@16: | | | | | | <Multisource_info::get_num_instances 254
1226 T@16: | | | | | | >stop_slave(THD, Master_info, bool, bool
...
1234 T@16: | | | | | | | THD::enter_stage: 'Killing slave' /data/mysql-5.7.24/sql/rpl_slave.cc:10200
...
1241 T@16: | | | | | | | >terminate_slave_threads
1242 T@16: | | | | | | | | info: Terminating SQL thread
1243 T@16: | | | | | | | | >terminate_slave_thread
...
1251 T@16: | | | | | | | | info: Flushing relay-log info file.
1252 T@16: | | | | | | | | THD::enter_stage: 'Flushing relay-log info file.' /data/mysql-5.7.24/sql/rpl_slave.cc:1685
...
1276 T@16: | | | | | | | | | | | | | | >open_and_process_table
1277 T@16: | | | | | | | | | | | | | | | tcache: opening table: 'mysql'.'slave_relay_log_info' item: 0x7f1e183cbb70
...
1721 T@16: | | | | | | | | <Relay_log_info::flush_info 2533
1722 T@16: | | | | | | | | info: Terminating IO thread
...
1734 T@16: | | | | | | | | THD::enter_stage: 'Flushing relay log and master info repository.' /data/mysql-5.7.24/sql/rpl_slave.cc:1726
...
1761 T@16: | | | | | | | | | | | | | | >open_and_process_table
1762 T@16: | | | | | | | | | | | | | | | tcache: opening table: 'mysql'.'slave_master_info' item: 0x7f1e183cbb60
...
2772 T@16: | | | | | | | <terminate_slave_threads 1749
2773 T@16: | | | | | | | >unlock_slave_threads
2774 T@16: | | | | | | | <unlock_slave_threads 341
2775 T@16: | | | | | | | info: 0x60bfee0.unlock()
2776 T@16: | | | | | | <stop_slave(THD, Master_info, bool, bool 10272
总结:
- 进入Killing slave状态
- 进入Flushing relay-log info file状态,更新mysql.slave_relay_log_info表,关闭SQL线程
- 进入Flushing relay log and master info repository状态,更新mysql.slave_master_info表,关闭IO线程
- 堆栈分析
执行 stop slave的用户线程
# python pstack_analytic.py -i termin -f stop_slave_hang_pstack.log
Thread: 28
#0 0x00007f96ff3d5cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000f0eff3 in native_cond_timedwait (abstime=0x7f8cab9fc3f0, mutex=0x93baa528, cond=0x93baa628) at /export/home/pb2/build/sb_0-33648028-1555164244.06/mysql-5.7.26/include/thr_cond.h:129
#2 my_cond_timedwait (abstime=0x7f8cab9fc3f0, mp=0x93baa528, cond=0x93baa628) at /export/home/pb2/build/sb_0-33648028-1555164244.06/mysql-5.7.26/include/thr_cond.h:182
#3 inline_mysql_cond_timedw