多线程回放+flush tables with read lock 死锁

多线程回放+flush tables with read lock 死锁

一、场景描述

MySQL-5.7.18 slave实例上夜间进行备份操作时,

processlist 结果,只列出关键部分

mysql> show processlist;l
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
| Id | User        | Host      | db                 | Command | Time | State                                                  | Info                                       |
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
|  3 | root        | localhost | test_dead_lock     | Query   | 2264 | Waiting for commit lock                                | flush tables with read lock                |
|  4 | system user |           | NULL               | Connect | 2708 | Waiting for master to send event                       | NULL                                       |
|  5 | system user |           | NULL               | Connect | 2289 | Slave has read all relay log; waiting for more updates | NULL                                       |
|  6 | system user |           | NULL               | Connect | 2319 | Waiting for global read lock                           | insert into ashe(id,name) values(40,'aaa') |
|  7 | system user |           | NULL               | Connect | 2308 | Waiting for preceding transaction to commit            | NULL                                       |
|  8 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
|  9 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 10 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 11 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 12 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 13 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 14 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 15 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 16 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |

slave相关参数设置

mysql> show global variables like '%slave%'
    -> ;
+-------------------------------------------+-----------------------+
| Variable_name                             | Value                 |
+-------------------------------------------+-----------------------+
| init_slave                                |                       |
| log_slave_updates                         | ON                    |
| log_slow_slave_statements                 | ON                    |
| rpl_semi_sync_master_wait_for_slave_count | 1                     |
| rpl_semi_sync_master_wait_no_slave        | ON                    |
| rpl_semi_sync_slave_enabled               | ON                    |
| rpl_semi_sync_slave_trace_level           | 32                    |
| rpl_stop_slave_timeout                    | 31536000              |
| slave_allow_batching                      | OFF                   |
| slave_checkpoint_group                    | 512                   |
| slave_checkpoint_period                   | 300                   |
| slave_compressed_protocol                 | OFF                   |
| slave_exec_mode                           | STRICT                |
| slave_load_tmpdir                         | /data/mysql/mysql_tmp |
| slave_max_allowed_packet                  | 1073741824            |
| slave_net_timeout                         | 30                    |
| slave_parallel_type                       | LOGICAL_CLOCK         |
| slave_parallel_workers                    | 16                    |
| slave_pending_jobs_size_max               | 16777216              |
| slave_preserve_commit_order               | ON                    |
| slave_rows_search_algorithms              | INDEX_SCAN,HASH_SCAN  |
| slave_skip_errors                         | OFF                   |
| slave_sql_verify_checksum                 | ON                    |
| slave_transaction_retries                 | 128                   |
| slave_type_conversions                    |                       |
| sql_slave_skip_counter                    | 0                     |
+-------------------------------------------+-----------------------+

此slave实例没有为业务提供查询服务,凌晨会有xtrabackup备份操作。

二、死锁排查

从processlist中可以看到如下信息

  1. 线程3是xtrabackup 下发的flush tables with read lock 被其他线程阻塞,一般出现这个问题,证明是有慢语句存在。
  2. 线程6是sql回放线程,被线程3阻塞
  3. 线程7的状态为 Waiting for preceding transaction to commit

排查思路

  1. Waiting for preceding transaction to commit的意思
    这条信息对应如下的PSI_stage_info
PSI_stage_info stage_worker_waiting_for_its_turn_to_commit= { 0, "Waiting for preceding transaction to commit", 0};

查看合适进入stage_worker_waiting_for_its_turn_to_commit的状态,是在Commit_order_manager::wait_for_its_turn中

bool Commit_order_manager::wait_for_its_turn(Slave_worker *worker,
                                                  bool all)
{
  DBUG_ENTER("Commit_order_manager::wait_for_its_turn");

  /*
    When prior transaction fail, current trx should stop and wait for signal
    to rollback itself
  */
  if ((all || ending_single_stmt_trans(worker->info_thd, all) || m_rollback_trx) &&
      m_workers[worker->id].status == OCS_WAIT)
  {
    PSI_stage_info old_stage;
    mysql_cond_t *cond= &m_workers[worker->id].cond;
    THD *thd= worker->info_thd;

    DBUG_PRINT("info", ("Worker %lu is waiting for commit signal", worker->id));

    mysql_mutex_lock(&m_mutex);
    thd->ENTER_COND(cond, &m_mutex,
                    &stage_worker_waiting_for_its_turn_to_commit,
                    &old_stage);

其实Commit_order_manager这个类是为了解决slave并行回放时,slave binlog乱序的问题。就是slave的binlog提交顺序严格按照主库上的提交顺序来,那么如果两个事务可以并行回放,并且编号更大的事务执行的比更小的事务要快,则此事务进入order commit阶段时必须等待事务编号较小的事务。

  1. 根据如上所述可以分析出必然有事务未进入到提交阶段,可以查看其他的sql回放线程
    如下,但是发现此事务已经被flush tables阻塞了,
|  6 | system user |           | NULL               | Connect | 2319 | Waiting for global read lock                           | insert into ashe(id,name) values(40,'aaa') |

其实出问题的时候,可以通过查询全局的gtid_owned来验证猜测,如下:

 gtid_owned                       | cdfe45e6-c227-11e8-abf5-001c42bf9720:19#6:20#7

可以看到线程6当前执行的事务的gtid的sid为19,线程7的为20,线程7确实是在等待线程6.

3.查看线程6的状态

 Waiting for global read lock

这证明是在sql执行时,打开表被阻塞。

4.基本上可以定位到问题了,

  • flush tables with read lock 被线程7阻塞

  • 线程7要等待线程6,保证顺序提交

  • 线程6等待 flush tables with read lock。

三、解决办法

猜想一下,能用什么办法解决呢?

  • 假设一:stop slave
  • 假设二:kill flush tables
  • 假设三:kill sql回放线程

四、如何复现的?

需要对多线程复制,事务提交等有比较清晰的认识才可以。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值