mysql启用组提交变量_MySQL的COMMIT_ORDER模式下组提交分组实现与BUG案例源码剖析...

背景

自MySQL 5.7以来,组提交大面积应用,已经不断地得到优化。但网上有关组提交的实现机制,却还不够详细。故障多的时候,往往会发生一些模棱两可的揣测和猜疑。因此,笔者有了从自己的角度,去分析组提交实现的动机。

源码分析

以“last_committed”为入口,搜索MySQL 5.7.24源码,很快可以定位到关键类Transaction_dependency_tracker。梳理一下该类的虚函数实现以及调用位置,基本就理解了大部分实现。如下:

以此类为基础,再梳理调用过程,结合前面文章《MySQL的after_sync与after_commit性能之争源码剖析与情景测试》的分析,基本就可以梳理出具体的实现细节。基本实现

通过类Commit_order_trx_dependency_tracker的两个成员变量来基本实现:m_max_committed_transaction,保存已提交的最大事务号;m_transaction_counter,保存已prepare的最大事务号。获取last_committed与sequence_number值

如上图所示,实现该功能的是get_dependency函数,把last_committed与sequence_number的绝对值转换成相对值。而这个转换恰巧是在Binlog的flush阶段MYSQL_BIN_LOG::write_gtid函数中,生成GTID事件之前。更新m_max_committed_transaction值

update_max_committed函数的调用发生在commit队列处理函数MYSQL_BIN_LOG::process_commit_stage_queue中,遍历每个线程提交最开始的时候。线程获取m_max_committed_transaction值

对于非事务型语句(如DDL)来说,在binlog_prepare中获取last_committed绝对值。而对于事务型语句来说,则在提交的函数MYSQL_BIN_LOG::commit中,执行ordered_commit函数之前获取last_committed绝对值。sequence_number值累加

step()函数的调用也是在Binlog的flush阶段,binlog_cache_data::flush函数中,不过是在MYSQL_BIN_LOG::write_gtid函数之前,即把绝对值转换成相对值之前。具体实现为,sequence_number每次加1。

梳理出来的流程如下所示:

案例分析

初看时,想当然的会以为,两个DDL进入了同一组导致了死锁。然后就会着重分析为什么在主库被分配了同一组,而在从库却又冲突了。从上图的分析可以了解,获取last_committed值是在执行ordered_commit函数之前。这个时候,线程所有持有的MDL锁并未释放,所以两者存在冲突,根本就不可能同时进入提交。但仔细回头一看,就会发现不对,两个语句根本就不是DDL,而是DCL呀!为了了解DCL中关于MDL锁的释放位置,就在测试环境抓取分别抓取它们的debug日志来分析。flush privileges

...

// 第一部分3343 T@5: | | | | | >MDL_context::release_transactional_locks

3344 T@5: | | | | | | >MDL_context::release_locks_stored_before

3345 T@5: | | | | | | <:release_locks_stored_before>

3346 T@5: | | | | | | >MDL_context::release_locks_stored_before

3347 T@5: | | | | | | | info: found lock to release ticket=0x7f3c64012410

3348 T@5: | | | | | | | >MDL_context::release_lock

3349 T@5: | | | | | | | | enter: db=mysql name=proxies_priv

3350 T@5: | | | | | | | <:release_lock>

3351 T@5: | | | | | | | info: found lock to release ticket=0x7f3c64012320

3352 T@5: | | | | | | | >MDL_context::release_lock

3353 T@5: | | | | | | | | enter: db=mysql name=db

3354 T@5: | | | | | | | <:release_lock>

3355 T@5: | | | | | | | info: found lock to release ticket=0x7f3c6400e860

3356 T@5: | | | | | | | >MDL_context::release_lock

3357 T@5: | | | | | | | | enter: db=mysql name=user

3358 T@5: | | | | | | | <:release_lock>

3359 T@5: | | | | | | <:release_locks_stored_before>

3360 T@5: | | | | | <:release_transactional_locks>

3361 T@5: | | | |

...

// 第二部分5073 T@5: | | | | | >MDL_context::release_transactional_locks

5074 T@5: | | | | | | >MDL_context::release_locks_stored_before

5075 T@5: | | | | | | <:release_locks_stored_before>

5076 T@5: | | | | | | >MDL_context::release_locks_stored_before

5077 T@5: | | | | | | | info: found lock to release ticket=0x7f3c640308e0

5078 T@5: | | | | | | | >MDL_context::release_lock

5079 T@5: | | | | | | | | enter: db=mysql name=procs_priv

5080 T@5: | | | | | | | <:release_lock>

5081 T@5: | | | | | | | info: found lock to release ticket=0x7f3c640273e0

5082 T@5: | | | | | | | >MDL_context::release_lock

5083 T@5: | | | | | | | | enter: db=mysql name=columns_priv

5084 T@5: | | | | | | | <:release_lock>

5085 T@5: | | | | | | | info: found lock to release ticket=0x7f3c6400e860

5086 T@5: | | | | | | | >MDL_context::release_lock

5087 T@5: | | | | | | | | enter: db=mysql name=tables_priv

5088 T@5: | | | | | | | <:release_lock>

5089 T@5: | | | | | | <:release_locks_stored_before>

5090 T@5: | | | | | <:release_transactional_locks>

5091 T@5: | | | |

...

// 第三部分5641 T@5: | | | | | >restore_backup_open_tables_state

5642 T@5: | | | | | | >MDL_context::rollback_to_savepoint

5643 T@5: | | | | | | | >MDL_context::release_locks_stored_before

5644 T@5: | | | | | | | <:release_locks_stored_before>

5645 T@5: | | | | | | | >MDL_context::release_locks_stored_before

5646 T@5: | | | | | | | | info: found lock to release ticket=0x7f3c6400e860

5647 T@5: | | | | | | | | >MDL_context::release_lock

5648 T@5: | | | | | | | | | enter: db=mysql name=servers

5649 T@5: | | | | | | | | <:release_lock>

5650 T@5: | | | | | | | <:release_locks_stored_before>

5651 T@5: | | | | | | <:rollback_to_savepoint>

5652 T@5: | | | | |

5653 T@5: | | | | | >my_hash_free

5654 T@5: | | | | | | enter: hash: 0x7f3c64002eb0

5655 T@5: | | | | |

5656 T@5: | | | | | info: unlocking servers_cache

5657 T@5: | | | |

...

// 第四部分5764 T@5: | | | | >trans_commit_stmt

5765 T@5: | | | | | debug: add_unsafe_rollback_flags: 0

5766 T@5: | | | | | >MYSQL_BIN_LOG::commit

5767 T@5: | | | | | | info: query='flush privileges'

...

以上分别截取了三部分:第一部分,对mysql.proxies_priv、mysql.db以及mysql.user释放MDL锁;第二部分,对mysql.procs_priv、mysql.columns_priv以及mysql.tables_priv释放MDL锁;第三部分,对mysql.servers释放MDL锁;第四部分,提交。由此,可以发现,MDL锁在进入提交之前就已经释放了。为了进一步确认,于是对函数MDL_context::acquire_lock设置断点,调试结果如下:

因此,可以发现,获取的都是SR锁。grant privileges

...

// 第一部分2477 T@4: | | | | | >trans_commit_stmt

2478 T@4: | | | | | | debug: add_unsafe_rollback_flags: 0

2479 T@4: | | | | | | >MYSQL_BIN_LOG::commit

2480 T@4: | | | | | | | info: query='grant process on *.* to 'slave'@'192.168.10.%''

...

2724 T@4: | | | | | | | | | >THD::enter_cond

2725 T@4: | | | | | | | | | | THD::enter_stage: 'Waiting for semi-sync ACK from slave' /data/mysql-5.7.24/plugin/semisync/semisync_master.cc:735

2726 T@4: | | | | | | | | | | >PROFILING::status_change

2727 T@4: | | | | | | | | | | <:status_change>

2728 T@4: | | | | | | | | | <:enter_cond>

...

2816 T@4: | | | | |

2817 T@4: | | | | | >trans_commit_implicit

...

2843 T@4: | | | | |

// 第二部分2950 T@4: | | | | | >MDL_context::release_transactional_locks

2951 T@4: | | | | | | >MDL_context::release_locks_stored_before

2952 T@4: | | | | | | | info: found lock to release ticket=0x7f3c7000e9b0

2953 T@4: | | | | | | | >MDL_context::release_lock

2954 T@4: | | | | | | | | enter: db= name=

2955 T@4: | | | | | | | <:release_lock>

2956 T@4: | | | | | | <:release_locks_stored_before>

2957 T@4: | | | | | | >MDL_context::release_locks_stored_before

2958 T@4: | | | | | | | info: found lock to release ticket=0x7f3c7000e760

2959 T@4: | | | | | | | >MDL_context::release_lock

2960 T@4: | | | | | | | | enter: db=mysql name=db

2961 T@4: | | | | | | | <:release_lock>

2962 T@4: | | | | | | | info: found lock to release ticket=0x7f3c7000ea10

2963 T@4: | | | | | | | >MDL_context::release_lock

2964 T@4: | | | | | | | | enter: db=mysql name=user

2965 T@4: | | | | | | | <:release_lock>

2966 T@4: | | | | | | <:release_locks_stored_before>

2967 T@4: | | | | | <:release_transactional_locks>

...

以上分别截取了两部分:第一部分执行提交;第二部分,对mysql.db和mysql.user释放锁。依照上面的方式,同样进行调试,结果如下:

此时,可以发现,上述debug日志中的第一个锁为IX类型的GLOBAL锁,其它两个都是SW锁。死锁分析

从以上的分析,基本可以认定flush privileges和grant privileges的执行过程有很大的不同,如果SW锁和SR锁不兼容的话,基本可以认定,前者先执行,后者可以并行,反之则不行。为此,进行一下测试(让语句处理ACK等待状态),先flush privileges再grant privileges,

mysql> show processlist;

+-----+------+-----------+------+---------+------+--------------------------------------+------------------------------------------------+| Id | User | Host | db | Command | Time | State | Info |

+-----+------+-----------+------+---------+------+--------------------------------------+------------------------------------------------+| 196 | root | localhost | NULL | Query | 2449 | Waiting for semi-sync ACK from slave | flush privileges |

| 197 | root | localhost | NULL | Query | 2430 | checking permissions | grant process on *.* to 'slave'@'192.168.10.%' |

| 198 | root | localhost | NULL | Query | 0 | starting | show processlist |

+-----+------+-----------+------+---------+------+--------------------------------------+------------------------------------------------+3 rows in set (0.00 sec)

线程197的堆栈信息如下:

反之,先执行grant privileges再flush privileges,

mysql> show processlist;

+-----+-------+--------------------+------+------------------+------+---------------------------------------------------------------+------------------------------------------------+| Id | User | Host | db | Command | Time | State | Info |

+-----+-------+--------------------+------+------------------+------+---------------------------------------------------------------+------------------------------------------------+| 196 | root | localhost | NULL | Query | 12 | Waiting for table level lock | flush privileges |

| 197 | root | localhost | NULL | Query | 23 | Waiting for semi-sync ACK from slave | grant process on *.* to 'slave'@'192.168.10.%' |

| 198 | root | localhost | NULL | Query | 0 | starting | show processlist |

| 199 | slave | 192.168.10.4:39721 | NULL | Binlog Dump GTID | 133 | Master has sent all binlog to slave; waiting for more updates | NULL |

+-----+-------+--------------------+------+------------------+------+---------------------------------------------------------------+------------------------------------------------+4 rows in set (0.00 sec)

线程196的堆栈信息如下:

通过以上的堆栈信息,可以惊讶地发现,线程196已经进入了mysql_lock_tables函数,即已经拿到了MDL锁,足以说明SR锁和SW锁是兼容的。通过上面的函数调用,不难发现,它们正在等待的,是表锁【InnoDB实现的表锁,存储引擎层只定义了表锁实现接口】。通过对比debug日志,不难发现,该锁的释放时间与MDL锁是紧贴在MDL锁前面的,因而以上的结论是成立的,只是对象由MDL锁变成了InnoDB表锁。故障复现

主库设置:注意:binlog_group_commit_sync_delay最大值为100万微妙,即1s。

mysql> show variables like '%group_commit%';

+-----------------------------------------+---------+| Variable_name | Value |

+-----------------------------------------+---------+| binlog_group_commit_sync_delay | 1000000 |

| binlog_group_commit_sync_no_delay_count | 1000 |

+-----------------------------------------+---------+2 rows in set (0.01 sec)

执行脚本:在主库先执行flush privileges再执行grant privileges以保证进入同一组;然后线程1睡眠0.5s以保证flush privileges先获取sequence_number值,即在从库先提交;最后,在从库让先执行的线程后提交。

import threading

import time

import os

def thread_1():

os.system('date +%M:%S')

time.sleep(0.5)

os.system("/usr/local/mysql/bin/mysql -uroot -ps3cret -e\"grant process on *.* to 'slave'@'192.168.10.%';\"")

os.system('date +%M:%S')

def thread_2():

os.system('date +%M:%S')

os.system("/usr/local/mysql/bin/mysql -uroot -ps3cret -e 'flush privileges'")

os.system('date +%M:%S')

t1 = threading.Thread(target=thread_2)

t1.start()

t2 = threading.Thread(target=thread_1)

t2.start()

从库现象:结论

flush privileges设计的提交前就释放锁的机制导致了死锁,依次类推,针对mysql.user表的DML语句与flush privileges配合,也有可能产生死锁。

总结

组提交的实现依赖于两阶段提交,因而对于非事务性语句来说,锁的释放可能会出现在事务提交前,因而可能会出现本来冲突的两个语句获得同一个last_committed,即被分配到了同一组,进而导致了并行复制死锁。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值