In-Depth Analysis of a MySQL Hang with Zero CPU Usage

Version:

Percona MySQL 5.6.45.

Symptoms

The master instance stopped reporting monitoring data, new connections could not be established, and the mysqld process consumed no CPU. External tools could not extract any further information. This instance had hit the same condition several times before and had each time been recovered by killing and restarting the process; this time, business traffic was switched away and the scene was preserved, leaving an environment that could be operated on freely. The operations team wanted the root cause identified so that the problem could be avoided going forward.

Initial Analysis

A quick check of machine I/O, CPU, and memory, plus an inspection of the data files and binlog files, showed nothing abnormal apart from an occasional very large binlog file. Logging in locally on the instance's own host also hung, which clearly indicated a deadlock: a lock taken during connection establishment was being held by some other thread and never released, so no one could log in. This was not an InnoDB deadlock but a deadlock outside the program's expected logic — the kind that usually points to a bug in the code.

Deep Dive

When external tools can no longer reveal the database's state, the last resort is to dump the stacks of the mysqld process and analyze each thread's context against the source code of the matching MySQL version. The key threads turned out to be Thread 59, Thread 70, Thread 71, Thread 74, and Thread 76; their stacks are as follows:

Thread 59 (Thread 0x7f53445a6700 (LWP 446644)):
#0  0x00007f5c3369b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5c3369dcdc in _L_cond_lock_847 () from /lib64/libpthread.so.0
#2  0x00007f5c3369db71 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00007f5c33698de4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x00000000008ca3c0 in inline_mysql_cond_timedwait (src_file=0xd3a698 "/root/percona-server-5.6.45-86.1/sql/binlog.cc", src_line=6551, abstime=0x7f53445a4390, mutex=0x13f5288 <mysql_bin_log+8>, that=0x13f57c8 <mysql_bin_log+1352>) at /root/percona-server-5.6.45-86.1/include/mysql/psi/mysql_thread.h:1199
#5  MYSQL_BIN_LOG::wait_for_update_bin_log (this=0x13f5280 <mysql_bin_log>, thd=thd@entry=0x438b410, timeout=timeout@entry=0x7f53445a4390) at /root/percona-server-5.6.45-86.1/sql/binlog.cc:6551
#6  0x00000000008dcd93 in mysql_binlog_send (thd=thd@entry=0x438b410, log_ident=0x7f52a4000bd0 "mysql-bin.000289", pos=15336, slave_gtid_executed=slave_gtid_executed@entry=0x0, flags=flags@entry=2) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:1748
#7  0x00000000008dd7c7 in com_binlog_dump (thd=thd@entry=0x438b410, packet=packet@entry=0x438f1d1 "", packet_length=packet_length@entry=26) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:765
#8  0x0000000000710907 in dispatch_command (command=COM_BINLOG_DUMP, thd=0x438b410, packet=0x438f1d1 "", packet_length=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1739
#9  0x0000000000712b41 in do_command (thd=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1093
#10 0x00000000006d764d in do_handle_one_connection (thd_arg=thd_arg@entry=0x438b410) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1588
#11 0x00000000006d7699 in handle_one_connection (arg=arg@entry=0x438b410) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1492
#12 0x000000000095c9c3 in pfs_spawn_thread (arg=0x4393600) at /root/percona-server-5.6.45-86.1/storage/perfschema/pfs.cc:1861
#13 0x00007f5c33694dd5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f5c3235a02d in clone () from /lib64/libc.so.6

Thread 70 (Thread 0x7f534487c700 (LWP 446632)):
#0  0x00007f5c3369b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5c33696dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f5c33696c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000006ce4d8 in inline_mysql_mutex_lock (src_file=0xcb0c40 "/root/percona-server-5.6.45-86.1/sql/sql_class.cc", src_line=2031, that=0x13f5288 <mysql_bin_log+8>) at /root/percona-server-5.6.45-86.1/include/mysql/psi/mysql_thread.h:690
#4  THD::awake (this=this@entry=0x4317f20, state_to_set=state_to_set@entry=THD::KILL_QUERY) at /root/percona-server-5.6.45-86.1/sql/sql_class.cc:2031
#5  0x00000000008dd4c3 in kill_zombie_dump_threads (thd=thd@entry=0x436a2b0) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:2228
#6  0x00000000008dd786 in com_binlog_dump (thd=thd@entry=0x436a2b0, packet=packet@entry=0x4313241 "\350;", packet_length=packet_length@entry=26) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:761
#7  0x0000000000710907 in dispatch_command (command=COM_BINLOG_DUMP, thd=0x436a2b0, packet=0x4313241 "\350;", packet_length=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1739
#8  0x0000000000712b41 in do_command (thd=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1093
#9  0x00000000006d764d in do_handle_one_connection (thd_arg=thd_arg@entry=0x4330830) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1588
#10 0x00000000006d7699 in handle_one_connection (arg=arg@entry=0x4330830) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1492
#11 0x000000000095c9c3 in pfs_spawn_thread (arg=0x430aa20) at /root/percona-server-5.6.45-86.1/storage/perfschema/pfs.cc:1861
#12 0x00007f5c33694dd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5c3235a02d in clone () from /lib64/libc.so.6

Thread 71 (Thread 0x7f53448be700 (LWP 446631)):
#0  0x00007f5c3369b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5c33696e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f5c33696da2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000744b22 in inline_mysql_mutex_lock (src_file=0xcbd9d0 "/root/percona-server-5.6.45-86.1/sql/sql_show.cc", src_line=3117, that=<optimized out>) at /root/percona-server-5.6.45-86.1/include/mysql/psi/mysql_thread.h:690
#4  calc_sum_of_all_status (to=to@entry=0x7f53448bbdc0) at /root/percona-server-5.6.45-86.1/sql/sql_show.cc:3117
#5  0x0000000000745eba in fill_status (thd=thd@entry=0x4329070, tables=tables@entry=0x7f52f0005130, cond=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_show.cc:7394
#6  0x000000000074c5da in do_fill_table (join_table=0x7f52f00067b8, table_list=0x7f52f0005130, thd=0x4329070) at /root/percona-server-5.6.45-86.1/sql/sql_show.cc:8065
#7  get_schema_tables_result (join=join@entry=0x7f52f0005a90, executed_place=executed_place@entry=PROCESSED_BY_JOIN_EXEC) at /root/percona-server-5.6.45-86.1/sql/sql_show.cc:8166
#8  0x000000000072d0c5 in JOIN::prepare_result (this=this@entry=0x7f52f0005a90, columns_list=columns_list@entry=0x7f53448bc590) at /root/percona-server-5.6.45-86.1/sql/sql_select.cc:823
#9  0x00000000006e3851 in JOIN::exec (this=0x7f52f0005a90) at /root/percona-server-5.6.45-86.1/sql/sql_executor.cc:116
#10 0x000000000072d6d5 in mysql_execute_select (free_join=true, select_lex=0x432b928, thd=0x4329070) at /root/percona-server-5.6.45-86.1/sql/sql_select.cc:1101
#11 mysql_select (thd=thd@entry=0x4329070, tables=0x7f52f0005130, wild_num=0, fields=..., conds=0x0, order=order@entry=0x432baf0, group=group@entry=0x432ba28, having=0x0, select_options=2684619520, result=result@entry=0x7f52f0005a68, unit=unit@entry=0x432b2e0, select_lex=select_lex@entry=0x432b928) at /root/percona-server-5.6.45-86.1/sql/sql_select.cc:1222
#12 0x000000000072dfe5 in handle_select (thd=thd@entry=0x4329070, result=result@entry=0x7f52f0005a68, setup_tables_done_option=setup_tables_done_option@entry=0) at /root/percona-server-5.6.45-86.1/sql/sql_select.cc:110
#13 0x000000000057a1ab in execute_sqlcom_select (thd=thd@entry=0x4329070, all_tables=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:5780
#14 0x000000000070b53e in mysql_execute_command (thd=thd@entry=0x4329070) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:3058
#15 0x000000000070f568 in mysql_parse (thd=thd@entry=0x4329070, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7f53448bd650, update_userstat=update_userstat@entry=false) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:7062
#16 0x0000000000710b74 in dispatch_command (command=<optimized out>, thd=0x4329070, packet=0x432c821 "show global status like 'Uptime'", packet_length=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1524
#17 0x0000000000712b41 in do_command (thd=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1093
#18 0x00000000006d764d in do_handle_one_connection (thd_arg=thd_arg@entry=0x4329070) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1588
#19 0x00000000006d7699 in handle_one_connection (arg=arg@entry=0x4329070) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1492
#20 0x000000000095c9c3 in pfs_spawn_thread (arg=0x4328d70) at /root/percona-server-5.6.45-86.1/storage/perfschema/pfs.cc:1861
#21 0x00007f5c33694dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f5c3235a02d in clone () from /lib64/libc.so.6

Thread 74 (Thread 0x7f5344984700 (LWP 446628)):
#0  0x00007f5c3369b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5c33696e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f5c33696da2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000008cd3ca in inline_mysql_mutex_lock (src_file=0xd3a698 "/root/percona-server-5.6.45-86.1/sql/binlog.cc", src_line=8038, that=<optimized out>) at /root/percona-server-5.6.45-86.1/include/mysql/psi/mysql_thread.h:690
#4  MYSQL_BIN_LOG::publish_coordinates_for_global_status (this=0x13f5280 <mysql_bin_log>) at /root/percona-server-5.6.45-86.1/sql/binlog.cc:8038
#5  0x00000000008d253c in MYSQL_BIN_LOG::ordered_commit (this=this@entry=0x13f5280 <mysql_bin_log>, thd=thd@entry=0x42fe510, all=all@entry=false, skip_commit=skip_commit@entry=false) at /root/percona-server-5.6.45-86.1/sql/binlog.cc:7777
#6  0x00000000008d2df4 in MYSQL_BIN_LOG::commit (this=0x13f5280 <mysql_bin_log>, thd=0x42fe510, all=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/binlog.cc:7086
#7  0x00000000005b6ad0 in ha_commit_trans (thd=thd@entry=0x42fe510, all=all@entry=false, ignore_global_read_lock=ignore_global_read_lock@entry=false) at /root/percona-server-5.6.45-86.1/sql/handler.cc:1512
#8  0x000000000079bd39 in trans_commit_stmt (thd=thd@entry=0x42fe510) at /root/percona-server-5.6.45-86.1/sql/transaction.cc:436
#9  0x000000000070bb6b in mysql_execute_command (thd=thd@entry=0x42fe510) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:5676
#10 0x000000000070f568 in mysql_parse (thd=thd@entry=0x42fe510, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7f5344983650, update_userstat=update_userstat@entry=false) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:7062
#11 0x0000000000710b74 in dispatch_command (command=<optimized out>, thd=0x42fe510, packet=0x42fa0a1 "", packet_length=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1524
#12 0x0000000000712b41 in do_command (thd=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1093
#13 0x00000000006d764d in do_handle_one_connection (thd_arg=thd_arg@entry=0x430f670) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1588
#14 0x00000000006d7699 in handle_one_connection (arg=arg@entry=0x430f670) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1492
#15 0x000000000095c9c3 in pfs_spawn_thread (arg=0x4317c20) at /root/percona-server-5.6.45-86.1/storage/perfschema/pfs.cc:1861
#16 0x00007f5c33694dd5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f5c3235a02d in clone () from /lib64/libc.so.6



Thread 76 (Thread 0x7f5bdc0a5700 (LWP 446626)):
#0  0x00007f5c3369b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5c33696e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f5c33696da2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000008dd6a3 in inline_mysql_mutex_lock (src_line=2195, src_file=0xd3e600 "/root/percona-server-5.6.45-86.1/sql/rpl_master.cc", that=0x4318f18) at /root/percona-server-5.6.45-86.1/include/mysql/psi/mysql_thread.h:690
#4  kill_zombie_dump_threads (thd=thd@entry=0x44fc760) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:2195
#5  0x00000000008dd786 in com_binlog_dump (thd=thd@entry=0x44fc760, packet=packet@entry=0x44fff11 "\350;", packet_length=packet_length@entry=26) at /root/percona-server-5.6.45-86.1/sql/rpl_master.cc:761
#6  0x0000000000710907 in dispatch_command (command=COM_BINLOG_DUMP, thd=0x44fc760, packet=0x44fff11 "\350;", packet_length=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1739
#7  0x0000000000712b41 in do_command (thd=<optimized out>) at /root/percona-server-5.6.45-86.1/sql/sql_parse.cc:1093
#8  0x00000000006d764d in do_handle_one_connection (thd_arg=thd_arg@entry=0x42fe510) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1588
#9  0x00000000006d7699 in handle_one_connection (arg=arg@entry=0x42fe510) at /root/percona-server-5.6.45-86.1/sql/sql_connect.cc:1492
#10 0x000000000095c9c3 in pfs_spawn_thread (arg=0x4306ac0) at /root/percona-server-5.6.45-86.1/storage/perfschema/pfs.cc:1861
#11 0x00007f5c33694dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f5c3235a02d in clone () from /lib64/libc.so.6

Analysis of what each thread is doing:

Thread 59 is the binlog send (binlog dump) thread, waiting for the LOCK_log mutex. To send binlog events it must be woken either by the signal that Thread 74 broadcasts after finishing its commit, or by its own wait timeout; but the inter-thread signaling protocol requires that either wake-up path reacquire LOCK_log, as shown in the figure below. At this moment it is waiting for Thread 74 to finish.

Thread 70 is killing a binlog dump connection. It already holds LOCK_thd_data, has notified the target thread to shut down, and is waiting for the target thread to close.

Thread 71 is running SHOW STATUS. It holds LOCK_status and is waiting for LOCK_thread_count: reading the status variables requires LOCK_status, and counting connections additionally requires LOCK_thread_count.

Thread 74 is committing a transaction. It holds LOCK_log and is waiting for LOCK_status. A commit must write the transaction's binlog, so it holds LOCK_log until the binlog write finishes and the waiting threads are signaled; if a single transaction generates a lot of binlog — a large transaction — this lock is held for a long time. After writing the binlog it also has to update status information, which requires LOCK_status, and at that point LOCK_log has not yet been released. So Thread 74 holds LOCK_log while waiting for LOCK_status. The key code logic is shown in the figure below:

Thread 76 is also killing a binlog dump connection. It holds LOCK_thread_count and is waiting for the target thread's LOCK_thd_data.

Threads 70 and 76 are two different stages of the binlog dump thread being asked to shut down more than once, as shown in the figure below:

Thread 76 took LOCK_thread_count, but the lock it wants is held by Thread 70; Thread 70 is waiting for Thread 59, the binlog send thread, to shut itself down; Thread 59 is waiting for LOCK_log held by Thread 74; Thread 74 is waiting for LOCK_status held by Thread 71; and Thread 71 is waiting for LOCK_thread_count held by Thread 76. The deadlock chain is complete: 76 -> 70 -> 59 -> 74 -> 71 -> 76. Five threads locked in a fatal embrace, unable to recover — the server hangs completely, every thread blocked in a lock wait, consuming almost no CPU.
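The wait chain above can be modeled mechanically as a wait-for graph, with the cycle found by following each thread's single outgoing wait edge. A minimal sketch (the thread numbers and lock names come from the stacks above; the graph-walking code itself is illustrative, not MySQL's):

```python
# Wait-for graph of the five threads, as reconstructed from the stacks.
# Each thread waits on a resource held (in effect) by exactly one other thread.
waits_for = {
    76: 70,  # Thread 76 wants LOCK_thd_data of the thread Thread 70 is killing
    70: 59,  # Thread 70 waits for the dump thread (59) to shut itself down
    59: 74,  # Thread 59 wants LOCK_log, held by Thread 74
    74: 71,  # Thread 74 wants LOCK_status, held by Thread 71
    71: 76,  # Thread 71 wants LOCK_thread_count, held by Thread 76
}

def find_cycle(graph, start):
    """Follow the single outgoing wait edge from `start`; return the
    cycle as a list of thread ids, or None if the walk terminates."""
    seen, node = [], start
    while node is not None and node not in seen:
        seen.append(node)
        node = graph.get(node)
    if node in seen:
        return seen[seen.index(node):] + [node]
    return None

print(find_cycle(waits_for, 76))  # [76, 70, 59, 74, 71, 76]
```

Starting from any of the five threads reproduces the same cycle, which is exactly why none of them can make progress.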

Trigger Analysis

On the surface, the deadlock requires a pile of coincidences: a transaction is committing and writing its binlog; a connection happens to be running SHOW STATUS; the binlog send thread happens to be asked to shut down two or more times; and one of those shutdown requests is already stuck waiting — all of them waiting on each other with no way to recover. It sounds like it should be rare, yet in practice it happened quite often.

On this instance, large transactions were common — often several GB or even tens of GB — so writing one transaction's binlog could take tens of seconds or even minutes, during which LOCK_log is held continuously and the binlog send thread keeps waiting for it. At this point the server is still healthy overall: a SHOW GLOBAL STATUS statement can still complete, since nothing holds the locks it needs. If a command to kill the binlog dump thread then arrives, the first kill blocks waiting for the target thread to close, and a second kill blocks on the lock the first kill holds. Even now the server can still recover: once the large transaction finishes writing its binlog and releases LOCK_log, the first kill completes, the second kill completes, the binlog dump thread is recreated, and binlog resumes flowing to the slave. But if a SHOW GLOBAL STATUS statement arrives while the second kill is still pending, the deadlock cycle closes. In short: a transaction large enough to hold LOCK_log for a long time, plus a SHOW GLOBAL STATUS, plus two or more kills of the binlog dump thread — and this deadlock becomes easy to hit.

Digging to the Root Cause

Large transactions come from the business. Even on SSD, writing a binlog file of several GB to tens of GB takes seconds to minutes, and the writer necessarily holds LOCK_log until the write completes.

SHOW GLOBAL STATUS is our monitoring collection statement, issued every 15 seconds — a perfectly normal query.

The binlog send thread waiting for the lock is also normal flow: in 5.6, access to the same binlog file is serialized by LOCK_log — no reading while writing, no writing while reading. That is the internal design.

Killing the binlog dump thread, however, is not a common operation. The kill on the master is initiated by the slave: when STOP SLAVE runs on the slave, it sends a command to the master that kills the binlog dump thread. But nobody ran STOP SLAVE when this instance hung — so where did the kill come from? When the slave receives neither binlog events nor a heartbeat from the master within the heartbeat period, it tries to rebuild replication, and that is when it sends a kill and then reconnects.

While the master is writing a large transaction's binlog, the send thread is stuck waiting and can send neither binlog events nor heartbeats to the slave. If the slave hears nothing for longer than the heartbeat period, it rebuilds replication. The heartbeat period is half of slave_net_timeout, whose default is 3600 seconds, i.e. a default heartbeat period of 1800 seconds. With that default, even an HDD can write hundreds of GB within the window, so a single transaction's binlog write would almost never hold the lock past the heartbeat timeout.

Checking the replication settings, however, showed slave_net_timeout set to 4 seconds — a heartbeat period of just 2 seconds. So whenever the slave received no binlog or heartbeat from the master for 2 seconds, it would ask the master to rebuild replication, the master would kill the replication connection, and under this stack of coincidences the problem eventually occurred.
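To make the timing concrete, the arithmetic above can be sketched as follows (the halving rule is the behavior described in the text; the function names are mine, not MySQL's):

```python
def heartbeat_period(slave_net_timeout: float) -> float:
    # Per the analysis above, the heartbeat period defaults to half of
    # slave_net_timeout.
    return slave_net_timeout / 2

def slave_rebuilds(binlog_write_seconds: float, slave_net_timeout: float) -> bool:
    """True if a single binlog write outlasts the heartbeat period,
    i.e. the slave hears nothing in time and rebuilds replication
    (sending the kill that starts the trouble)."""
    return binlog_write_seconds > heartbeat_period(slave_net_timeout)

# Default: 3600 s timeout -> 1800 s heartbeat; a 2-minute binlog write is safe.
print(slave_rebuilds(binlog_write_seconds=120, slave_net_timeout=3600))  # False

# This instance: 4 s timeout -> 2 s heartbeat; a multi-GB transaction easily
# takes longer than 2 s to write, so the slave reconnects.
print(slave_rebuilds(binlog_write_seconds=120, slave_net_timeout=4))     # True
```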

Optimization Directions

Operations:

Adjust the business logic to reduce transaction size. MySQL does not handle large transactions well; they can cause many kinds of problems.

Increase slave_net_timeout so that the heartbeat period is longer and replication is not rebuilt so frequently in workloads with large transactions.

Server internals:

As the analysis shows, LOCK_log is the crux. It protects the binlog: both writing the binlog and sending it require this lock, so reads and writes conflict. In a one-master-many-slaves topology there are multiple binlog-reading threads, which makes the contention worse. Can this be improved in the code? Clearly yes. It is a classic producer-consumer model, and reads and writes need not conflict at all: it suffices to maintain a binlog end position, and as long as a reader never reads past the position that has already been written, nothing can go wrong. There is no need for mutual exclusion between readers and the writer. The MySQL developers realized this too and reworked the logic in 5.7; the code is shown in the figure below:
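The idea behind the rework can be sketched as a producer-consumer with a published end position: the writer appends under its own lock and then advertises the new end offset; readers consult only that offset and never take the writer's lock. This is an illustrative toy model, not the actual 5.7 code:

```python
import threading

class BinlogModel:
    """Toy model of the position-based scheme: the writer appends and then
    publishes an end position; readers read only up to the published
    position and never take the writer's append lock."""
    def __init__(self):
        self._buf = bytearray()
        self._end_pos = 0                       # published safe-to-read offset
        self._write_lock = threading.Lock()     # serializes writers only
        self._pos_cond = threading.Condition()  # guards/publishes _end_pos

    def append(self, data: bytes) -> None:
        with self._write_lock:          # analogous to LOCK_log, writers only
            self._buf += data
            with self._pos_cond:        # publish the new end, wake readers
                self._end_pos = len(self._buf)
                self._pos_cond.notify_all()

    def read_from(self, offset: int, timeout=None) -> bytes:
        with self._pos_cond:            # wait until data exists past `offset`
            self._pos_cond.wait_for(lambda: self._end_pos > offset, timeout)
            end = self._end_pos
        # Data below `end` is immutable once published, so no write lock
        # is needed here — readers and the writer no longer conflict.
        return bytes(self._buf[offset:end])

log = BinlogModel()
log.append(b"event-1;")
print(log.read_from(0))  # b'event-1;'
```

The key design point is that the writer only ever moves the published position forward after the bytes are in place, so a reader holding no write lock can still never observe a partially written event.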

In 5.7, what the lock maintains is a position rather than file-level mutual exclusion, so reads and writes no longer conflict; upgrading to 5.7 also avoids this problem.

Finally, one small question remains: why did logging in hang with no response at all? The reason is simple: establishing a new connection requires LOCK_thread_count, which could no longer be acquired. The code logic is shown in the figure below:

Summary

In this case, large transactions were the key trigger, and even now that MySQL has evolved to 8.0, large transactions still cause problems easily. OLTP workloads should keep transactions as small as possible and rely on high concurrency of small transactions; within the overall architecture, MySQL should carry the storage rather than the computation, with complex business logic implemented in the upper layers, or at least simplified before it reaches the database.
