MySQL Deadlock Exhausting the Thread Pool: A Production Incident of a Database Deadlock Causing RPC Timeouts and Repeated Container Restarts

Symptoms

July 24, 2020 was a Friday. Starting at 8:30 PM, many HTTP endpoints began timing out, and monitoring showed a large number of RPC call timeouts. Yet the business-logic latency printed in the monitoring logs looked short, so for a while we suspected the RPC framework was hitting performance problems under high QPS.

MySQL Monitoring

We did not find the real culprit until we looked at the MySQL monitoring: both the thread count and the CPU usage of the MySQL instance had spiked. The number of running threads jumped from the normal 5–6 to a peak of 431, CPU usage climbed to a peak of 1547%, and 15 slow queries appeared, while transaction TPS barely moved, rising only from 6 to 9.

Figure 1: QPS spikes to 431.
Figure 2: TPS rises from 6 to 9.
Figure 3: CPU climbs from 27% to 1547%.
Figure 4: 15 slow queries appear.
Figure 5: running threads spike to 431.

Root Cause Analysis

It became clear that slow queries on the MySQL instance had dragged down business-logic processing. Those slow requests occupied the RPC server's thread pool, so other RPC requests could not be handled, which is why the client side saw massive timeouts. The logs did not reveal this because we had only been looking at part of the recorded latency and missed the key measurements that would have shown where processing actually slowed down.
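The starvation mechanism above can be sketched with a tiny thread pool: a few handlers stuck on slow queries saturate the pool, and a cheap request times out before it even starts. The pool size, sleep times, and deadline below are illustrative, not the real service's values.

```python
import concurrent.futures
import time

# A small, shared server-side thread pool (size 2 here; real RPC server
# pools are larger, but the failure mode is the same).
pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def slow_handler():
    time.sleep(1.0)      # stands in for a lock-wait on MySQL
    return "slow done"

def fast_handler():
    return "fast done"   # the business logic itself is cheap

for _ in range(2):
    pool.submit(slow_handler)        # pool is now fully occupied
fast = pool.submit(fast_handler)     # queued behind the slow handlers

try:
    fast.result(timeout=0.2)         # client-side RPC deadline
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
print(timed_out)  # True: the cheap request timed out while still queued
```

This matches what we observed: the logged handler latency stayed low because the fast handlers never ran long, while the client saw timeouts caused purely by queueing behind slow, lock-bound requests.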

So which requests caused the slow queries? Using MySQL's one-click comprehensive diagnosis feature, we inspected the cluster's current locks and found the following deadlock. Figure 6: comprehensive diagnosis of the database instance.

One of the SQL statements is excerpted below:

UPDATE creation_red_envelope_lottery t1
SET t1.user_id = 559276217082693,
    t1.activity_id = 310,
    t1.status = 1,
    t1.reward_id = '836552413749253875'
WHERE t1.id = (
    SELECT id FROM (
        SELECT id
        FROM creation_red_envelope_lottery t2
        WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id = 168 AND t2.strategy_id = 0)
          AND (SELECT IFNULL(SUM(t3.price), 0)
               FROM creation_red_envelope_lottery t3
               WHERE TO_DAYS(modified_time) = TO_DAYS(NOW())
                 AND t3.status <> 0) < 510000
          AND (SELECT IFNULL(SUM(t4.price), 0)
               FROM creation_red_envelope_lottery t4
               WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE(), '%Y%m')
                 AND t4.status <> 0) < 6000000
          AND (SELECT IFNULL(SUM(t5.price), 0)
               FROM creation_red_envelope_lottery t5
               WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE(), '%Y%m')
                 AND t5.user_id = 559276217082693
                 AND t5.status <> 0) < 80000
          AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6
                          WHERE t6.reward_id = '836552413749253875')
          AND t2.status = 0
        LIMIT 1
    ) a
)

From this SQL it was clear that the red-envelope drawing logic was hitting lock waits.

Further diagnosis is shown below. Figure 7: intelligent DBA diagnosis.

The execution details of the two deadlocked SQL statements are excerpted below:

{
  "host": "10.49.56.123",
  "port": 3533,
  "deadlock_time": "2020-07-25 00:04:11",
  "deadlock_id": "0x7f5e69863700",
  "has_deaklock": 1,
  "trx": [
    {
      "state": "starting index read",
      "id": "157506765",
      "thread_id": "155553100",
      "query_id": "17616972479",
      "client_ip": "10.62.40.110",
      "db": "muses",
      "sql": "UPDATE creation_red_envelope_lottery t1 SET t1.user_id =833216680297260, t1.activity_id =336, t1.status =1, t1.reward_id ='836557418986779319' WHERE t1.id = (SELECT id FROM (SELECT id FROM creation_red_envelope_lottery t2 WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id =168 AND t2.strategy_id =3) AND (SELECT ifnull(SUM(t3.price),0) FROM creation_red_envelope_lottery t3 WHERE to_days(modified_time) = to_days(now()) AND t3.status <>0) < 510000 AND (SELECT ifnull(SUM(t4.price),0) FROM creation_red_envelope_lottery t4 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t4.status <>0) < 6000000 AND (SELECT ifnull(SUM(t5.price),0) FROM creation_red_envelope_lottery t5 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t5.user_id =833216680297260 AND t5.status <>0) < 80000 AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6 WHERE t6.rew",
      "execute_result": "normal"
    },
    {
      "state": "starting index read",
      "id": "157506794",
      "thread_id": "155551800",
      "query_id": "17616974824",
      "client_ip": "10.52.40.66",
      "db": "muses",
      "sql": "UPDATE creation_red_envelope_lottery t1 SET t1.user_id =833216680297260, t1.activity_id =336, t1.status =1, t1.reward_id ='836557419240534136' WHERE t1.id = (SELECT id FROM (SELECT id FROM creation_red_envelope_lottery t2 WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id =168 AND t2.strategy_id =3) AND (SELECT ifnull(SUM(t3.price),0) FROM creation_red_envelope_lottery t3 WHERE to_days(modified_time) = to_days(now()) AND t3.status <>0) < 510000 AND (SELECT ifnull(SUM(t4.price),0) FROM creation_red_envelope_lottery t4 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t4.status <>0) < 6000000 AND (SELECT ifnull(SUM(t5.price),0) FROM creation_red_envelope_lottery t5 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t5.user_id =833216680297260 AND t5.status <>0) < 80000 AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6 WHERE t6.rewa",
      "execute_result": "rollback"
    }
  ]
}

It is clear that the WHERE conditions of these two UPDATE statements overlap. Because the two statements executed concurrently, each locked a different range first, and the range each had locked was exactly what the other statement needed to lock next, so a deadlock occurred.
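The cycle can be sketched with two ordinary locks standing in for the overlapping index ranges: each "transaction" grabs one lock and then waits for the other's. In InnoDB the deadlock detector breaks the cycle by rolling one transaction back (the "rollback" result seen above); the sketch below approximates that with a lock-wait timeout.

```python
import threading

# Two "row locks" standing in for the overlapping index ranges on
# creation_red_envelope_lottery.
row_a = threading.Lock()
row_b = threading.Lock()
barrier = threading.Barrier(2)   # force the problematic interleaving
results = {}

def txn(name, first, second):
    # Each transaction locks one range, then tries to lock the other,
    # mimicking the two concurrent UPDATE statements.
    with first:
        barrier.wait()           # both transactions now hold one lock
        # InnoDB would detect the cycle and pick a victim; here a
        # lock-wait timeout stands in for that.
        if second.acquire(timeout=0.2):
            second.release()
            results[name] = "normal"
        else:
            results[name] = "rollback"

t1 = threading.Thread(target=txn, args=("trx1", row_a, row_b))
t2 = threading.Thread(target=txn, args=("trx2", row_b, row_a))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # each waited on the other; at least one ends in "rollback"
```

The classic remedy, besides shrinking the locked ranges, is to make every transaction acquire locks in the same order so no cycle can form.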

Since our MySQL instance is shared by multiple projects, once one project's slow queries drove up the instance's CPU, writes from almost every project slowed down and timed out.

To recap, these were the steps we took online after the incident:

1) Once we had confirmed the problem was a database deadlock, we urgently asked the DBA team to enable server-side lock detection on MySQL, which automatically kills the offending thread when it detects a deadlock or a slow query caused by lock waits.

2) Fixed the SQL that caused the slow queries by splitting the single large statement into two queries.

3) Redeployed.
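Step 2 above can be sketched as follows: instead of one giant UPDATE whose correlated subqueries scan and lock wide ranges, first SELECT a candidate envelope id (running the budget checks separately), then claim it with a narrow UPDATE keyed on the primary key, guarded by `status = 0` so the update is a no-op if another transaction wins the race. The schema below is a simplified stand-in using sqlite3, not the production table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE creation_red_envelope_lottery (
    id INTEGER PRIMARY KEY, status INTEGER, price INTEGER,
    user_id INTEGER, reward_id TEXT)""")
conn.execute(
    "INSERT INTO creation_red_envelope_lottery VALUES (1, 0, 100, NULL, NULL)")

# Query 1: pick one unclaimed envelope (the daily/monthly budget checks
# from the original statement would run here as separate reads).
row = conn.execute(
    "SELECT id FROM creation_red_envelope_lottery WHERE status = 0 LIMIT 1"
).fetchone()

claimed = False
if row:
    # Query 2: claim it by primary key; the `status = 0` guard makes
    # this lose gracefully if a concurrent transaction got there first.
    cur = conn.execute(
        "UPDATE creation_red_envelope_lottery "
        "SET user_id = ?, status = 1, reward_id = ? "
        "WHERE id = ? AND status = 0",
        (559276217082693, "836552413749253875", row[0]))
    claimed = cur.rowcount == 1
print(claimed)  # True: this transaction claimed the envelope
```

The key property is that the write now locks a single row located by primary key, so two concurrent draws either pick different rows or one of them simply updates zero rows and retries, instead of deadlocking on overlapping scan ranges.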
