Select for update就安全了么?间隙锁导致死锁

背景

有一个短信表:

CREATE TABLE `t_msg` (
	`ID` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
	`Mobile` VARCHAR(20) NOT NULL COMMENT '手机号',
	`Content` VARCHAR(1000) NOT NULL COMMENT '短信内容',
	`Status` SMALLINT(6) NOT NULL COMMENT '短信状态;1:待发送;2:发送中;3:成功;4:失败',
	PRIMARY KEY (`ID`),
	INDEX `Status` (`Status`)
)
ENGINE=InnoDB
;

有服务往该表中插入待发送的短信,然后通过服务A来发送,服务A是多实例部署,其发送逻辑如下:

1、在同一RR事务中执行:

begin;
select id from t_msg where status = 1 or status = 4 limit 10  for update;
-- for each id:
update t_msg set status = 2 where id = #{id};
commit;

2、并发执行:

// http调用短信运营商发送短信
if(发送成功){
	update t_msg set status = 3 where id = #{id};
}else{
	update t_msg set status = 4 where id = #{id};
}

可以开出已考虑到了多线程和多实例部署可能带来的线程安全问题,如果不通过select for update来加锁,可能多个实例会同时发送同一条短信,而通过select for update可以使得后查询的实例被阻塞;在将状态更新为2事务提交,则释放锁,然后其他实例会查询到新的记录,如果发送失败的则重试。

这样看似乎没啥问题,但是运行一段时间后,MySQL发现了死锁:

LATEST DETECTED DEADLOCK
------------------------
2020-09-07 14:54:37 0x7f059c0f5700
*** (1) TRANSACTION:
TRANSACTION 23429369, ACTIVE 35 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 922, OS thread handle 139662114862848, query id 9408791 localhost root Sending data
select * from t_msg where  status =1 or status=4  limit 1 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429369 lock_mode X waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 2; hex 8004; asc   ;;
 1: len 8; hex 0000000000000067; asc        g;;

*** (2) TRANSACTION:
TRANSACTION 23429368, ACTIVE 44 sec updating or deleting
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 1
MySQL thread id 928, OS thread handle 139662069815040, query id 9408798 localhost root updating
update t_msg set Status=2
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429368 lock_mode X
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 2; hex 8004; asc   ;;
 1: len 8; hex 0000000000000067; asc        g;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429368 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 2; hex 8003; asc   ;;
 1: len 8; hex 0000000000000065; asc        e;;

*** WE ROLL BACK TRANSACTION (1)

日志解读

事务1

在执行【select * from t_msg where status =1 or status=4 limit 1 for update】时等待索引Status上的锁:

 0: len 2; hex 8004; asc   ;;
 1: len 8; hex 0000000000000067; asc        g;;

即Status=4,id=103的行锁,这里的hex 0000000000000067 是16进制,可在linux系统中执行如下命令转换为10进制:

echo $((0x67))
103

事务2

在执行【update t_msg set status = 2 where id = 103】时,持有索引Status上的行锁:

 0: len 2; hex 8004; asc   ;;
 1: len 8; hex 0000000000000067; asc        g;;

即Status=4,id=103的行锁,同时等待索引Status上的间隙锁(gap before rec insert intention waiting):

 0: len 2; hex 8003; asc   ;;
 1: len 8; hex 0000000000000066; asc        f;;

即Status=3,id=102前的间隙锁。

让人不解的两个SQL均不涉及状态3,为什么事务2会等待状态3前的间隙锁呢?如果您能解释这个问题,那么您应该已经知道死锁发生的原因了。

场景复现

实验环境:

mysql> show variables like "%iso%";
+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| tx_isolation  | REPEATABLE-READ |
+---------------+-----------------+
mysql> desc test;
+-------+---------+------+-----+---------+----------------+
| Field | Type    | Null | Key | Default | Extra          |
+-------+---------+------+-----+---------+----------------+
| id    | int(11) | NO   | PRI | NULL    | auto_increment |
| value | int(11) | NO   | MUL | 0       |                |
+-------+---------+------+-----+---------+----------------+
mysql> select * from test;      
+----+-------+
| id | value |
+----+-------+
|  1 |     3 |
|  2 |     3 |
|  3 |     3 |
|  4 |     4 |
+----+-------+

开启两个Session。

会话1执行:

mysql> begin;select * from test where value=1 or value=4 for update;  
Query OK, 0 rows affected (0.00 sec)

+----+-------+
| id | value |
+----+-------+
|  4 |     4 |
+----+-------+

会话2执行:

mysql> select * from test where value=1 or value=4 for update; 
Query OK, 0 rows affected (0.00 sec)

可以发现会话2的事务被阻塞;

这时会话1执行:

mysql> update test set value=2 where id=4;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

然后发现会话2打印出现内容:

mysql> select * from test where value=1 or value=4 for update;      
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

这时执行show engine innodb status发现死锁日志:

LATEST DETECTED DEADLOCK
------------------------
2020-09-16 20:53:02 0x7f059c609700
*** (1) TRANSACTION:
TRANSACTION 23471780, ACTIVE 72 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 2056, OS thread handle 139662076204800, query id 10165403 localhost root Sending data
select * from test where value=1 or value=4 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471780 lock_mode X waiting
Record lock, heap no 6 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 4; hex 80000004; asc     ;;
 1: len 4; hex 80000004; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 23471779, ACTIVE 128 sec updating or deleting
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1136, 6 row lock(s), undo log entries 1
MySQL thread id 2057, OS thread handle 139662075139840, query id 10165405 localhost root updating
update test set value=2 where id=4
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471779 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 6 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 4; hex 80000004; asc     ;;
 1: len 4; hex 80000004; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471779 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 80000003; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** WE ROLL BACK TRANSACTION (1)

可以发现和生产告警是一样的死锁!

原理解析

总结来说是:间隙锁不互斥导致死锁!

会话1在执行 select * from test where value=1 or value=4 for update时,首先执行**select * from test where value = 1 ,这时因为value=1本身不存在,所以首先定位到 value = 3 的叶节点 ,此时对(无穷小, 3)加间隙锁,然后发现该节点不等于1则停止向右遍历,然后再执行select * from test where value = 4 **,这时定位到并返回 value=4 的叶节点。

会话2也执行上述select查询语句,同样也能在(无穷小, 3)加间隙锁(不互斥),然后会等待 value = 4 这个叶节点上的行锁

然后会话1 执行update status = 2 where id = 4,则需要在(无穷小, 3)插入value=2的叶节点,所以等待会话2持有的间隙锁

可以看到,会话1和会话2 互相等待!

备注:INNDB加锁首先定位到等于或者第一个大于目标值的叶子节点!

在这里插入图片描述

解决方案

select和update语句的事务隔离级别改为RC隔离级别。

实际上根本不需要RR隔离级别,因为我们想要的是锁住当前所有value=1和value=4的行,我们不需要在事务执行期间禁止其他事务插入value=1或value=4的行。

经验总结

之所以出现上述问题,还是因为在写SQL时未考虑需要的隔离级别,通常来说RC更符合人类思维习惯,只有真的需要防止其他事务插入时才需要RR。

问题思考

假设短信发送肯定不会失败,即不会出现status=4的行,还会发生死锁么?

参考答案:不会!

如果当前存在status=1的行,则加锁情况如下:

在这里插入图片描述

如果事务1先执行,则占有上面全部的锁,

然后事务2将只能占有(无穷小,1)间隙锁,等到1上的行锁,

因此事务1执行update status = 2,将向(1,2)或者(2,3)之间插入,不会遇到阻碍!

如果当前不存在status=1等行,则压根不会执行update。

  • 5
    点赞
  • 14
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值