背景
有一个短信表:
CREATE TABLE `t_msg` (
`ID` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`Mobile` VARCHAR(20) NOT NULL COMMENT '手机号',
`Content` VARCHAR(1000) NOT NULL COMMENT '短信内容',
`Status` SMALLINT(6) NOT NULL COMMENT '短信状态;1:待发送;2:发送中;3:成功;4:失败',
PRIMARY KEY (`ID`),
INDEX `Status` (`Status`)
)
ENGINE=InnoDB
;
有服务往该表中插入待发送的短信,然后通过服务A来发送,服务A是多实例部署,其发送逻辑如下:
1、在同一RR事务中执行:
begin;
select id from t_msg where status = 1 or status = 4 limit 10 for update;
-- for each id:
update t_msg set status = 2 where id = #{id};
commit;
2、并发执行:
// http调用短信运营商发送短信
if(发送成功){
update t_msg set status = 3 where id = #{id};
}else{
update t_msg set status = 4 where id = #{id};
}
可以开出已考虑到了多线程和多实例部署可能带来的线程安全问题,如果不通过select for update来加锁,可能多个实例会同时发送同一条短信,而通过select for update可以使得后查询的实例被阻塞;在将状态更新为2事务提交,则释放锁,然后其他实例会查询到新的记录,如果发送失败的则重试。
这样看似乎没啥问题,但是运行一段时间后,MySQL发现了死锁:
LATEST DETECTED DEADLOCK
------------------------
2020-09-07 14:54:37 0x7f059c0f5700
*** (1) TRANSACTION:
TRANSACTION 23429369, ACTIVE 35 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 922, OS thread handle 139662114862848, query id 9408791 localhost root Sending data
select * from t_msg where status =1 or status=4 limit 1 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429369 lock_mode X waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 2; hex 8004; asc ;;
1: len 8; hex 0000000000000067; asc g;;
*** (2) TRANSACTION:
TRANSACTION 23429368, ACTIVE 44 sec updating or deleting
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 1
MySQL thread id 928, OS thread handle 139662069815040, query id 9408798 localhost root updating
update t_msg set Status=2
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429368 lock_mode X
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 2; hex 8004; asc ;;
1: len 8; hex 0000000000000067; asc g;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108669 page no 4 n bits 80 index Status of table `sms`.`t_msg` trx id 23429368 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
0: len 2; hex 8003; asc ;;
1: len 8; hex 0000000000000065; asc e;;
*** WE ROLL BACK TRANSACTION (1)
日志解读
事务1
在执行【select * from t_msg where status =1 or status=4 limit 1 for update】时等待索引Status上的锁:
0: len 2; hex 8004; asc ;;
1: len 8; hex 0000000000000067; asc g;;
即Status=4,id=103的行锁,这里的hex 0000000000000067 是16进制,可在linux系统中执行如下命令转换为10进制:
echo $((0x67))
103
事务2
在执行【update t_msg set status = 2 where id = 103】时,持有索引Status上的行锁:
0: len 2; hex 8004; asc ;;
1: len 8; hex 0000000000000067; asc g;;
即Status=4,id=103的行锁,同时等待索引Status上的间隙锁(gap before rec insert intention waiting):
0: len 2; hex 8003; asc ;;
1: len 8; hex 0000000000000066; asc f;;
即Status=3,id=102前的间隙锁。
让人不解的两个SQL均不涉及状态3,为什么事务2会等待状态3前的间隙锁呢?如果您能解释这个问题,那么您应该已经知道死锁发生的原因了。
场景复现
实验环境:
mysql> show variables like "%iso%";
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| tx_isolation | REPEATABLE-READ |
+---------------+-----------------+
mysql> desc test;
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| value | int(11) | NO | MUL | 0 | |
+-------+---------+------+-----+---------+----------------+
mysql> select * from test;
+----+-------+
| id | value |
+----+-------+
| 1 | 3 |
| 2 | 3 |
| 3 | 3 |
| 4 | 4 |
+----+-------+
开启两个Session。
会话1执行:
mysql> begin;select * from test where value=1 or value=4 for update;
Query OK, 0 rows affected (0.00 sec)
+----+-------+
| id | value |
+----+-------+
| 4 | 4 |
+----+-------+
会话2执行:
mysql> select * from test where value=1 or value=4 for update;
Query OK, 0 rows affected (0.00 sec)
可以发现会话2的事务被阻塞;
这时会话1执行:
mysql> update test set value=2 where id=4;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
然后发现会话2打印出现内容:
mysql> select * from test where value=1 or value=4 for update;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
这时执行show engine innodb status发现死锁日志:
LATEST DETECTED DEADLOCK
------------------------
2020-09-16 20:53:02 0x7f059c609700
*** (1) TRANSACTION:
TRANSACTION 23471780, ACTIVE 72 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 2056, OS thread handle 139662076204800, query id 10165403 localhost root Sending data
select * from test where value=1 or value=4 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471780 lock_mode X waiting
Record lock, heap no 6 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
0: len 4; hex 80000004; asc ;;
1: len 4; hex 80000004; asc ;;
*** (2) TRANSACTION:
TRANSACTION 23471779, ACTIVE 128 sec updating or deleting
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1136, 6 row lock(s), undo log entries 1
MySQL thread id 2057, OS thread handle 139662075139840, query id 10165405 localhost root updating
update test set value=2 where id=4
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471779 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
Record lock, heap no 6 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
0: len 4; hex 80000004; asc ;;
1: len 4; hex 80000004; asc ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 108664 page no 4 n bits 80 index value of table `test`.`test` trx id 23471779 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 4; hex 80000003; asc ;;
1: len 4; hex 80000001; asc ;;
*** WE ROLL BACK TRANSACTION (1)
可以发现和生产告警是一样的死锁!
原理解析
总结来说是:间隙锁不互斥导致死锁!
会话1在执行 select * from test where value=1 or value=4 for update时,首先执行**select * from test where value = 1 ,这时因为value=1本身不存在,所以首先定位到 value = 3 的叶节点 ,此时对(无穷小, 3)加间隙锁,然后发现该节点不等于1则停止向右遍历,然后再执行select * from test where value = 4 **,这时定位到并返回 value=4 的叶节点。
会话2也执行上述select查询语句,同样也能在(无穷小, 3)加间隙锁(不互斥),然后会等待 value = 4 这个叶节点上的行锁。
然后会话1 执行update status = 2 where id = 4,则需要在(无穷小, 3)插入value=2的叶节点,所以等待会话2持有的间隙锁!
可以看到,会话1和会话2 互相等待!
备注:INNDB加锁首先定位到等于或者第一个大于目标值的叶子节点!
解决方案
select和update语句的事务隔离级别改为RC隔离级别。
实际上根本不需要RR隔离级别,因为我们想要的是锁住当前所有value=1和value=4的行,我们不需要在事务执行期间禁止其他事务插入value=1或value=4的行。
经验总结
之所以出现上述问题,还是因为在写SQL时未考虑需要的隔离级别,通常来说RC更符合人类思维习惯,只有真的需要防止其他事务插入时才需要RR。
问题思考
假设短信发送肯定不会失败,即不会出现status=4的行,还会发生死锁么?
参考答案:不会!
如果当前存在status=1的行,则加锁情况如下:
如果事务1先执行,则占有上面全部的锁,
然后事务2将只能占有(无穷小,1)间隙锁,等到1上的行锁,
因此事务1执行update status = 2,将向(1,2)或者(2,3)之间插入,不会遇到阻碍!
如果当前不存在status=1等行,则压根不会执行update。