记一次项目上Mysql死锁问题的定位

最近项目上接了一个需求,处理系统中出现的慢SQL,排查过程中通过druid的监控看到了系统中不少死锁的报错提示,即

Deadlock found when trying to get lock; try restarting transaction

  1. 首先根据报错信息行数定位到具体代码报错位置,固定到service层,可以看到如下代码
  2. CTReportDaily dbCTReportDaily = ctReportDailyDao.getByCtAndShopAndDay(bloc, project,
            collectionTerminalNo, "-", tradeBalanceTime);
        if (dbCTReportDaily == null) {
          CTReportDaily ctReportDaily = new CTReportDaily();
          ctReportDaily.setUuid(UUID.randomUUID().toString());
          ctReportDaily.setBloc(bloc);
          ctReportDaily.setProject(project);
          ...
          ctReportDailyDao.insert(ctReportDaily);
          LogUtil.info(logger,
              MainEvent.fetchMainEvent(bloc, project, collectionTerminalNo, EventStage.T41));
        } else {
          dbCTReportDaily.setInitTradeCount(initTradeCount);
          ctReportDailyDao.updateyTicketData(dbCTReportDaily);
          LogUtil.info(logger,
              MainEvent.fetchMainEvent(bloc, project, collectionTerminalNo, EventStage.T41));
        }

    简单理解为按条件查询 CTReportDaily表记录,查询不到新增,查询到则更新。此处getByCtAndShopAndDay()查询已走唯一键索引,具体报错在 ctReportDaily.insert(ctReportDaily);这一行报死锁错误

  3. 开始简单的以为查询没有锁住行,导致并发更新该行而死锁,于是在 getByCtAndShopAndDay()查询中增加 for update,使其由快照读变成当前读,即加上一个排它锁,其他事务不允许读不允许写。但是加上后经过压测发现死锁问题并没有解决。。。
  4. 仔细研究日志后发现,另外一处报死锁的日志,定位到代码如下
  5. CTReportDaily dbCTReportDaily = ctReportDailyDao.getByCtAndShopAndDay(bloc, project,
              collectionTerminalNo, "-", printDate);
          if (dbCTReportDaily == null) {
            CTReportDaily ctReportDaily = new CTReportDaily();
            ctReportDaily.setUuid(UUID.randomUUID().toString());
            ctReportDaily.setBloc(bloc);
            ctReportDaily.setProject(project);
            ...
            ctReportDailyDao.insert(ctReportDaily);
            LogUtil.info(logger,
                MainEvent.fetchMainEvent(bloc, project, collectionTerminalNo, EventStage.T40));
          } else {
            if (cTReportTicket.getTicketCount() != null) {
              dbCTReportDaily.setTicketCount(cTReportTicket.getTicketCount());
            }
            ctReportDailyDao.updateByTicket(dbCTReportDaily);
            LogUtil.info(logger,
                MainEvent.fetchMainEvent(bloc, project, collectionTerminalNo, EventStage.T40));
          }

    其实此块代码和上一块代码逻辑几乎一致,只是在不同的事务中,由不同的消费端去调用的。日志中可以清楚地看到一个是T40,另一个是T41

  6. 此时察觉到死锁可能不是发生在update 的时候,而是发生在 insert 的时候,为了验证自己的猜想,开启Mysql的标准监控和锁监控,拿出死锁日志,一探究竟!
  7. 
    -- 开启标准监控
    set GLOBAL innodb_status_output=ON;
     
    -- 开启锁监控
    set GLOBAL innodb_status_output_locks=ON;
    
    -- 输出详细死锁信息
    set GLOBAL innodb_print_all_deadlocks=ON;
    
    -- 查看死锁日志
    show engine innodb status
    

    设置好上面三个开关后,再次进行压测,执行 show engine innodb status ,拿出 死锁日志,我们一起来分析,日志如下(只截取部分):

  8. ------------------------
    LATEST DETECTED DEADLOCK
    ------------------------
    2020-11-18 10:58:07 7f1a205af700
    *** (1) TRANSACTION:
    TRANSACTION 517441311, ACTIVE 0 sec inserting
    mysql tables in use 1, locked 1
    LOCK WAIT 3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
    MySQL thread id 119, OS thread handle 0x7f1a2056e700, query id 4027136 172.17.0.1 alps update
    INSERT INTO alps.ctreportdaily
    (uuid, bloc, project, collectionTerminalNo, shopUuid, notSaleBillCount, ticketCount, tradeAmount, tradeCount, initTradeCount, offNetTimes, day, isETL)
    VALUES ('dfd176d7-c1f6-4841-82c5-6672f1556972','0000','0002','0175','-',0,1,0,0,0,0,'2020-09-28 12:57:16',0)
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441311 lock_mode X locks gap before rec insert intention waiting
    Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
     0: len 4; hex 30303030; asc 0000;;
     1: len 4; hex 30303032; asc 0002;;
     2: len 4; hex 30313735; asc 0175;;
     3: len 1; hex 2d; asc -;;
     4: len 3; hex 8fc942; asc   B;;
     5: len 1; hex 80; asc  ;;
     6: len 30; hex 63646236386164662d623761662d343234372d613033382d646238363035; asc cdb68adf-b7af-4247-a038-db8605; (total 36 bytes);

    *** (2) TRANSACTION:
    TRANSACTION 517441309, ACTIVE 0 sec inserting
    mysql tables in use 1, locked 1
    3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
    MySQL thread id 118, OS thread handle 0x7f1a205af700, query id 4027143 172.17.0.1 alps update
    INSERT INTO alps.ctreportdaily
    (uuid, bloc, project, collectionTerminalNo, shopUuid, notSaleBillCount, ticketCount, tradeAmount, tradeCount, initTradeCount, offNetTimes, day, isETL)
    VALUES ('6a0aaf4d-28ce-4f61-9d91-b14ca156bb54','0000','0002','0175','-',0,0,0,0,1,0,'2020-09-28 12:57:16',0)
    *** (2) HOLDS THE LOCK(S):
    RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441309 lock_mode X locks gap before rec
    Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
     0: len 4; hex 30303030; asc 0000;;
     1: len 4; hex 30303032; asc 0002;;
     2: len 4; hex 30313735; asc 0175;;
     3: len 1; hex 2d; asc -;;
     4: len 3; hex 8fc942; asc   B;;
     5: len 1; hex 80; asc  ;;
     6: len 30; hex 63646236386164662d623761662d343234372d613033382d646238363035; asc cdb68adf-b7af-4247-a038-db8605; (total 36 bytes);

    *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
    RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441309 lock_mode X locks gap before rec insert intention waiting
    Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
     0: len 4; hex 30303030; asc 0000;;
     1: len 4; hex 30303032; asc 0002;;
     2: len 4; hex 30313735; asc 0175;;
     3: len 1; hex 2d; asc -;;
     4: len 3; hex 8fc942; asc   B;;
     5: len 1; hex 80; asc  ;;
     6: len 30; hex 63646236386164662d623761662d343234372d613033382d646238363035; asc cdb68adf-b7af-4247-a038-db8605; (total 36 bytes);

    *** WE ROLL BACK TRANSACTION (2)

    1. 首先看事务一的信息:

      *** (1) TRANSACTION:
      TRANSACTION 517441311, ACTIVE 0 sec inserting

      ACTIVE 0 sec 表示事务活动时间,inserting 为事务当前正在运行的状态,可能的事务状态有:fetching rows,updating,deleting,inserting 等

    2. mysql tables in use 1, locked 1
      LOCK WAIT 3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1

      tables in use 1 表示有一个表被使用,locked 1 表示有一个表锁。LOCK WAIT 表示事务正在等待锁。2 row lock(s) 表示当前事务持有的行锁个数,undo log entries 1 表示当前事务有 1 个 undo log 记录

    3. INSERT INTO alps.ctreportdaily ......

      此处显示的是正在等待锁的SQL语句,下面的信息很重要

    4. *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441311 lock_mode X locks gap before rec insert intention waiting
      Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0

      这里显示的是事务正在等待什么锁。RECORD LOCKS 表示记录锁(并且可以看出要加锁的索引为 uk_1),lock_mode X locks gap before rec insert intention waiting 表示等待的是 插入意向锁。
      Mysql一共有四种类型的行锁:记录锁、间隙锁、Next-Key锁、插入意向锁。这四种锁对应的死锁日志各不相同,如下:

      1. 记录锁(LOCK_REC_NOT_GAP): lock_mode X locks rec but not gap
      2. 间隙锁(LOCK_GAP): lock_mode X locks gap before rec
      3. Next-key 锁(LOCK_ORNIDARY): lock_mode X
      4. 插入意向锁(LOCK_INSERT_INTENTION): lock_mode X locks gap before rec insert intention

      目前第一个事务告诉我们,在执行insert的时候,等待的是插入意向锁。接下来继续分析第二个事务:

    5. *** (2) TRANSACTION:
      TRANSACTION 517441309, ACTIVE 0 sec inserting
      mysql tables in use 1, locked 1
      3 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
      MySQL thread id 118, OS thread handle 0x7f1a205af700, query id 4027143 172.17.0.1 alps update
      INSERT INTO alps.ctreportdaily ...

      *** (2) HOLDS THE LOCK(S):
      RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441309 lock_mode X locks gap before rec
      Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
       

      *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 140 page no 231 n bits 320 index `uk_1` of table `alps`.`ctreportdaily` trx id 517441309 lock_mode X locks gap before rec insert intention waiting
      Record lock, heap no 206 PHYSICAL RECORD: n_fields 7; compact format; info bits 0

      可以看出,事务二和事务一的日志基本类似。事务二等待的也是插入意向锁,但是我们知道插入意向锁之间并不互斥,不过事务二日志多出了一部分 HOLDS THE LOCK(S),表示事务二持有了什么锁。这个锁往往就是导致事务一处于锁等待的原因!
      这里可以看到事务二正在等待索引 uk_1 上的插入意向锁,并且它已经持有了一个 间隙锁

    6. 到这里为止,似乎死锁原因已经定位到了。我们知道,Mysql在 REPEATABLE-READ 隔离级别下,使用间隙锁可以防止其他事务在这个范围内插入或修改记录,保证两次读取这个范围内的记录不会变,从而不会出现幻读现象。比如当根据主键查询一条记录,如果查询的记录不存在,其实也会加锁,此时加的就是间隙锁。接下来我们来说说插入意向锁。插入意向锁是一种特殊的间隙锁。这个锁表示插入的意向,只有在 INSERT 的时候才会有这个锁。插入意向锁只会和间隙锁或 Next-key 锁冲突。Mysql由此来避免Insert的时候幻读

  9. 通过死锁日志的分析,结合Mysql锁冲突的知识,我们定位到了问题所在。总结如下:在本例中,T40和T41是不同的事务,两个事务都在做同一件事,即根据主键查询记录,查询不到则新增一条新记录。当高并发下,事务一查询不到记录,在索引上加上间隙锁,此时事务二在insert,在索引上加上插入意向锁。由于两个锁冲突。导致死锁发生!

  10. 解决办法:找到问题后的解决办法暂时列了如下几种,如有更好的解决办法欢迎评论!

    1. 由于此处两个事务做的是同一件事,但是是由不同的Mq消费端所触发。通过减少一个消费端触发保证只有一个事务在操作。此方案治标不治本,当以后多节点部署时,同样会出现多消费端不同事务的问题。

    2. 本次死锁的关键原因其实是在于 间隙锁,如果我们可以将事务的隔离级别降低,设置成 RC 隔离级别,则不会使用间隙锁,也就不会出现此死锁问题。

    3. 通过Mysql触发器来操作,定时向表插入数据,保证查询时查到的都是有值 的记录。此方案也不建议,触发器过多会拖累数据库负担,且不方便日后问题排查。

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值