Seata 全局锁冲突以及分支注册两次的问题解决记录

longchangfeng

已于 2023-05-16 18:46:46 修改

阅读量2.2k

点赞数 5

文章标签： mysql 数据库 java

于 2023-05-16 18:45:57 首次发布

本文链接：https://blog.csdn.net/longchangfeng/article/details/130711797

版权

Seata 全局锁冲突以及分支注册两次的问题解决记录

###问题发现

测试环境，调用接口超时错误，通过调用方的日志发现，事务执行插入记录时，seata获取全局锁失败, 然后立即翻seata-server的日志

日志如下

2023-04-20 14:01:26	
06:01:26.907  INFO --- [verHandlerThread_1_22_500] i.s.c.e.AbstractExceptionHandler         : this request cannot acquire global lock, you can let Seata retry by setting config [client.rm.lock.retryPolicyBranchRollbackOnConflict] = false or manually retry by yourself. request: xid=10.244.4.180:8091:2837568797668576195,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=undo_log:23534;user:606018;user_info:564480
2023-04-20 14:01:26	
06:01:26.906  INFO --- [verHandlerThread_1_22_500] i.s.s.s.db.lock.LockStoreDataBaseDAO     : Global lock batch acquire failed, xid 10.244.4.180:8091:2837568797668576195 branchId 2837568797668576206 pks [23534, 606018, 564480]

2023-04-20 14:01:26	
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'jdbc:mysql://xxxxxxx:xxxx/test^^^user^^^606018' for key 'PRIMARY'

通过日志发现，是seata server 在插入lock_table 记录行锁时，出现了row_key的主键冲突 606018 就是插入user表记录后生成的主键。随后seata server 不停的重试获取锁，但每次都会冲突，直致1分钟后事务超时（seata 事务默认时间是60s）进行回滚。

了解业务得知，此处是一个插入语句，按照理论讲，插入记录生成主键都是递增的，怎么可能出现冲突呢？

后边仔细看日志又发现，现有一个分支注册并获取了此行锁，然后另外一个分支又来注册获取这行锁，因此出了冲突。

2023-04-20 14:01:26	
06:01:26.903 ERROR --- [verHandlerThread_1_22_500] i.s.s.s.db.lock.LockStoreDataBaseDAO     : Global lock batch acquire error: Duplicate entry 'jdbc:mysql://xxxxxxx:xxxx/test^^^user^^^606018' for key 'PRIMARY'
2023-04-20 14:01:26	
,clientIp:10.244.4.1,vgroup:default_tx_group
2023-04-20 14:01:26	
06:01:26.898  INFO --- [     batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler         : SeataMergeMessage xid=10.244.4.180:8091:2837568797668576195,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=undo_log:23534;user:606018;user_info:564480
2023-04-20 14:01:26	
06:01:26.653  INFO --- [verHandlerThread_1_21_500] i.seata.server.coordinator.AbstractCore  : Register branch successfully, xid = 10.244.4.180:8091:2837568797668576195, branchId = 2837568797668576204, resourceId = jdbc:mysql://xxxxxxx:xxxx/test ,lockKeys = user:606018;user_info:564480
2023-04-20 14:01:26	
,clientIp:10.244.4.1,vgroup:default_tx_group
2023-04-20 14:01:26	
06:01:26.647  INFO --- [     batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler         : SeataMergeMessage xid=10.244.4.180:8091:2837568797668576195,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=user:606018;user_info:564480

所以起初排查的方向是怀疑业务使用不当造成的冲突，比如两个分支同时使用一行数据，但随后看代码得知，此处就是很简单的插入语句

随后，怀疑是某个版本的bug，因为seata github 上的bug修复记录十分频繁，因此尝试了升级版本。从1.5.2 升级到 1.6.1 之后，发现接口能调通了，虽然还是有偶尔的冲突异常，但是因为seata-server自己有重试机制，重试两三次就通过了。

为了查询更深层次的原因，翻看了1.5.2 到 1.6.1 的更新日录，发现了一个修改：

更改前：
   if (allBeforeImageEmpty()) {
            this.applicationData.put(SKIP_CHECK_LOCK, true);
   }
   
   更改后
 GlobalLockConfig globalLockConfig = GlobalLockConfigHolder.getCurrentGlobalLockConfig();
        // lock retry times > 1 & skip first check lock / before image is empty
        if ((globalLockConfig.getLockRetryTimes() == -1 || globalLockConfig.getLockRetryTimes() > 1)
            && (globalLockConfig.getLockStrategyMode() == LockStrategyMode.OPTIMISTIC || allBeforeImageEmpty())) {
            if (!applicationData.containsKey(SKIP_CHECK_LOCK)) {
                this.applicationData.put(SKIP_CHECK_LOCK, true);
            } else {
                this.applicationData.put(SKIP_CHECK_LOCK, false);
            }
        }

增删改只有插入插入操作的before image 为空，这里 allBeforeImageEmpty 意味着所有操作都是插入操作，则跳过锁检查

更新后的逻辑变为获取全局锁模式，如果重试次数为-1（无限重试）或者大于 1，并且锁模式是乐观锁或者全是插入操作，则判断如果是第一次插入，不检查锁，之后的操作全部检查锁。

那么检查锁会有什么操作呢？通过阅读代码，层层深入，我们看到了 LockStoreDataBaseDAO 是这么用的

@Override
    public boolean acquireLock(List<LockDO> lockDOs, boolean autoCommit, boolean skipCheckLock) {
        Connection conn = null;
        PreparedStatement ps = null;
        ResultSet rs = null;
        Set<String> dbExistedRowKeys = new HashSet<>();
        boolean originalAutoCommit = true;
        if (lockDOs.size() > 1) {
            lockDOs = lockDOs.stream().filter(LambdaUtils.distinctByKey(LockDO::getRowKey)).collect(Collectors.toList());
        }
        try {
            conn = lockStoreDataSource.getConnection();
            if (originalAutoCommit = conn.getAutoCommit()) {
                conn.setAutoCommit(false);
            }
            List<LockDO> unrepeatedLockDOs = lockDOs;

            //check lock
            if (!skipCheckLock) {
		   //如果不跳过检查锁，会走这
                boolean canLock = true;
                //拼装查询锁的sql，大概这样子 "select * from LOCK_TABLE_PLACE_HOLD where row_key in (?) order by status desc ";
                String checkLockSQL = LockStoreSqlFactory.getLogStoreSql(dbType).getCheckLockableSql(lockTable, lockDOs.size());
                ps = conn.prepareStatement(checkLockSQL);
                for (int i = 0; i < lockDOs.size(); i++) {
                    ps.setString(i + 1, lockDOs.get(i).getRowKey());
                }
                rs = ps.executeQuery();
                //当前的全局事务id
                String currentXID = lockDOs.get(0).getXid();
                boolean failFast = false;
                //如果查询到已经加锁
                while (rs.next()) {
                    判断表里已加的锁所属的事务 是否 与当前全局事务一致
                    String dbXID = rs.getString(ServerTableColumnsName.LOCK_TABLE_XID);
                    if (!StringUtils.equals(dbXID, currentXID)) {
                       //打印日志
                        if (LOGGER.isInfoEnabled()) {
                            String dbPk = rs.getString(ServerTableColumnsName.LOCK_TABLE_PK);
                            String dbTableName = rs.getString(ServerTableColumnsName.LOCK_TABLE_TABLE_NAME);
                            long dbBranchId = rs.getLong(ServerTableColumnsName.LOCK_TABLE_BRANCH_ID);
                            LOGGER.info("Global lock on [{}:{}] is holding by xid {} branchId {}", dbTableName, dbPk, dbXID, dbBranchId);
                        }
                        //非自动提交，如果已经加锁的状态是回滚，说明正在进行事务回滚，此时为了保证数据安全，不再重试，直接抛异常
                        if (!autoCommit) {
                            int status = rs.getInt(ServerTableColumnsName.LOCK_TABLE_STATUS);
                            if (status == LockStatus.Rollbacking.getCode()) {
                                failFast = true;
                            }
                        }
                        //不一致说明冲突，当前事务加锁失败
                        canLock = false;
                        break;
                    }
			 //如果事务一致，那么将锁的row_key 放入当前事务已经存在的锁里（一个事物有多行锁，通过dbExistedRowKeys 记录哪些锁已存在，哪些未加锁）
                    dbExistedRowKeys.add(rs.getString(ServerTableColumnsName.LOCK_TABLE_ROW_KEY));
                }
                //如果加锁失败，并且快速失败，则抛异常，否则返回加锁失败 false
                if (!canLock) {
                    conn.rollback();
                    if (failFast) {
                        throw new StoreException(new BranchTransactionException(LockKeyConflictFailFast));
                    }
                    return false;
                }
                
                // 如果锁已经在数据库存在了，那么从本次事务的lockDOs 中移除掉
                if (CollectionUtils.isNotEmpty(dbExistedRowKeys)) {
                    unrepeatedLockDOs = lockDOs.stream().filter(lockDO -> !dbExistedRowKeys.contains(lockDO.getRowKey()))
                            .collect(Collectors.toList());
                }
                //如果没有剩余，说明全部加锁成功，直接返回true
                if (CollectionUtils.isEmpty(unrepeatedLockDOs)) {
                    conn.rollback();
                    return true;
                }
            }

            // 以上是检查锁的逻辑，如果不检查锁，则直接将锁插入lock_table, 我们也就是在这抛出了异常
            if (unrepeatedLockDOs.size() == 1) {
            //单行加锁
                LockDO lockDO = unrepeatedLockDOs.get(0);
                if (!doAcquireLock(conn, lockDO)) {
                    if (LOGGER.isInfoEnabled()) {
                        LOGGER.info("Global lock acquire failed, xid {} branchId {} pk {}", lockDO.getXid(), lockDO.getBranchId(), lockDO.getPk());
                    }
                    conn.rollback();
                    return false;
                }
            } else {
            	//多行加锁
                if (!doAcquireLocks(conn, unrepeatedLockDOs)) {
                    if (LOGGER.isInfoEnabled()) {
                        LOGGER.info("Global lock batch acquire failed, xid {} branchId {} pks {}", unrepeatedLockDOs.get(0).getXid(),
                            unrepeatedLockDOs.get(0).getBranchId(), unrepeatedLockDOs.stream().map(lockDO -> lockDO.getPk()).collect(Collectors.toList()));
                    }
                    conn.rollback();
                    return false;
                }
            }
            conn.commit();
            return true;
        } catch (SQLException e) {
            throw new StoreException(e);
        } finally {
            IOUtil.close(rs, ps);
            if (conn != null) {
                try {
                    if (originalAutoCommit) {
                        conn.setAutoCommit(true);
                    }
                    conn.close();
                } catch (SQLException e) {
                }
            }
        }
    }

阅读代码得知，1.5默认如果只有插入操作，哪怕是多次插入，都不检查锁，所以插入会直接锁住行。而1.6只有首次插入不检查锁，后续插入或别的操作会判断是否有锁，是否一个全局事务，如果是则视为已加锁，否则已存在的锁不是一个事务就造成了锁冲突，从而加锁失败。

看似已经解决问题，偶尔看日志突然发现提交行锁中居然有undo_log 的行锁，仔细看发现，哪怕是插入一行记录，也会提交两次

2023-04-10 15:19:09 
07:19:09.016 INFO --- [verHandlerThread_1_12_500] i.seata.server.coordinator.AbstractCore : Register branch successfully, xid = 10.244.6.146:8091:1675707719303382163, branchId = 1675707719303390181, resourceId = jdbc:mysql://xxxxxxx:xxxx/test ,lockKeys = ent_enterprise_user:62651;undo_log:19493




2023-04-10 15:19:09 
,clientIp:10.244.6.1,vgroup:default_tx_group


2023-04-10 15:19:09 
07:19:09.012 INFO --- [ batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler : SeataMergeMessage xid=10.244.6.146:8091:1675707719303382163,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=ent_enterprise_user:62651;undo_log:19493


2023-04-10 15:19:09 
07:19:09.009 INFO --- [verHandlerThread_1_11_500] i.seata.server.coordinator.AbstractCore : Register branch successfully, xid = 10.244.6.146:8091:1675707719303382163, branchId = 1675707719303390179, resourceId = jdbc:mysql://xxxxxxx:xxxx/test ,lockKeys = ent_enterprise_user:62651


2023-04-10 15:19:09 
,clientIp:10.244.6.1,vgroup:default_tx_group


2023-04-10 15:19:09 
07:19:09.005 INFO --- [ batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler : SeataMergeMessage xid=10.244.6.146:8091:1675707719303382163,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=ent_enterprise_user:62651

很不正常，一般一个分支事务，注册一次就够了，为什么会注册两次，而且第二次还带了一个undo_log 的行锁。我不得已去翻了seata的源码，
发现注册实际上是本地事务提交时发生的。

在AbstractDMLBaseExecutor 中有如下代码

protected T executeAutoCommitTrue(Object[] args) throws Throwable {
        ConnectionProxy connectionProxy = statementProxy.getConnectionProxy();
        try {
            connectionProxy.changeAutoCommit();

            return new LockRetryPolicy(connectionProxy).execute(() -> {
                //执行本地sql，并生成undo_log（在内存中），根据undo_log 生成行锁
                T result = executeAutoCommitFalse(args);
                //向seata-server注册分支，并获取全局锁，undo_log插入本地库，并提交本地事务
                connectionProxy.commit();
                return result;
            });
        } catch (Exception e) {
            // when exception occur in finally,this exception will lost, so just print it here
            LOGGER.error("execute executeAutoCommitTrue error:{}", e.getMessage(), e);
            if (!LockRetryPolicy.isLockRetryPolicyBranchRollbackOnConflict()) {
                connectionProxy.getTargetConnection().rollback();
            }
            throw e;
        } finally {
            connectionProxy.getContext().reset();
            connectionProxy.setAutoCommit(true);
        }
    }

继续追踪 connectionProxy.commit();
最终在ConnectionProxy找到以下代码

  private void processGlobalTransactionCommit() throws SQLException {
        try {
            //发送给TC，注册分支并申请全局锁
            register();
        } catch (TransactionException e) {
            recognizeLockKeyConflictException(e, context.buildLockKeys());
        }
        try {
            //插入undo_log
            UndoLogManagerFactory.getUndoLogManager(this.getDbType()).flushUndoLogs(this);
            //本地事务提交
            targetConnection.commit();
        } catch (Throwable ex) {
            LOGGER.error("process connectionProxy commit error: {}", ex.getMessage(), ex);
            report(false);
            throw new SQLException(ex);
        }
        if (IS_REPORT_SUCCESS_ENABLE) {
            report(true);
        }
        context.reset();
    }

那什么情况下本地分支会提交两次呢？第二次还会带着undolog的行锁？

很明显，targetConnection 也是 ConnectionProxy ，所以才会注册两次。

为了证明这个，我做了个测试

@Component
public class TestService {
    @Autowired
    DataSource dataSource;

    @GlobalLock
    public void getDataSource() throws SQLException {
        Connection connection = dataSource.getConnection();
        System.out.println(connection.toString());
    }
}

通过debug看到

证实了我的猜想。

因为我们使用的是seata.enable-auto-data-source-proxy=true, seata自动将数据源进行代理，接下来我就去研究springboot 启动时seata的自动配置

@ConditionalOnBean(DataSource.class)
@ConditionalOnExpression("${seata.enabled:true} && ${seata.enableAutoDataSourceProxy:true} && ${seata.enable-auto-data-source-proxy:true}")
@AutoConfigureOrder(Ordered.LOWEST_PRECEDENCE)
@AutoConfigureAfter(value = {SeataCoreAutoConfiguration.class},
    name = "org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration")
public class SeataDataSourceAutoConfiguration {

    /**
     * The bean seataAutoDataSourceProxyCreator.
     */
    @Bean(BEAN_NAME_SEATA_AUTO_DATA_SOURCE_PROXY_CREATOR)
    @ConditionalOnMissingBean(SeataAutoDataSourceProxyCreator.class)
    public static SeataAutoDataSourceProxyCreator seataAutoDataSourceProxyCreator(SeataProperties seataProperties) {
        return new SeataAutoDataSourceProxyCreator(seataProperties.isUseJdkProxy(),
            seataProperties.getExcludesForAutoProxying(), seataProperties.getDataSourceProxyMode());
    }

}

可以看出是 SeataAutoDataSourceProxyCreator 对数据源做的代理,

protected Object wrapIfNecessary(Object bean, String beanName, Object cacheKey) {
        // 不是数据源不关心
        if (!(bean instanceof DataSource)) {
            return bean;
        }

        // 如果数据源没有被代理，我们创建一个代理，并放入到DataSourceProxyHolder中维护一个原始 与 代理之间的映射
        if (!(bean instanceof SeataDataSourceProxy)) {
            Object enhancer = super.wrapIfNecessary(bean, beanName, cacheKey);
            // 如果bean的类型是被指定不需要代理的 或者 之前已经代理过了，直接返回
            if (bean == enhancer) {
                return bean;
            }
            // 不是，则创建代理并维护映射关系
            DataSource origin = (DataSource) bean;
            SeataDataSourceProxy proxy = buildProxy(origin, dataSourceProxyMode);
            DataSourceProxyHolder.put(origin, proxy);
            return enhancer;
        }

        //如果你手动创建了SeataDataSourceProxy，那么为了避免你自己创建了代理并起数据源还开了自动代理，从而代理了两层，seata 帮你拆开判断一下，如果不是两层，再给你封上
        LOGGER.warn("Manually register SeataDataSourceProxy(or its subclass) bean is discouraged! bean name: {}", beanName);
        SeataDataSourceProxy proxy = (SeataDataSourceProxy) bean;
        DataSource origin = proxy.getTargetDataSource();
        Object originEnhancer = super.wrapIfNecessary(origin, beanName, cacheKey);
        // this mean origin is either excluded by user or had been proxy before
        if (origin == originEnhancer) {
            return origin;
        }
        // else, put <origin, proxy> to holder and return originEnhancer
        DataSourceProxyHolder.put(origin, proxy);
        return originEnhancer;
    }

看到这，我突然想起了我们的读写库数据源，是一个两层的结构

@Configuration
@EnableAutoConfiguration(exclude = {DataSourceAutoConfiguration.class})
public class DataSourceConfig {

    @Bean(DB.MASTER)
    @ConfigurationProperties("spring.datasource.druid.master")
    public DataSource dataSourceMaster() {
        return DruidDataSourceBuilder.create().build();
    }

    @Bean(DB.SLAVE)
    @ConfigurationProperties("spring.datasource.druid.slave")
    public DataSource dataSourceSlave() {
        return DruidDataSourceBuilder.create().build();
    }

    @Primary
    @Bean("dataSourceDynamicRouting")
    public DataSource dataSourceDynamicRouting() {
        return new DynamicRoutingDataSource(dataSourceMaster(), dataSourceSlave());
    }
}

很明显，是master 与 slave 创建时被包了一层代理，而dataSourceDynamicRouting 又被包了一层代理，从而datasource 在 getConnection时获得了一个两层的ConnectionProxy

知道了原因，解决也比较方便，seata.excludes-for-auto-proxying=com.alibaba.druid.spring.boot.autoconfigure.DruidDataSourceWrapper
这个配置可以指定某个类型的数据源不被封装，可以帮我们解除一层代理

随后，测试结果

数据源正常

023-04-21 13:50:36 
05:50:36.794 INFO --- [ AsyncCommitting_1_1] io.seata.server.coordinator.DefaultCore : Committing global transaction is successfully done, xid = 10.244.0.213:8091:3990517555649138164.




2023-04-21 13:50:36 
05:50:36.780 INFO --- [ AsyncCommitting_1_1] io.seata.core.rpc.netty.ChannelManager : Choose [id: 0x0fbaad2f, L:/10.244.6.146:8091 - R:/10.244.6.1:51762] on the same application[test-user-service] as alternative of test-user-service:10.244.0.1:58008


2023-04-21 13:50:36 
05:50:36.745 INFO --- [ AsyncCommitting_1_1] io.seata.core.rpc.netty.ChannelManager : Choose [id: 0xe5041be9, L:/10.244.6.146:8091 - R:/10.244.6.1:52190] on the same application[test-enterprise-service] as alternative of test-enterprise-service:10.244.0.1:38770


2023-04-21 13:50:36 
05:50:36.534 INFO --- [ batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler : xid=10.244.0.213:8091:3990517555649138164,extraData=null,clientIp:10.244.0.1,vgroup:default_tx_group


2023-04-21 13:50:36 
05:50:36.424 INFO --- [verHandlerThread_1_19_500] i.seata.server.coordinator.AbstractCore : Register branch successfully, xid = 10.244.0.213:8091:3990517555649138164, branchId = 3990517555649138171, resourceId = jdbc:mysql://xxxxxxx:xxxx/test ,lockKeys = user:606019;user_info:564481


2023-04-21 13:50:36 
,clientIp:10.244.0.1,vgroup:default_tx_group


2023-04-21 13:50:36 
05:50:36.420 INFO --- [ batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler : SeataMergeMessage xid=10.244.0.213:8091:3990517555649138164,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=user:606019;user_info:564481


2023-04-21 13:50:36 
05:50:36.105 INFO --- [verHandlerThread_1_18_500] i.seata.server.coordinator.AbstractCore : Register branch successfully, xid = 10.244.0.213:8091:3990517555649138164, branchId = 3990517555649138168, resourceId = jdbc:mysql://xxxxxxx:xxxx/test ,lockKeys = ent_enterprise_user:65221


2023-04-21 13:50:36 
,clientIp:10.244.0.1,vgroup:default_tx_group


2023-04-21 13:50:36 
05:50:36.013 INFO --- [ batchLoggerPrint_1_1] i.s.c.r.p.server.BatchLogHandler : SeataMergeMessage xid=10.244.0.213:8091:3990517555649138164,branchType=AT,resourceId=jdbc:mysql://xxxxxxx:xxxx/test,lockKey=ent_enterprise_user:65221


2023-04-21 13:50:33 
05:50:33.531 INFO --- [verHandlerThread_1_17_500] i.s.s.coordinator.DefaultCoordinator : Begin new global transaction applicationId: test-user-service,transactionServiceGroup: default_tx_group, transactionName: insertBatch(java.util.List),timeout:120000,xid:10.244.0.213:8091:3990517555649138164

日志也正常了，至此问题解决

此间还有一个小插曲， DruidDataSourceBuilder.create().build() 生成的是DruidDataSourceWrapper 而不是DruidDataSource，这个踩了个坑。。。