ShardingSphere XA事务无法恢复问题定位

ShardingSphere XA事务无法恢复问题定位

问题背景

在测试 ShardingSphere-Proxy 使用达梦数据库开启 XA 事务进行压测期间,kill 掉达梦数据库,等一段时间再重启达梦数据库,最终未完成的 XA 事务无法恢复,数据产生不一致。
查看达梦上的 xid 一直有残留,没有被处理。
使用 ShardingSphere-Proxy 结合其他主流数据库进行测试时无该问题。

通过日志可以看到在压测期间,挂掉达梦数据库之后,有大量连接报错。节选错误日志如下:
比如下面这段日志,在执行 xa end 操作时,因为达梦数据库连接不上,达梦驱动报错如下:

[WARN ] 2024-01-29 11:11:19.867 [Connection-82-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016041: prepare on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a4e, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a54, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@169b2776) failed with exception ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication  error
	at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
	at dm.jdbc.a.a.a(DBAccess.java:1315)
	at dm.jdbc.driver.DmdbXAResource.end(DmdbXAResource.java:104)
	at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.end(SingleXAResource.java:51)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.doEnd(TransactionImple.java:1087)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.endAssociation(TransactionImple.java:1058)
	at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.endAssociation(XAResourceRecord.java:1295)
	at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelPrepare(XAResourceRecord.java:210)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2678)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2628)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.prepare(BasicAction.java:2162)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1508)
	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
	at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
	at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
	at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
	at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

下面这段日志,在执行 xa prepare 操作时,因为数据库连接不上,达梦驱动报错如下:

[WARN ] 2024-01-29 11:11:19.860 [Connection-85-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016041: prepare on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5af8, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5b14, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@4ce53a0a) failed with exception ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication  error
	at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
	at dm.jdbc.a.a.a(DBAccess.java:1315)
	at dm.jdbc.driver.DmdbXAResource.prepare(DmdbXAResource.java:174)
	at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.prepare(SingleXAResource.java:73)
	at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelPrepare(XAResourceRecord.java:214)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2678)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2628)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.prepare(BasicAction.java:2162)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1508)
	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
	at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
	at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
	at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
	at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

xa commit 阶段,因为数据库连接不上,达梦驱动报错如下:

[WARN ] 2024-01-29 11:11:19.865 [Connection-56-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016036: commit on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:58be, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:58c8, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@576dcbef) failed with exception $ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication  error
	at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
	at dm.jdbc.a.a.a(DBAccess.java:1315)
	at dm.jdbc.driver.DmdbXAResource.commit(DmdbXAResource.java:88)
	at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.commit(SingleXAResource.java:44)
	at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelCommit(XAResourceRecord.java:473)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2906)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2822)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.phase2Commit(BasicAction.java:1878)
	at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1534)
	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
	at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
	at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
	at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
	at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
	at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
	at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

持续观察日志,能够看到 xa recover 查询出来的 xid 在不断增多。

[INFO ] 2024-01-29 11:11:53.159 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 115174042, recover, result:[dm.jdbc.driver.DmdbXid@6466d28f, dm.jdbc.driver.DmdbXid@2179218b, dm.jdbc.driver.DmdbXid@7cd785b1, dm.jdbc.driver.DmdbXid@2f889463, dm.jdbc.driver.DmdbXid@7da8d9b7, dm.jdbc.driver.DmdbXid@1535daf0, dm.jdbc.driver.DmdbXid@550f3fea, dm.jdbc.driver.DmdbXid@3ecf92a4, dm.jdbc.driver.DmdbXid@59181ab, dm.jdbc.driver.DmdbXid@3e4829b3, dm.jdbc.driver.DmdbXid@1294a1ff, dm.jdbc.driver.DmdbXid@500c5bbb, dm.jdbc.driver.DmdbXid@599914c8, dm.jdbc.driver.DmdbXid@2663abf9, dm.jdbc.driver.DmdbXid@11cf7d93, dm.jdbc.driver.DmdbXid@232c3d92, dm.jdbc.driver.DmdbXid@3282426b, dm.jdbc.driver.DmdbXid@9051ba1, dm.jdbc.driver.DmdbXid@63ccccf7, dm.jdbc.driver.DmdbXid@21ceadb1, dm.jdbc.driver.DmdbXid@17f32f9, dm.jdbc.driver.DmdbXid@2006e964, dm.jdbc.driver.DmdbXid@25a832e7, dm.jdbc.driver.DmdbXid@71dfb0c6, dm.jdbc.driver.DmdbXid@94fbcfe, dm.jdbc.driver.DmdbXid@1e15297a, dm.jdbc.driver.DmdbXid@7a4e0f55, dm.jdbc.driver.DmdbXid@761bc95a, dm.jdbc.driver.DmdbXid@247332b, dm.jdbc.driver.DmdbXid@352e94e0, dm.jdbc.driver.DmdbXid@3f9001b7, dm.jdbc.driver.DmdbXid@17e95e06, dm.jdbc.driver.DmdbXid@2c0cf9bb, dm.jdbc.driver.DmdbXid@6dc4f14e, dm.jdbc.driver.DmdbXid@5266c8f8, dm.jdbc.driver.DmdbXid@fa2a5dc, dm.jdbc.driver.DmdbXid@44a844b2, dm.jdbc.driver.DmdbXid@e2c93e8, dm.jdbc.driver.DmdbXid@6ae9bbdd, dm.jdbc.driver.DmdbXid@707f6f19, dm.jdbc.driver.DmdbXid@76742870, dm.jdbc.driver.DmdbXid@1cad4685, dm.jdbc.driver.DmdbXid@3b3d219e, dm.jdbc.driver.DmdbXid@30319647, dm.jdbc.driver.DmdbXid@67d11f99, dm.jdbc.driver.DmdbXid@113ee488, dm.jdbc.driver.DmdbXid@435d06e0, dm.jdbc.driver.DmdbXid@7fe22362, dm.jdbc.driver.DmdbXid@1020d6d0, dm.jdbc.driver.DmdbXid@76dc7af8, dm.jdbc.driver.DmdbXid@5d1d18c5, dm.jdbc.driver.DmdbXid@338ac937, dm.jdbc.driver.DmdbXid@258a1fc0, dm.jdbc.driver.DmdbXid@3014aa33, dm.jdbc.driver.DmdbXid@5fa57c4, dm.jdbc.driver.DmdbXid@60dcf53d, dm.jdbc.driver.DmdbXid@15dd92f9, dm.jdbc.driver.DmdbXid@8de743, dm.jdbc.driver.DmdbXid@579ed4e1, dm.jdbc.driver.DmdbXid@506cb632, dm.jdbc.driver.DmdbXid@29dfbc9c, dm.jdbc.driver.DmdbXid@4607f11b, dm.jdbc.driver.DmdbXid@679f1ce6, dm.jdbc.driver.DmdbXid@52f7fb01], delegate:class dm.jdbc.driver.DmdbXAResource

日志中没有看到期望的二阶段恢复操作,可疑的异常日志如下:Transaction 0:ffff7f000001:926b:65b7167e:5a65 restored heuristic participant

2024-01-29 14:09:25.984 [Periodic Recovery] WARN  com.arjuna.ats.jta - ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.984 [Periodic Recovery] WARN  com.arjuna.ats.jta - XAResourceRecord restored heuristic instance: XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.985 [Periodic Recovery] TRACE com.arjuna.ats.arjuna - RecordList::insert(RecordList: empty) : appending /StateManager/AbstractRecord/XAResourceRecord for 0:ffff7f000001:926b:65b7167e:5a77
2024-01-29 14:09:25.985 [Periodic Recovery] WARN  com.arjuna.ats.arjuna - Transaction 0:ffff7f000001:926b:65b7167e:5a65 restored heuristic participant XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.985 [Periodic Recovery] TRACE com.arjuna.ats.arjuna - HeuristicList - Unpacked a 463 record

所以现在的问题就是 ShardingSphere-Proxy 能够查询出来 dameng 上没完成的事务,但是没有执行二阶段 xa commit 或 xa rollback 操作,具体原因需要继续分析。

同样对比 Oracle 数据库,在 XA 事务压测期间 kill 掉 Oracle,后续再重启 Oracle 事务是可以正常恢复的,最终数据一致。日志如下:

[INFO ] 2024-01-29 11:04:29.279 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 497281350, recover, result:[oracle.jdbc.xa.OracleXid@20b27], delegate:class oracle.jdbc.driver.T4CXAResource

[INFO ] 2024-01-29 11:04:30.285 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 497281350, rollback, xid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:a7f4:65b7143a:351b, node_name=15774b67-7ca7-4b1b-90ee-e18fec83225b, branch_uid=0:ffff7f000001:a7f4:65b7143a:351c, subordinatenodename=null, eis_name=0 >

问题定位

本地搭建环境尝试复现问题

前置准备:
1.执行脚本清理 dameng 和 mysql 数据库上残留的 xid 和锁,清理数据。
2.proxy 配置。

本地在 ShardingSphere-Proxy 执行 xa commit 之前,调用 shell 自动 kill 掉数据库,模拟二阶段 commit 失败,后续查看事务恢复流程。

public final class SingleXAResource implements XAResource {

    private final String resourceName;

    private final XAResource delegate;

    private static AtomicBoolean killed = new AtomicBoolean();

    @Override
    public void commit(final Xid xid, final boolean onePhase) throws XAException {
        count.getAndIncrement();
        // 在执行 xa commit 之前自动 kill 掉数据库实例
        if (killed.compareAndSet(false, true)) {
            ShellUtil.callShellByExec("docker stop mysqlv8-14310");
        }
        delegate.commit(xid, onePhase);
    }
    // ...
}

后续本地 debug 一直发现达梦数据库挂掉和 mysql 数据库挂掉后,两个后续执行的逻辑不一样。

mysql 的话,会正常走恢复流程。而达梦残留的 xa 事务一直被忽略处理。

最后 debug mysql 挂掉 和 达梦挂掉之后的区别。

发现二者在调用 xa 接口时因为数据库挂掉返回的错误码不一致,达梦返回的错误码没有按照 xa 规范进行处理,所以事务管理器 TM 无法根据错误码执行后续恢复操作。

MySQL 挂掉后事务恢复流程

MySQL JDBC 驱动处理逻辑如下:

// com.mysql.cj.jdbc.MysqlXAConnection
// MySQL 驱动会对数据库异常处理成符合 XA 规范的错误码返回。
protected static XAException mapXAExceptionFromSQLException(SQLException sqlEx) {
    Integer xaCode = (Integer)MYSQL_ERROR_CODES_TO_XA_ERROR_CODES.get(sqlEx.getErrorCode());
    return xaCode != null ? (XAException)(new MysqlXAException(xaCode, sqlEx.getMessage(), (String)null)).initCause(sqlEx) : (XAException)(new MysqlXAException(-7, Messages.getString("MysqlXAConnection.003"), (String)null)).initCause(sqlEx);
}

MySQL 在执行 xa commit 等阶段因为数据库挂掉报错如下,返回错误码 -7
在这里插入图片描述

查看 JTA 规范里 XAException 类中定义的错误码,-7 表示 RM unavailable.

// javax.transaction.xa.XAException
/**
 * Resource manager is unavailable.
 */
public final static int XAER_RMFAIL = -7;

在这里插入图片描述

Narayana 正确处理该错误码,后续进行事务恢复操作。

达梦挂掉后事务恢复流程

在这里插入图片描述

返回错误码 6001,看起来像是一个内部错误码,没有按照 XA 规范返回正确的错误码。会走最后一个分支,后续该事务无法恢复。
在这里插入图片描述

HEURISTIC_HAZARD 的含义如下:

// 在准备阶段之后,一些子参与者已经提交,一些已经回滚,还有一些我们不清楚。
public static final int HEURISTIC_HAZARD = 6;  // after prepare some sub-participants committed, some rolled back and some we don't know

XA规范

XA 规范中明确定义了 xa_commit 等操作应该返回的错误码。

-- 第47页
XAException.XAER_RMFAIL is used when some connection error happens and it is expected that reconnection of RM could occur

当发生某些连接错误时使用 XAException.XAER_RMFAIL,预计 RM 可能会重新连接

在这里插入图片描述

其他数据库处理流程

postgresql 驱动处理 xa 异常逻辑

Pg 同样处理,当数据库连接异常时返回 XAER_RMFAIL 错误码
在这里插入图片描述

问题解决

在TM侧兼容达梦原生异常,或者RM驱动修复返回正确的错误码即可。

参考

narayana Heuristic异常

xa规范文档

pg驱动对xa-commit异常处理的代码

  • 17
    点赞
  • 11
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

FlyingZCC

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值