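Scenario:
Improper use of RocketMQ transactional messages recently caused a production incident. The local transaction failed, yet the half message was still delivered, which left the data inconsistent.
RocketMQ transactional message execution steps: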
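As a rough reminder of the usual flow: the producer first sends a half message that consumers cannot see yet, then runs its local transaction, then reports commit, rollback, or unknown to the broker. For an unknown (or missing) report the broker later checks back with the producer. The sketch below models only that broker-side decision. `HalfMessageFlow`, `TxState`, `Action`, and `resolve` are hypothetical names chosen for illustration, not RocketMQ classes.

```java
// Hypothetical model of how a reported transaction state settles a half message.
// This is an illustration only, not RocketMQ code.
class HalfMessageFlow {
    // The three outcomes a producer can report for its local transaction.
    enum TxState { COMMIT, ROLLBACK, UNKNOWN }

    // What happens to the half message for each reported outcome.
    enum Action { DELIVER_TO_CONSUMERS, DISCARD, CHECK_BACK_LATER }

    static Action resolve(TxState reported) {
        switch (reported) {
            case COMMIT:
                return Action.DELIVER_TO_CONSUMERS;
            case ROLLBACK:
                return Action.DISCARD;
            default:
                // Unknown: the half message stays pending until a check-back settles it.
                return Action.CHECK_BACK_LATER;
        }
    }

    public static void main(String[] args) {
        for (TxState state : TxState.values()) {
            System.out.println(state + " -> " + resolve(state));
        }
    }
}
```

The check-back answer goes through the same decision, which is why a wrong answer to a check-back is as harmful as a wrong initial report.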
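Reproducing the incident:
Sending the half message and executing the local transaction
Problem 1: when the local transaction threw an exception, the catch block in our calling code did not catch it
Problem 2: in theory, once the local transaction fails, the RocketMQ broker should call the client's transaction-status check-back interface, but no check-back arrived
Problem 3: the half message was eventually delivered by RocketMQ
Incident analysis:
Let's trace through the RocketMQ client source code: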
@Override
public SendResult send(final Message message, final LocalTransactionExecuter executer, Object arg) {
    this.checkONSProducerServiceState(this.transactionMQProducer.getDefaultMQProducerImpl());
    com.aliyun.openservices.shade.com.alibaba.rocketmq.common.message.Message msgRMQ = ONSUtil.msgConvert(message);
    com.aliyun.openservices.shade.com.alibaba.rocketmq.client.producer.TransactionSendResult sendResultRMQ = null;
    try {
        // Send the half message
        sendResultRMQ = transactionMQProducer.sendMessageInTransaction(msgRMQ,
            new com.aliyun.openservices.shade.com.alibaba.rocketmq.client.producer.LocalTransactionExecuter() {
                @Override
                public LocalTransactionState executeLocalTransactionBranch(
                    com.aliyun.openservices.shade.com.alibaba.rocketmq.common.message.Message msg,
                    Object arg) {
                    String msgId = msg.getProperty(Constants.TRANSACTION_ID);
                    message.setMsgID(msgId);
                    TransactionStatus transactionStatus = executer.execute(message, arg);
                    if (TransactionStatus.CommitTransaction == transactionStatus) {
                        return LocalTransactionState.COMMIT_MESSAGE;
                    } else if (TransactionStatus.RollbackTransaction == transactionStatus) {
                        return LocalTransactionState.ROLLBACK_MESSAGE;
                    }
                    return LocalTransactionState.UNKNOW;
                }
            }, arg);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    if (sendResultRMQ.getLocalTransactionState() == LocalTransactionState.ROLLBACK_MESSAGE) {
        // The local transaction explicitly reported a failure, so throw an exception back to the application.
        throw new RuntimeException("local transaction branch failed ,so transaction rollback");
    }
    SendResult sendResult = new SendResult();
    sendResult.setMessageId(sendResultRMQ.getMsgId());
    sendResult.setTopic(sendResultRMQ.getMessageQueue().getTopic());
    return sendResult;
}
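The send method for half-transaction messages eventually lands here. It in turn calls transactionMQProducer.sendMessageInTransaction(final Message msg, final LocalTransactionExecuter tranExecuter, final Object arg) to send the half message, passing the local transaction along wrapped as an argument. So let's look at transactionMQProducer.sendMessageInTransaction: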
@Override
public TransactionSendResult sendMessageInTransaction(final Message msg,
    final LocalTransactionExecuter tranExecuter, final Object arg) throws MQClientException {
    // Check that a local-transaction check-back listener is configured; otherwise throw
    if (null == this.transactionCheckListener) {
        throw new MQClientException("localTransactionBranchCheckListener is null", null);
    }
    msg.setTopic(NamespaceUtil.wrapNamespace(this.getNamespace(), msg.getTopic()));
    // Send the half-transaction message; tranExecuter is the local transaction
    return this.defaultMQProducerImpl.sendMessageInTransaction(msg, tranExecuter, arg);
}
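This sendMessageInTransaction method first checks that a transaction check-back listener is configured and throws if it is not. It then calls defaultMQProducerImpl.sendMessageInTransaction to send the half-transaction message: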
public TransactionSendResult sendMessageInTransaction(final Message msg,
    final LocalTransactionExecuter tranExecuter, final Object arg)
    throws MQClientException {
    if (null == tranExecuter) {
        throw new MQClientException("tranExecutor is null", null);
    }
    Validators.checkMessage(msg, this.defaultMQProducer);
    SendResult sendResult = null;
    MessageAccessor.putProperty(msg, MessageConst.PROPERTY_TRANSACTION_PREPARED, "true");
    MessageAccessor.putProperty(msg, MessageConst.PROPERTY_PRODUCER_GROUP, this.defaultMQProducer.getProducerGroup());
    try {
        // Send the half-transaction message and read the result
        sendResult = this.send(msg);
    } catch (Exception e) {
        throw new MQClientException("send message Exception", e);
    }
    LocalTransactionState localTransactionState = LocalTransactionState.UNKNOW;
    Throwable localException = null;
    switch (sendResult.getSendStatus()) {
        case SEND_OK: {
            try {
                if (sendResult.getTransactionId() != null) {
                    msg.putUserProperty("__transactionId__", sendResult.getTransactionId());
                }
                // Execute the local transaction
                localTransactionState = tranExecuter.executeLocalTransactionBranch(msg, arg);
                if (null == localTransactionState) {
                    localTransactionState = LocalTransactionState.UNKNOW;
                }
                if (localTransactionState != LocalTransactionState.COMMIT_MESSAGE) {
                    log.info("executeLocalTransactionBranch return {}", localTransactionState);
                    log.info(msg.toString());
                }
            // The local transaction's exception turns out to be swallowed right here...
            } catch (Throwable e) {
                log.info("executeLocalTransactionBranch exception", e);
                log.info(msg.toString());
                localException = e;
            }
        }
        break;
        case FLUSH_DISK_TIMEOUT:
        case FLUSH_SLAVE_TIMEOUT:
        case SLAVE_NOT_AVAILABLE:
            localTransactionState = LocalTransactionState.ROLLBACK_MESSAGE;
            break;
        default:
            break;
    }
    try {
        this.endTransaction(sendResult, localTransactionState, localException);
    } catch (Exception e) {
        log.warn("local transaction execute " + localTransactionState + ", but end broker transaction failed", e);
    }
    TransactionSendResult transactionSendResult = new TransactionSendResult();
    transactionSendResult.setSendStatus(sendResult.getSendStatus());
    transactionSendResult.setMessageQueue(sendResult.getMessageQueue());
    transactionSendResult.setMsgId(sendResult.getMsgId());
    transactionSendResult.setQueueOffset(sendResult.getQueueOffset());
    transactionSendResult.setTransactionId(sendResult.getTransactionId());
    transactionSendResult.setLocalTransactionState(localTransactionState);
    // The exception is then wrapped into the returned object
    if (localException != null) {
        transactionSendResult.setErrorMessage("executeLocalTransactionBranch error. " + localException.getMessage());
        transactionSendResult.setRuntimeException(new RuntimeException(localException));
    }
    return transactionSendResult;
}
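At this point the first problem is clear.
sendResult = this.send(msg); sends the half-transaction message, and tranExecuter.executeLocalTransactionBranch(msg, arg); executes the local transaction.
If the local transaction throws, the exception is swallowed by the catch (Throwable e) that follows, and transactionSendResult.setErrorMessage and transactionSendResult.setRuntimeException merely wrap it into the returned object.
The state stays UNKNOW, and because the outer send method only throws for ROLLBACK_MESSAGE, nothing propagates to the caller, so the caller's catch block never fires.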
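The behaviour traced above can be reproduced with a small stand-alone model. This is a minimal sketch, not the client code: `SwallowedExceptionDemo`, `LocalTx`, `innerSend`, `outerSend`, and `callerCatchFired` are made-up names that mimic only the control flow of the two methods shown earlier.

```java
// Simplified stand-ins for the traced client methods; all names are illustrative.
class SwallowedExceptionDemo {
    enum State { COMMIT_MESSAGE, ROLLBACK_MESSAGE, UNKNOW }

    // The application's local transaction.
    interface LocalTx {
        State run() throws Exception;
    }

    // Mimics the inner sendMessageInTransaction: any Throwable is caught and the
    // state is left at UNKNOW instead of being rethrown.
    static State innerSend(LocalTx tx) {
        State state = State.UNKNOW;
        try {
            State returned = tx.run();
            if (returned != null) {
                state = returned;
            }
        } catch (Throwable e) {
            // Only recorded in the real client; never propagated.
        }
        return state;
    }

    // Mimics the outer send: only an explicit ROLLBACK_MESSAGE becomes an exception.
    static State outerSend(LocalTx tx) {
        State state = innerSend(tx);
        if (state == State.ROLLBACK_MESSAGE) {
            throw new RuntimeException("local transaction branch failed");
        }
        return state;
    }

    // Plays the role of the application code that wraps send in try/catch.
    static boolean callerCatchFired(LocalTx tx) {
        try {
            outerSend(tx);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        LocalTx failing = () -> { throw new IllegalStateException("db write failed"); };
        LocalTx explicitRollback = () -> State.ROLLBACK_MESSAGE;
        System.out.println(callerCatchFired(failing));          // false
        System.out.println(callerCatchFired(explicitRollback)); // true
    }
}
```

In this model a thrown exception leaves the message in the UNKNOW state and the caller sees nothing, while an explicit rollback status is the only failure the caller can observe.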
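Now for problem 2. The local-transaction check-back is triggered by the broker toward the client, so we had to download the open-source RocketMQ code and trace it. That led us to TransactionalMessageServiceImpl::check(long transactionTimeout, int transactionCheckMax, AbstractTransactionalMessageCheckListener listener).
The method is too long to reproduce here. Its main logic is to read, from the half-message topic RMQ_SYS_TRANS_HALF_TOPIC, the half messages that have been neither committed nor rolled back, and then check back on them. The core check method it calls is AbstractTransactionalMessageCheckListener#resolveHalfMsg.
Following the sendCheckMessage call from there, we reach the callback method AbstractTransactionalMessageCheckListener::sendCheckMessage(MessageExt msgExt), where we found one important piece of code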
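From this code we can see that checking back on a half-transaction message requires the groupId of the producer that sent it: the broker looks up a Netty channel by that groupId and sends the check-back over it. We had not set a producer group when sending transactional messages, so our producer used the default groupId. When several producers (different services) share that groupId, the RocketMQ broker cannot tell which service's check-back interface to call and may well call another service's, so the service that sent the half message never receives the check-back. This explains the second problem.
Two services currently use transactional messages in production. The logs showed that the check-back request was routed to the other service, whose check-back interface returned a commit, so RocketMQ delivered the half message. This explains the third problem.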
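The routing problem can be illustrated with a toy broker that only knows producer groups. This is a sketch under assumptions, not broker code: `GroupRoutingDemo`, `Client`, `register`, and `route` are hypothetical, as are the service names and group IDs, and the `pick` parameter simply stands in for whatever channel selection the broker applies within a group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy broker that routes check-back requests by producer group; not RocketMQ code.
class GroupRoutingDemo {
    enum State { COMMIT, ROLLBACK, UNKNOWN }

    // A connected producer client and the answer its check-back handler gives.
    static class Client {
        final String service;
        final State checkAnswer;

        Client(String service, State checkAnswer) {
            this.service = service;
            this.checkAnswer = checkAnswer;
        }
    }

    // Group ID -> clients registered under that group.
    private final Map<String, List<Client>> clientsByGroup = new HashMap<>();

    void register(String group, Client client) {
        clientsByGroup.computeIfAbsent(group, g -> new ArrayList<>()).add(client);
    }

    // The broker only knows the group, so any client in that group may be picked.
    Client route(String group, int pick) {
        List<Client> clients = clientsByGroup.get(group);
        return clients.get(pick % clients.size());
    }

    public static void main(String[] args) {
        Client orders = new Client("order-service", State.ROLLBACK);
        Client points = new Client("points-service", State.COMMIT);

        // Both services share one group: a check for order-service's half message
        // can reach points-service, which answers COMMIT.
        GroupRoutingDemo shared = new GroupRoutingDemo();
        shared.register("DEFAULT", orders);
        shared.register("DEFAULT", points);
        System.out.println(shared.route("DEFAULT", 1).service);   // points-service

        // One group per service: the check can only reach the sender.
        GroupRoutingDemo separate = new GroupRoutingDemo();
        separate.register("GID_order", orders);
        separate.register("GID_points", points);
        System.out.println(separate.route("GID_order", 1).service); // order-service
    }
}
```

With a shared group the outcome depends on which client happens to be selected, which matches the intermittent, hard-to-explain behaviour seen in the incident.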
The other service's check-back code is as follows:
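Summary:
1. The local transaction must catch its own exceptions and explicitly return a rollback status
2. When using transactional messages, the producer must set its own groupId so the RocketMQ broker knows which client's local transaction to check back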
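The first point might look like the sketch below. It is a minimal illustration under assumptions: `SafeExecuterDemo`, `LocalWork`, and `executeSafely` are hypothetical, and `TransactionStatus` is redefined locally so the example is self-contained (its constant names merely echo the client's enum).

```java
// Illustrative types only; the names echo the ONS client but are defined here.
class SafeExecuterDemo {
    enum TransactionStatus { CommitTransaction, RollbackTransaction, Unknow }

    // The business logic of the local transaction.
    interface LocalWork {
        void run() throws Exception;
    }

    // Runs the business logic and always maps the outcome to an explicit status,
    // so a failure is reported as a rollback instead of escaping as an exception.
    static TransactionStatus executeSafely(LocalWork work) {
        try {
            work.run();
            return TransactionStatus.CommitTransaction;
        } catch (Exception e) {
            System.err.println("local transaction failed: " + e.getMessage());
            return TransactionStatus.RollbackTransaction;
        }
    }

    public static void main(String[] args) {
        System.out.println(executeSafely(() -> { }));
        System.out.println(executeSafely(() -> {
            throw new IllegalStateException("db write failed");
        }));
    }
}
```

In a real producer, the body of the LocalTransactionExecuter's execute method would follow this shape. For the second point, the group ID is normally supplied in the producer's configuration when it is created; in the Aliyun ONS client this is, to my knowledge, the `PropertyKeyConst.GROUP_ID` property, but the exact key should be checked against the SDK version in use.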