问题排查过程
在某天的一个下午线上突然出现邮件预警,有个账户消费接口返回500,异常msg为Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022。
一开始通过查库发现该账号的消费日志和账户的余额是正常的,于是乎将该次请求的trace_log导出来发现在出现TransactionTimedOutException前出现了feign调用超时的异常,异常信息如下:
feign.RetryableException: Read timed out executing POST xxx
at feign.FeignException.errorExecuting(FeignException.java:65)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77)
at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx(HystrixInvocationHandler.java:107)
at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx$accessor$sQ1jtBAA(HystrixInvocationHandler.java)
at feign.hystrix.HystrixInvocationHandler$1$auxiliary$MLdm5cdf.call(Unknown Source)
at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298)
...
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
...
这里feign调用就是往mq发送消息,于是去代码里查看发送消息是在哪个位置去进行发送的,伪代码如下:
@Service
public class AccountService {
@Resource
private MessageService messageService;
@Transactional(timeout = 5,rollbackFor = Exception.class)
public void consume() {
// 消费扣余额,记录消费日志
...
// 发消息
messageService.sendMessage();
}
}
@Service
public class MessageService {
public void sendMessage() {
...
TransactionSynchronizationManager.registerSynchronization(
new TransactionSynchronizationAdapter() {
@Override
public void afterCommit() {
//发消息
...
}
});
}
}
可以看到consume方法标注了@Transactional并且设置了事务超时时间为5s。往MQ发送消息利用了TransactionSynchronizationManager.registerSynchronization管理当前线程的事务,重写了afterCommit方法,让事务在提交后再执行消息的发送。排查到这里,一开始我怀疑是TransactionSynchronizationManager.registerSynchronization的用法问题导致,因此写了个小Demo进行测试,具体代码如下:
@Transactional(timeout = 5, rollbackFor = Exception.class)
public void test() {
TestEntity entity = new TestEntity();
entity.setName("test");
entity.setAge(1);
testMapper.insert(entity);
testService.testSynchronization();
}
public void testSynchronization() {
TransactionSynchronizationManager.registerSynchronization(
new TransactionSynchronizationAdapter() {
@Override
public void afterCommit() {
try {
Thread.sleep(6000);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("test synchronization");
}
});
}
经过调试后发现尽管在afterCommit方法里让线程休眠了5秒,但是数据依然新增成功,并且接口返回200,到这里我的首次推断被推翻。
重新回到trace_log去排查,发现通过feign调用往mq发送消息前,有经过4次的Mysql读写操作,这4次加起来也就几ms,真正进行feign调用发消息前也就经过了3ms,也就是业务代码耗时是不存在的。继续往下排查,发现在feign调用超时后又获取了JDBC的连接,具体信息如下:
o.m.s.t.SpringManagedTransaction - JDBC Connection [org.apache.shardingsphere.shardingjdbc.jdbc.core.connection.ShardingConnection@6814b439] will be managed by Spring
c.f.e.c.mybatis.SQLExecutionWatcher - update, elapsedTime=4(ms)
o.s.jdbc.datasource.DataSourceUtils - Returning JDBC Connection to DataSource
c.f.e.w.p.w.e.ExceptionTrackingHandler - Resolving exception from handler org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
at org.springframework.transaction.support.ResourceHolderSupport.checkTransactionTimeout(ResourceHolderSupport.java:141)
at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInMillis(ResourceHolderSupport.java:130)
at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInSeconds(ResourceHolderSupport.java:114)
at org.mybatis.spring.transaction.SpringManagedTransaction.getTimeout(SpringManagedTransaction.java:137)
...
at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:278)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:58)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
at com.sun.proxy.$Proxy170.insertSelective(Unknown Source)
从报错的堆栈信息中可以看到这里有执行insert方法,跟踪代码发现在进行feign调用发送消息时有进行异常捕获,若出现异常则将消息存到表里。那么到这里问题就显而易见了,问题如下:
在首次操作数据库建立连接开启事务时通过org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout方法计算出事务的TimeToLive。经过一系列的Mysql读写操作后通过TransactionSynchronizationManager.registerSynchronization进行同步器的注册。事务提交后执行了afterCommit方法,该方法内部通过feign调用进行消息发送。由于feign调用超时导致异常被捕获后执行发送失败的消息入库操作。进行入库操作时获取JDBC连接时判断事务已经过了超时时间,因此出现TransactionTimedOutException异常。
追踪下getTimeout方法源码,org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout
public Integer getTimeout() throws SQLException {
ConnectionHolder holder = (ConnectionHolder) TransactionSynchronizationManager.getResource(dataSource);
if (holder != null && holder.hasTimeout()) {
return holder.getTimeToLiveInSeconds();
}
return null;
}
org.springframework.transaction.support.TransactionSynchronizationManager#getResource
public static Object getResource(Object key) {
Object actualKey = TransactionSynchronizationUtils.unwrapResourceIfNecessary(key);
// 根据key值获取连接
// 从TransactionSynchronizationManager.resources拿
// resources的结构为ThreadLocal<Map<Object, Object>>
Object value = doGetResource(actualKey);
if (value != null && logger.isTraceEnabled()) {
logger.trace("Retrieved value [" + value + "] for key [" + actualKey + "] bound to thread [" +
Thread.currentThread().getName() + "]");
}
return value;
}
public abstract class TransactionSynchronizationManager {
//线程上下文中保存着【线程池对象:ConnectionHolder】的Map对象。线程可以通过该属性获取到同一个Connection对象。
private static final ThreadLocal<Map<Object, Object>> resources = new NamedThreadLocal<>("Transactional resources");
//事务同步器,是Spring交由程序员进行扩展的代码,每个线程可以注册N个事务同步器。
private static final ThreadLocal<Set<TransactionSynchronization>> synchronizations = new NamedThreadLocal<>("Transaction synchronizations");
// 事务的名称
private static final ThreadLocal<String> currentTransactionName = new NamedThreadLocal<>("Current transaction name");
// 事务是否是只读
private static final ThreadLocal<Boolean> currentTransactionReadOnly = new NamedThreadLocal<>("Current transaction read-only status");
// 事务的隔离级别
private static final ThreadLocal<Integer> currentTransactionIsolationLevel = new NamedThreadLocal<>("Current transaction isolation level");
// 事务是否开启 actual:真实的
private static final ThreadLocal<Boolean> actualTransactionActive = new NamedThreadLocal<>("Actual transaction active");
}
在afterCommit方法中通过feign调用发送消息时是在主线程进行发送的,导致getTimeout方法内部通过checkTransactionTimeout方法进行判断时直接抛异常。
由于发送消息的方法只在主线程进行发送的,虽然不会影响数据的准确性,但是假设发送消息因网络波动耗时了一会儿会导致接口响应延迟,让用户体验不好。因此这块需要进行解耦,进行异步发送。那么该如何解决这个问题呢?
- 将发送消息的方法放在事务外,利用@Async进行异步发送。
- 利用@TransactionalEventListener+@Async进行异步发送
总结
每一次问题的排查、解决都是一次提升,共勉!