Lettuce中RedisCommandTimeoutException异常分析

gb4215287

于 2024-09-10 20:01:34 发布

阅读量23

点赞数

分类专栏： java 文章标签： bootstrap 前端 html

原文链接：https://juejin.cn/post/7307723901529407525

版权

java 专栏收录该内容

899 篇文章 17 订阅

订阅专栏

一、背景

服务端采用 Redis 集群配置，版本 7.0，三主三从。客户端基于 Spring Boot 2.7.0 开发，spring-boot-starter-data-redis 版本为 2.7.0。

配置了 Redis 拓扑自适应刷新，没有配置定时刷新，配置如下。

yaml

代码解读

复制代码

spring: redis: cluster: nodes: - 10.200.1.1:6389 - 10.200.1.2:6389 - 10.200.1.3:6389 - 10.200.1.1:6390 - 10.200.1.2:6390 - 10.200.1.3:6390 timeout: 2s username: xxx password: xxxx lettuce: cluster: refresh: adaptive: true

某天，生产出现大量报错，持续了 20 分钟都没恢复，最后重启程序后恢复正常。

sql

代码解读

复制代码

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)

二、分析

原因为当日 10.200.1.1 这台服务器发生硬件故障，查看 redis 集群状态（原状态为三台服务器的 6389 端口为主节点），发生了主从切换，10.200.1.2:6390 切换为主节点，之后 Redis 集群可正常提供服务。

Redis 集群拓扑有变化，为什么客户端没有自动刷新呢？在之前的应急演练中，可是都能刷新的。(应急演练时，主要做了 Redis 节点切换，以及直接 kill -9 某个节点)

查看源码：

java

代码解读

复制代码

public class ClusterTopologyRefreshOptions { public Builder enableAllAdaptiveRefreshTriggers() { adaptiveRefreshTriggers.addAll(EnumSet.allOf(RefreshTrigger.class)); return this; } public enum RefreshTrigger { /** * Redis responds with a {@code MOVED} redirection to a command. */ MOVED_REDIRECT, /** * Redis responds with a {@code ASK} redirection to a command. */ ASK_REDIRECT, /** * Connections to a particular host run into persistent reconnects (more than one attempt). */ PERSISTENT_RECONNECTS, /** * Attempts to use a slot that is not covered by a known node. * * @since 5.2 */ UNCOVERED_SLOT, /** * Connection attempts to unknown nodes. * * @since 5.1 */ UNKNOWN_NODE } }

MOVED_REDIRECT, ASK_REDIRECT，分别对应 Redis 的 MOVED、ASK 指令，主要针对槽位发生变化（比如 Redis 集群扩容/缩容）或者节点发生主从切换，此时所查找的 key 不在当前节点，因此需要重定向到其他节点。既然是 Redis 返回的指令，前提自然是节点正常。

PERSISTENT_RECONNECTS，连接异常，重新连接，前提是服务端能正常的返回 tcp 报文(FIN/RST)。当 Redis 节点挂了时，比如 10.200.1.1:6389 进程不存在了，操作系统未找到 6389 端口，会返回 RST 报文。

UNCOVERED_SLOT，槽位未覆盖；UNKNOWN_NODE，未知节点。这 2 个一般不会触发。

以上分析可知，硬件故障很可能不触发自适应刷新。

三、模拟

通过 ifdown <物理端口> 直接关闭 10.200.1.1 服务器的网卡，模拟服务器失联情况下的是否能自动切换。

1、仅配置自适应刷新

查看客户端 socket 连接，没有变化，通过命令 netstat -an | grep -i estab | grep :63

查看，发现原先连接到 10.200.1.1:6389 的 socket 连接 Send-Q 数值在不断增加，说明请求还是会往这条不通的连接上发送。程序端不停的在报 RedisCommandTimeoutException。

2、配置了自适应刷新 + 定时刷新

增加配置 spring.redis.lettuce.cluster.refresh.period=30s

查看客户端 socket 连接，刚开始现象和仅配置自适应刷新一致，10.200.1.1:6389 的 socket 连接 Send-Q 数值在不断增加，说明请求还是会往这条不通的连接上发送，程序端不停的在报 RedisCommandTimeoutException。大概 32s 以后，程序端不再报错。查看 10.200.1.1:6389 的 socket 连接，虽然还是 establish 状态，但是 Send-Q 不再增加，说明客户端拓扑已经刷新，新的请求会走 10.200.1.2:6390。

3、同时配置了自适应刷新 + 定时刷新 + 定时校验连接是否正常

继续增加配置：

java

代码解读

复制代码

@Component public class LettuceConnectionValidTask { @Autowired private RedisConnectionFactory redisConnectionFactory; @Scheduled(cron = "0/2 * * * * *") public void task() { if (redisConnectionFactory instanceof LettuceConnectionFactory) { LettuceConnectionFactory f = (LettuceConnectionFactory) redisConnectionFactory; f.validateConnection(); } } }

查看程序，除了每隔几秒打印 Validation of shared connection failed. Creating a new connection. 错误日志外，其余现象和配置了自适应刷新 + 定时刷新一致，程序也在大概 33s 后恢复正常。通过以上模拟，可得出结论：为保证客户端能及时的刷新 Redis 拓扑结构，需要同时配置自适应刷新和定时刷新，定时校验连接是否正常无效。

yaml

代码解读

复制代码

spring: redis: lettuce: cluster: refresh: adaptive: true period: 30s

四、源码分析

为什么配置了定时校验连接无效呢？

继续查看源码：

java

代码解读

复制代码

public class LettuceConnectionFactory implements InitializingBean, DisposableBean, RedisConnectionFactory, ReactiveRedisConnectionFactory { { void validateConnection() { synchronized (this.connectionMonitor) { boolean valid = false; if (connection != null && connection.isOpen()) { try { if (connection instanceof StatefulRedisConnection) { ((StatefulRedisConnection) connection).sync().ping(); } //连接异常时，无法收到pong回复，直接抛出异常 if (connection instanceof StatefulRedisClusterConnection) { ((StatefulRedisClusterConnection) connection).sync().ping(); } valid = true; } catch (Exception e) { log.debug("Validation failed", e); } } if (!valid) { log.info("Validation of shared connection failed. Creating a new connection."); resetConnection(); //重新获取连接 this.connection = getNativeConnection(); } } } }

校验连接异常时，会重新获取连接。继续查看 getNativeConnection() 方法，一路追踪，

java

代码解读

复制代码

public class RedisClusterClient { private <K, V> CompletableFuture<StatefulRedisClusterConnection<K, V>> connectClusterAsync(RedisCodec<K, V> codec) { if (partitions == null) { return Futures.failed(new IllegalStateException( "Partitions not initialized. Initialize via RedisClusterClient.getPartitions().")); } //重点关注 topologyRefreshScheduler.activateTopologyRefreshIfNeeded(); logger.debug("connectCluster(" + initialUris + ")"); DefaultEndpoint endpoint = new DefaultEndpoint(getClusterClientOptions(), getResources()); RedisChannelWriter writer = endpoint; if (CommandExpiryWriter.isSupported(getClusterClientOptions())) { writer = new CommandExpiryWriter(writer, getClusterClientOptions(), getResources()); } if (CommandListenerWriter.isSupported(getCommandListeners())) { writer = new CommandListenerWriter(writer, getCommandListeners()); } ClusterDistributionChannelWriter clusterWriter = new ClusterDistributionChannelWriter(writer, getClusterClientOptions(), topologyRefreshScheduler); PooledClusterConnectionProvider<K, V> pooledClusterConnectionProvider = new PooledClusterConnectionProvider<>(this, clusterWriter, codec, topologyRefreshScheduler); clusterWriter.setClusterConnectionProvider(pooledClusterConnectionProvider); //连接对象，基于 clusterWriter，而 clusterWriter 又关联了 topologyRefreshScheduler StatefulRedisClusterConnectionImpl<K, V> connection = newStatefulRedisClusterConnection(clusterWriter, pooledClusterConnectionProvider, codec, getDefaultTimeout()); connection.setReadFrom(ReadFrom.UPSTREAM); connection.setPartitions(partitions); Supplier<CommandHandler> commandHandlerSupplier = () -> new CommandHandler(getClusterClientOptions(), getResources(), endpoint); Mono<SocketAddress> socketAddressSupplier = getSocketAddressSupplier(connection::getPartitions, TopologyComparators::sortByClientCount); //创建物理连接 Mono<StatefulRedisClusterConnectionImpl<K, V>> connectionMono = Mono .defer(() -> connect(socketAddressSupplier, endpoint, connection, commandHandlerSupplier)); for (int i = 1; i < getConnectionAttempts(); i++) { connectionMono = connectionMono .onErrorResume(t -> connect(socketAddressSupplier, endpoint, connection, commandHandlerSupplier)); } return connectionMono.doOnNext( c -> connection.registerCloseables(closeableResources, clusterWriter, pooledClusterConnectionProvider)) .map(it -> (StatefulRedisClusterConnection<K, V>) it).toFuture(); } }

实际创建连接时，需要获取 Redis 集群的拓扑状态，topologyRefreshScheduler 用于刷新状态，接下来看下何时刷新，即以上需要重点关注的 topologyRefreshScheduler.activateTopologyRefreshIfNeeded() 方法。

java

代码解读

复制代码

class ClusterTopologyRefreshScheduler implements Runnable, ClusterEventListener { protected void activateTopologyRefreshIfNeeded() { ClusterClientOptions options = clientOptions.get(); ClusterTopologyRefreshOptions topologyRefreshOptions = options.getTopologyRefreshOptions(); //periodicRefreshEnabled 自动刷新字段默认为 false，当配置 spring.redis.lettuce.cluster.refresh.period 时，设置为 true if (!topologyRefreshOptions.isPeriodicRefreshEnabled() || clusterTopologyRefreshActivated.get()) { return; } if (clusterTopologyRefreshActivated.compareAndSet(false, true)) { //开启定时任务，假设配置的 spring.redis.lettuce.cluster.refresh.period=30s，则该定时任务 30s 后开始执行，并且每隔 30s 执行一次 ScheduledFuture<?> scheduledFuture = genericWorkerPool.scheduleAtFixedRate(this, options.getRefreshPeriod().toNanos(), options.getRefreshPeriod().toNanos(), TimeUnit.NANOSECONDS); clusterTopologyRefreshFuture.set(scheduledFuture); } } private final ClusterTopologyRefreshTask clusterTopologyRefreshTask; public void run() { logger.debug("ClusterTopologyRefreshScheduler.run()"); if (isEventLoopActive()) { if (!clientOptions.get().isRefreshClusterView()) { logger.debug("Periodic ClusterTopologyRefresh is disabled"); return; } } else { logger.debug("Periodic ClusterTopologyRefresh is disabled"); return; } //实际的定时任务是 clusterTopologyRefreshTask clientResources.eventExecutorGroup().submit(clusterTopologyRefreshTask); } private static class ClusterTopologyRefreshTask extends AtomicBoolean implements Runnable { private final Supplier<CompletionStage<?>> reloadTopologyAsync; ClusterTopologyRefreshTask(Supplier<CompletionStage<?>> reloadTopologyAsync) { this.reloadTopologyAsync = reloadTopologyAsync; } public void run() { if (compareAndSet(false, true)) { doRun(); return; } } void doRun() { try { //实际的定时任务是reloadTopologyAsync reloadTopologyAsync.get().whenComplete((ignore, throwable) -> { if (throwable != null) { logger.warn("Cannot refresh Redis Cluster topology", throwable); } set(false); }); } catch (Exception e) { logger.warn("Cannot refresh Redis Cluster topology", e); } } } }

继续追踪代码，查看最终的定时任务，是 RedisClusterClient.refreshPartitionsAsync() 方法。

java

代码解读

复制代码

public class RedisClusterClient extends AbstractRedisClient { public CompletionStage<Void> refreshPartitionsAsync() { List<RedisURI> sources = new ArrayList<>(); Iterable<RedisURI> topologyRefreshSource = getTopologyRefreshSource(); for (RedisURI redisURI : topologyRefreshSource) { sources.add(redisURI); } EventRecorder.RecordableEvent event = EventRecorder.getInstance().start(new TopologyRefreshEvent(sources)); if (partitions == null) { return initializePartitions().thenAccept(Partitions::updateCache) .whenComplete((unused, throwable) -> event.record()); } //获取最新的 Redis 集群信息，包括节点/槽位等 return loadPartitionsAsync().thenAccept(loadedPartitions -> { if (TopologyComparators.isChanged(getPartitions(), loadedPartitions)) { logger.debug("Using a new cluster topology"); List<RedisClusterNode> before = new ArrayList<>(getPartitions()); List<RedisClusterNode> after = new ArrayList<>(loadedPartitions); getResources().eventBus().publish(new ClusterTopologyChangedEvent(before, after)); } this.partitions.reload(loadedPartitions.getPartitions()); updatePartitionsInConnections(); }).whenComplete((unused, throwable) -> event.record()); } }

由以上代码分析可知，当校验到连接无效时，会重新创建连接。创建连接需要获取 Redis 集群的拓扑结构，而刷新 Redis 集群拓扑结构在单独的线程中，每 30s 执行一次。所以，刚开始重新创建连接时，由于拓扑结构还没刷新，客户端认为 10.200.1.1:6389 还是 master 节点，会继续尝试连接该节点，从而失败；直到 30s 左右，定时任务执行后才会获取到最新的拓扑，从而连接 10.200.1.2:6390，之后程序不再报错。

五、思考

刷新 Redis 集群拓扑由单独的线程作为定时任务处理的最大问题是，会导致客户端不能及时获取集群的最新状态，在这期间，请求报错，是否有更合理的设计呢？

A：校验到连接异常时，立即刷新 Redis 集群拓扑，从而获取可用连接。缺点：Redis 集群本身拓扑没有变化时，直接重新获取连接即可，没必要刷新；更大的问题在于，定时校验连接本身就是一项开销。

B：定时刷新拓扑的时间设置的小一些，比如 5s。缺点：还是存在 5s 左右的异常期，同时也需要考虑 cluster nodes 命令的开销。

C：客户端报错 RedisCommandTimeoutException 时刷新。缺点：报错的原因有可能是节点异常，没有响应；也有可能是出现了长耗时命令，导致在规定的时间内没有收到，但是客户端无法区分。如果因为代码不规范导致大量的超时的话，此时不停的刷新拓扑会导致服务端压力大。当然也可以优化，比如触发刷新时，判断与上次刷新的间隔，如果在 5s 内则不执行。

作者：pggsnap
链接：https://juejin.cn/post/7307723901529407525
来源：稀土掘金
著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。