5、value为null引发的Redisson异常之RedisResponseTimeoutException

最新推荐文章于 2024-07-18 21:41:17 发布

小白de成长之路

最新推荐文章于 2024-07-18 21:41:17 发布

阅读量2.5k

点赞数 6

分类专栏：日常Bug分析与修复方案存储相关文章标签： redis 数据库

本文链接：https://blog.csdn.net/java__ee/article/details/131326432

版权

存储相关同时被 2 个专栏收录

21 篇文章 1 订阅

订阅专栏

日常Bug分析与修复方案

8 篇文章 0 订阅

订阅专栏

文章目录

项目场景：
问题描述
原因分析：
解决方案：
感悟：

项目场景：

项目中使用redis存储一些热点数据

问题描述

升级后次日早晨，生产监控群出现大量redis相关异常，异常堆栈信息如下：

org.springframework.dao.QueryTimeoutException: Redis server response timeout (3000 ms) occured after 0 retry attempts. 
Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: (INCRBY), 
params: [[-84, -19, 0, 5, 116, 0, 18, 101, 110, 113, ...], 9], 
channel: [id: 0xd***e, L:/17*.*.*.*:3***6 - R:r-2zeijkfagqgv2g42ye.redis.rds.aliyuncs.com/17*.*.*.*:6379]; 
nested exception is org.redisson.client.RedisResponseTimeoutException: 
Redis server response timeout (3000 ms) occured after 0 retry attempts. 
Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: (INCRBY),
params: [[-84, -19, 0, 5, 116, 0, 18, 101, 110, 113, ...], 9], 
channel: [id: 0xdcb***ce, L:/17*.*.*.*:3***6 - R:r-2zeijkfagqgv2g42ye.redis.rds.aliyuncs.com/10.1*****:6379]
	at org.redisson.spring.data.connection.RedissonExceptionConverter.convert(RedissonExceptionConverter.java:48)
	at org.redisson.spring.data.connection.RedissonExceptionConverter.convert(RedissonExceptionConverter.java:35)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.redisson.spring.data.connection.RedissonConnection.transform(RedissonConnection.java:197)
	at org.redisson.spring.data.connection.RedissonConnection.syncFuture(RedissonConnection.java:192)
	at org.redisson.spring.data.connection.RedissonConnection.sync(RedissonConnection.java:360)
	at org.redisson.spring.data.connection.RedissonConnection.write(RedissonConnection.java:726)
	at org.redisson.spring.data.connection.RedissonConnection.incrBy(RedissonConnection.java:574)
	at org.springframework.data.redis.core.DefaultValueOperations.lambda$increment$1(DefaultValueOperations.java:167)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:223)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:190)
	at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97)
	at org.springframework.data.redis.core.DefaultValueOperations.increment(DefaultValueOperations.java:167)

原因分析：

首先排查堆栈，发现代码出现bug，导致出现存入redis中的数据，value为null，当时由于只有极少情况才会出现value为空，就准备晚上修复，开始排查为什么超时。
1.联系运维查看redis运行情况，检查Redis参数配置；（近期未修改，没有发现问题）
2.网络原因；（网络未抖动）
3.开始查看Redis源码，此时有同事反映发现redis存入value为null时，就会出现RedisResponseTimeoutException问题，结合源码和现象发现了问题。继续往下看分析。
4.从日志的表面看，的确就是超时了，第一句话“org.springframework.dao.QueryTimeoutException: Redis server response timeout (3000 ms) occured after 0 retry attempts. Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. ”表示超过了 3000ms 还没有响应，3000ms 是 redisson 的响应超时时间，那么这个具体是从哪里来的呢？在源码中，（几个重要方法write，writeAsync，async，execute，checkWriteFuture，scheduleResponseTimeout）

	<T> T write(byte[] key, Codec codec, RedisCommand<?> command, Object... params) {
        RFuture<T> f = executorService.writeAsync(key, codec, command, params);
        indexCommand(command);
        return sync(f);
    }
	
	public <T, R> RFuture<R> writeAsync(byte[] key, Codec codec, RedisCommand<T> command, Object... params) {
        NodeSource source = getNodeSource(key);
        return async(false, source, codec, command, params, false, false);
    }

	public <V, R> RFuture<R> async(boolean readOnlyMode, NodeSource source, Codec codec,
            RedisCommand<V> command, Object[] params, boolean ignoreRedirect, boolean noRetry) {
        CompletableFuture<R> mainPromise = createPromise();
        RedisExecutor<V, R> executor = new RedisExecutor<>(readOnlyMode, source, codec, command, params, mainPromise,
                                                    ignoreRedirect, connectionManager, objectBuilder, referenceType, noRetry);
        executor.execute();
        return new CompletableFutureWrapper<>(mainPromise);
    }

	public void execute() {
        if (mainPromise.isCancelled()) {
            free();
            return;
        }

        if (!connectionManager.getShutdownLatch().acquire()) {
            free();
            mainPromise.completeExceptionally(new RedissonShutdownException("Redisson is shutdown"));
            return;
        }

        codec = getCodec(codec);

        CompletableFuture<RedisConnection> connectionFuture = getConnection().toCompletableFuture();

        CompletableFuture<R> attemptPromise = new CompletableFuture<>();
        mainPromiseListener = (r, e) -> {
            if (mainPromise.isCancelled() && connectionFuture.cancel(false)) {
                log.debug("Connection obtaining canceled for {}", command);
                timeout.cancel();
                if (attemptPromise.cancel(false)) {
                    free();
                }
            }
        };

        if (attempt == 0) {
            mainPromise.whenComplete((r, e) -> {
                if (this.mainPromiseListener != null) {
                    this.mainPromiseListener.accept(r, e);
                }
            });
        }

        scheduleRetryTimeout(connectionFuture, attemptPromise);

        connectionFuture.whenComplete((connection, e) -> {
            if (connectionFuture.isCancelled()) {
                connectionManager.getShutdownLatch().release();
                return;
            }

            if (connectionFuture.isDone() && connectionFuture.isCompletedExceptionally()) {
                connectionManager.getShutdownLatch().release();
                exception = convertException(connectionFuture);
                return;
            }

            sendCommand(attemptPromise, connection);

            writeFuture.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture future) throws Exception {
                    checkWriteFuture(writeFuture, attemptPromise, connection);
                }
            });
        });

        attemptPromise.whenComplete((r, e) -> {
            releaseConnection(attemptPromise, connectionFuture);

            checkAttemptPromise(attemptPromise, connectionFuture);
        });
    }

	private void checkWriteFuture(ChannelFuture future, CompletableFuture<R> attemptPromise, RedisConnection connection) {
        if (future.isCancelled() || attemptPromise.isDone()) {
            return;
        }

        if (!future.isSuccess()) {
            exception = new WriteRedisConnectionException(
                    "Unable to write command into connection! Increase connection pool size. Node source: " + source + ", connection: " + connection +
                    ", command: " + LogHelper.toString(command, params)
                    + " after " + attempt + " retry attempts", future.cause());
            if (attempt == attempts) {
                attemptPromise.completeExceptionally(exception);
            }
            return;
        }

        timeout.cancel();

        scheduleResponseTimeout(attemptPromise, connection);
    }

redisson 的写入操作是一个异步转同步的过程，使用了 sync 方法阻塞等待 future 的结果。既然是异步的过程，它是如何控制超时时间的，答案是org.redisson.command.RedisExecutor#scheduleResponseTimeout方法，这个方法会起一个定时任务，任务的在 3000ms（可配置）之后执行，任务的内容是将操作结果置为超时，也就是说，如果 3000ms 内该任务没有被 cancel 掉，就会抛出超时的异常。核心逻辑是这一段代码：

private void scheduleResponseTimeout(CompletableFuture<R> attemptPromise, RedisConnection connection) {
        long timeoutTime = responseTimeout;
        if (command != null && command.isBlockingCommand()) {
            Long popTimeout = null;
            if (RedisCommands.BLOCKING_COMMANDS.contains(command)) {
                for (int i = 0; i < params.length-1; i++) {
                    if ("BLOCK".equals(params[i])) {
                        popTimeout = Long.valueOf(params[i+1].toString()) / 1000;
                        break;
                    }
                }
            } else {
                popTimeout = Long.valueOf(params[params.length - 1].toString());
            }

            handleBlockingOperations(attemptPromise, connection, popTimeout);
            if (popTimeout == 0) {
                return;
            }
            timeoutTime += popTimeout * 1000;
            // add 1 second due to issue https://github.com/antirez/redis/issues/874
            timeoutTime += 1000;
        }

        long timeoutAmount = timeoutTime;
        TimerTask timeoutResponseTask = timeout -> {
            if (isResendAllowed(attempt, attempts)) {
                if (!attemptPromise.cancel(false)) {
                    return;
                }

                connectionManager.newTimeout(t -> {
                    attempt++;
                    if (log.isDebugEnabled()) {
                        log.debug("attempt {} for command {} and params {}",
                                attempt, command, LogHelper.toString(params));
                    }

                    mainPromiseListener = null;
                    execute();
                }, retryInterval, TimeUnit.MILLISECONDS);
                return;
            }

            attemptPromise.completeExceptionally(
                    new RedisResponseTimeoutException("Redis server response timeout (" + timeoutAmount + " ms) occured"
                            + " after " + attempt + " retry attempts. Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: "
                            + LogHelper.toString(command, params) + ", channel: " + connection.getChannel()));
        };

        timeout = connectionManager.newTimeout(timeoutResponseTask, timeoutTime, TimeUnit.MILLISECONDS);
    }

上述代码表明writeFuture有一个 Listener，在 write 操作完成的时候进行检查checkWriteFuture。也就是说，在scheduleResponseTimeout之前，实际请求已经发出去了，那么 redis 服务端有没有收到呢？使用 redis-cli 的 monitor 功能查看服务端接收到的请求，的确能看到写入的请求，模拟发现 redis 的服务端看，对 SET 操作的处理也非常快。那问题是不是在响应处理这一端出现的呢？接下来我们来看一看 redisson 在处理请求/响应时到底干了什么，具体可以参考org.redisson.client.handler.RedisChannelInitializer#initChannel方法，代码如下：

@Override
protected void initChannel(Channel ch) throws Exception {
    initSsl(config, ch);
    
    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new RedisConnectionHandler(redisClient));
    } else {
        ch.pipeline().addLast(new RedisPubSubConnectionHandler(redisClient));
    }

    ch.pipeline().addLast(
        connectionWatchdog,
        CommandEncoder.INSTANCE,
        CommandBatchEncoder.INSTANCE);

    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new CommandsQueue());
    } else {
        ch.pipeline().addLast(new CommandsQueuePubSub());
    }

    if (pingConnectionHandler != null) {
        ch.pipeline().addLast(pingConnectionHandler);
    }
    
    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new CommandDecoder(config.getAddress().getScheme()));
    } else {
        ch.pipeline().addLast(new CommandPubSubDecoder(config));
    }

    ch.pipeline().addLast(new ErrorsLoggingHandler());

    config.getNettyHook().afterChannelInitialization(ch);
}

代码中可以看到很多熟悉的身影，比如pingConnectionHandler、CommandEncoder、connectionWatchdog。从以上的代码中，我们可以简单画一下输入和输出的 pipeline，输出（请求 redis）的 pipieline 如下：

ErrorsLoggingHandler -> CommandsQueue -> CommandBatchEncoder -> CommandEncoder

输入（响应 redis，还包括建立连接等）的 pipeline 如下：

RedisConnectionHandler -> ConnectionWatchdog -> PingConnectionHandler -> CommandDecoder -> ErrorsLoggingHandler

在输出即请求链路上，按我们的分析，set null 时在CommandEncoder抛出了异常，中断了请求，在 set 正常值时链路时正常的。在输入即响应链路上，最主要的是CommandDecoder，其他要么是处理日志，要么是处理连接事件，这里我们着重分析一下CommandDecoder，即解码器。解码的第一步很重要，就是取出对应请求时的操作命令，redisson 是这么做的：

protected QueueCommand getCommand(ChannelHandlerContext ctx) {
    Queue<QueueCommandHolder> queue = ctx.channel().attr(CommandsQueue.COMMANDS_QUEUE).get();
    QueueCommandHolder holder = queue.peek();
    if (holder != null) {
        return holder.getCommand();
    }
    return null;
}

它从 channel 中取出了一个 queue，然后从 queue 中 peek 出了QueueCommandHolder作为当前的要处理的命令响应。那既然可以从 queue 中取，那又是什么时候塞入 queue 的呢，我们可以看到在请求的 pipiline 中，有一个CommandsQueue，它是在这里把命令塞到 queue 里面的。

@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
    if (msg instanceof QueueCommand) {
        QueueCommand data = (QueueCommand) msg;
        QueueCommandHolder holder = new QueueCommandHolder(data, promise);

        Queue<QueueCommandHolder> queue = ctx.channel().attr(COMMANDS_QUEUE).get();

        while (true) {
            if (lock.compareAndSet(false, true)) {
                try {
                    queue.add(holder);
                    ctx.writeAndFlush(data, holder.getChannelPromise());
                } finally {
                    lock.set(false);
                }
                break;
            }
        }
    } else {
        super.write(ctx, msg, promise);
    }
}

此时，我们已经大致知道 queue 写入和取出的对应关系，因为 peek 不会对 queue 中的数据做 remove 操作，所以必然有个地方把 queue 中的数据推出，redisson 的做法是在解码之后推出，decode 之后会调用如下的方法：

protected void sendNext(Channel channel) {
    Queue<QueueCommandHolder> queue = channel.attr(CommandsQueue.COMMANDS_QUEUE).get();
    queue.poll();
    state(null);
}

分析到这里，我们的问题其实基本已经有解了，在案例中 stringRedisTemplate set null 之后，由于在CommandEncode阶段抛出了空指针异常，导致请求没发出去，那么肯定不会有CommandDecoder阶段了，根据 pipeline 的顺序，CommandsQueue是在CommandEncoder之前执行，这也意味着 queue 中塞入了 set null（指指令 1）请求指令，而没有解码阶段将其从 queue 中推出，这导致在下一个 set 正常值（指令 2）的命令时，请求发到了 redis，redis 也进行了响应，但是在解码阶段，取出的操作指令是上一个请求（指令 1）的，从而解码异常，而且，由于此时取出的是第一次操作的指令（指令 1），第二次的指令（指令 2）还在等待下一次从 queue 中取出并处理（promise.tryFailure），进而导致正常 set（指令 2）的超时。自此，后续所有的请求都会出现错误，因为解码阶段取出的QueueCommand永远是上一次请求的，直到PinConnectionHandler出错，并断开连接，在ConnectionWatchDog的协同处理下重置整个 channel 的数据。redisson 作为一个成熟的开源框架，有这个 bug 实际是非常不应该的，解决的方案也就清晰了：
前置增加value判断

解决方案：

数据放入Redis前，增加key、value非空判断。

感悟：

在分析 redisson 为什么会异常蔓延的过程中，实际经历的过程远比文中复杂。
一个是 redisson 出错时的日志误导性比较强，看到错误日志的第一反应，要么是网络出错，要么是 redis 服务器出错，
在分析了很久之后才排除了网络和 redis 的原因，
从而转到分析 redisson 自身的实现机制。
另外，在分析 redisson 的过程中，最开始的设想其实是在请求阶段关闭了连接，但是没有释放句柄或者引用，这也耗费了比较长的时间。
在分析过程中，也打印了不少的 TRACE 日志，这也帮助排除了不少的错误方向。
总而言之，分析问题是个比较耗时的过程，也是个不断学习进步的过程，在这个过程中，熟悉了 redisson，也为今后自己的应用代码提供了很好的借鉴。