Redis Cluster Slots Storm

iteye_4064

Article link: https://blog.csdn.net/iteye_4064/article/details/82674241

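Summary (auto-generated by CSDN): This article analyzes a period of instability in a Redis Cluster deployment, which showed up as large swings in request RT, higher output bytes on some Redis instances, and rising CPU usage. The cause was a storm of `cluster slots` commands triggered by an old Jedis client. Packet capture showed 266 `cluster slots` calls within one second, which turned a single Redis instance into a hot spot. The fix is to upgrade to Jedis 2.8.0 or later, which shuffles the node list so that these requests are spread out and no single Redis server is overloaded.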

Our Redis Cluster service recently became unstable: overall request RT fluctuated widely and some requests failed.

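Symptoms

p99 request RT rose sharply during the affected periods
total_net_output_bytes spiked on a few Redis instances (100+ MBytes)
The call count in cmdstat_cluster (the CLUSTER command) spiked on a few Redis instances
CPU usage spiked on a few Redis instances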

Figure: total_net_output_bytes spike

Figure: CLUSTER command count spike
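
The two counters above come from the text output of the Redis INFO command. As a minimal sketch of reading them programmatically, the following parses INFO-style `key:value` lines and pulls out `total_net_output_bytes` and the `calls=` counter of `cmdstat_cluster`. The class name, method names, and the sample values in `main` are made up for illustration and are not taken from the incident.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RedisInfoSketch {
    /** Parses "key:value" lines of INFO output, skipping blank lines and "# Section" headers. */
    static Map<String, String> parseInfo(String info) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : info.split("\r?\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon), line.substring(colon + 1));
            }
        }
        return fields;
    }

    /** Extracts N from a commandstats value such as "calls=N,usec=...,usec_per_call=...". */
    static long calls(String cmdstatValue) {
        for (String part : cmdstatValue.split(",")) {
            if (part.startsWith("calls=")) {
                return Long.parseLong(part.substring("calls=".length()));
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        // Hypothetical sample values, for illustration only.
        String sample = "# Stats\r\n"
            + "total_net_output_bytes:104857600\r\n"
            + "# Commandstats\r\n"
            + "cmdstat_cluster:calls=266,usec=53200,usec_per_call=200.00\r\n";
        Map<String, String> fields = parseInfo(sample);
        System.out.println("output bytes: " + Long.parseLong(fields.get("total_net_output_bytes")));
        System.out.println("cluster calls: " + calls(fields.get("cmdstat_cluster")));
    }
}
```

Sampling these two values periodically and comparing them across instances is one way to spot the single instance that absorbs the extra CLUSTER traffic.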

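At the scene

The first step was to find out who was sending the CLUSTER commands. The official tcpflow build does not include timestamps in its captured output, which is frustrating and makes problems hard to trace back. Fortunately, there is a fork on GitHub that supports timestamps: https://github.com/mukhin/tcpflow.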

tcpflow -cp -x -i eth0 -cp dst port 12884 |grep --line-buffered -i -C2 'cluster' > tcpflow.12884.2016-10-07-11.log 2>&1

Captured data

2016-10-08 17:08:50 010.192.192.186.64292-010.192.192.192.12884: *2
2016-10-08 17:08:50 $7
2016-10-08 17:08:50 CLUSTER
2016-10-08 17:08:50 $5
2016-10-08 17:08:50 slots
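
The capture above is the Redis wire protocol (RESP) form of the command: `*2` announces an array of two elements, and each `$n` gives the byte length of the bulk string that follows. A minimal sketch of that encoding, with a hypothetical class and method name, looks like this:

```java
import java.nio.charset.StandardCharsets;

class RespEncodeSketch {
    /** Encodes a command as a RESP array of bulk strings, e.g. CLUSTER slots. */
    static String encode(String... args) {
        StringBuilder sb = new StringBuilder();
        // Array header: number of arguments.
        sb.append('*').append(args.length).append("\r\n");
        for (String arg : args) {
            // Bulk string header: length in bytes, then the payload.
            int byteLength = arg.getBytes(StandardCharsets.UTF_8).length;
            sb.append('$').append(byteLength).append("\r\n");
            sb.append(arg).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Show CRLF as visible escapes so the framing is easy to compare with the capture.
        System.out.println(encode("CLUSTER", "slots").replace("\r\n", "\\r\\n"));
    }
}
```

The request itself is only a few bytes. The cost lies in the reply, which lists every slot range together with its nodes.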

Request sources (count, source IP)

99 010.192.192.090
87 010.192.192.181
78 010.192.192.108
68 010.192.192.121
65 010.192.192.024
64 010.192.192.063
62 010.192.192.089
61 010.192.192.124
60 010.192.192.192
57 010.192.192.164
56 010.192.192.064
52 010.192.192.161
50 010.192.192.127
48 010.192.192.123
46 010.192.192.118
45 010.192.192.065
43 010.192.192.122
42 010.192.192.125
39 010.192.192.120
39 010.192.192.192
36 010.192.192.119
35 010.192.192.126
33 010.192.192.109
17 010.192.192.163
15 010.192.192.162
1 010.192.192.066

Request frequency (count, time)

266 17:08:51
204 17:08:53
193 17:08:52
120 17:08:54
98 17:08:50
93 17:08:56
92 17:08:55
82 17:08:59
80 17:08:57
70 17:08:58
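
The article does not show how the two tallies above were produced. One possible way, sketched below, is to match the connection header lines of the timestamped tcpflow log and count them per source IP and per second. The class name, the regular expression, and the assumption that every `*2` header line starts one `CLUSTER slots` request are illustrative choices, not the author's actual tooling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClusterSlotsTallySketch {
    // Layout: date, time (group 1), source IP (group 2), source port, destination, RESP array header.
    private static final Pattern HEADER = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} (\\d{2}:\\d{2}:\\d{2}) "
        + "(\\d{3}\\.\\d{3}\\.\\d{3}\\.\\d{3})\\.\\d+-\\S+: \\*2$");

    /** Counts header lines, keyed by the given regex group. */
    private static Map<String, Integer> tally(List<String> lines, int group) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = HEADER.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(group), 1, Integer::sum);
            }
        }
        return counts;
    }

    static Map<String, Integer> perSecond(List<String> lines) {
        return tally(lines, 1);
    }

    static Map<String, Integer> perSource(List<String> lines) {
        return tally(lines, 2);
    }

    public static void main(String[] args) {
        // The five lines of the capture shown earlier in the article.
        List<String> lines = Arrays.asList(
            "2016-10-08 17:08:50 010.192.192.186.64292-010.192.192.192.12884: *2",
            "2016-10-08 17:08:50 $7",
            "2016-10-08 17:08:50 CLUSTER",
            "2016-10-08 17:08:50 $5",
            "2016-10-08 17:08:50 slots");
        System.out.println("per second: " + perSecond(lines));
        System.out.println("per source: " + perSource(lines));
    }
}
```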

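From the IPs it was easy to tell that these commands came from Jedis clients. At the peak there were 266 calls in a single second, which amounts to a cluster slots storm. Because cluster slots was executed so frequently and its reply is large, this explains why the Redis instance's output volume grew.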

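Analyzing the Jedis client

We were running a fairly old Jedis version, 2.7.0. In this version, a network fluctuation makes the client retry the request against a random node, and that node replies with a MOVED redirection. This in turn causes a large number of clients to refresh their slot cache [JedisClusterConnectionHandler.renewSlotCache()]. Because the Redis server used for the refresh is chosen without shuffling the node list, all cluster slots requests go to the same Redis server. That server becomes a severe hot spot, which hurts the throughput and stability of the whole cluster.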

  // JedisClusterCommand.runWithRetries() in Jedis 2.7.0
  private T runWithRetries(String key, int redirections, boolean tryRandomNode, boolean asking) {
    if (redirections <= 0) {
      throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
    }

    Jedis connection = null;
    try {

      if (asking) {
        // TODO: Pipeline asking with the original command to make it
        // faster....
        connection = askConnection.get();
        connection.asking();

        // if asking success, reset asking flag
        asking = false;
      } else {
        if (tryRandomNode) {
          connection = connectionHandler.getConnection();
        } else {
          connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        }
      }

      return execute(connection);
    } catch (JedisConnectionException jce) {
      if (tryRandomNode) {
        // maybe all connection is down
        throw jce;
      }

      releaseConnection(connection, true);
      connection = null;

      // retry with random connection
      return runWithRetries(key, redirections - 1, true, asking);
    } catch (JedisRedirectionException jre) {
      if (jre instanceof JedisAskDataException) {
        asking = true;
        askConnection.set(this.connectionHandler.getConnectionFromNode(jre.getTargetNode()));
      } else if (jre instanceof JedisMovedDataException) {
        // it rebuilds cluster's slot cache
        // recommended by Redis cluster specification
        this.connectionHandler.renewSlotCache();
      } else {
        throw new JedisClusterException(jre);
      }

      releaseConnection(connection, false);
      connection = null;

      return runWithRetries(key, redirections - 1, false, asking);
    } finally {
      releaseConnection(connection, false);
    }

  }

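  // JedisClusterConnectionHandler.renewSlotCache() in Jedis 2.7.0
  public void renewSlotCache() {
    // The nodes are iterated in the map's own order, with no shuffle, and the loop
    // stops at the first node that answers, so every client queries the same node.
    for (JedisPool jp : cache.getNodes().values()) {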
      Jedis jedis = null;
      try {
        jedis = jp.getResource();
        cache.discoverClusterSlots(jedis);
        break;
      } finally {
        if (jedis != null) {
          jedis.close();
        }
      }
    }
  }
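
To make the hot-spot effect concrete, here is a small self-contained simulation. It is not Jedis code. It only models the node choice: each client either takes the head of a fixed node list (as the 2.7.0 loop effectively does) or takes the head of a shuffled copy of that list. The class name, node names, client count, and seed are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ShuffledRenewSketch {
    /** Returns the node a client would query first, optionally after shuffling a copy of the list. */
    static String pickNode(List<String> nodes, boolean shuffle, Random rnd) {
        List<String> candidates = new ArrayList<>(nodes);
        if (shuffle) {
            Collections.shuffle(candidates, rnd);
        }
        return candidates.get(0);
    }

    /** Counts how many simulated clients send their slot refresh to each node. */
    static Map<String, Integer> simulate(List<String> nodes, int clients, boolean shuffle, long seed) {
        Random rnd = new Random(seed);
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String node : nodes) {
            hits.put(node, 0);
        }
        for (int i = 0; i < clients; i++) {
            hits.merge(pickNode(nodes, shuffle, rnd), 1, Integer::sum);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical node names; 266 mirrors the peak per-second count in the capture.
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c");
        System.out.println("fixed order: " + simulate(nodes, 266, false, 42L));
        System.out.println("shuffled:    " + simulate(nodes, 266, true, 42L));
    }
}
```

With the fixed order, all 266 refreshes land on the first node. With shuffling, they are spread across the nodes, which is the behavior the summary attributes to Jedis 2.8.0 and later.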