Investigating and reproducing cluster chaos caused by instances stuck in the fail state

Background: the company's cache-management cloud platform has an instance-clone feature (after a machine crashes, its instances are started on another server and re-attached to the original cluster as replicas). After one such clone operation the clusters ended up in a chaotic, entangled state. The preliminary diagnosis: when an instance was decommissioned, the cluster never successfully forgot the removed node, so its ip:port stayed in the node table in the fail state; when the same ip and port were later started again and joined to a different cluster, the original cluster automatically reclaimed the node, and the clusters became entangled.

Several questions need to be answered:

  1. Do the clusters end up with dirty data (i.e. does data from conflicting slots of different clusters get mixed together)?
  2. What exactly triggers the cluster chaos?
  3. Once the clusters are entangled, what logic decides how slots and data are overwritten?

I. Environment preparation

Prepare four CentOS virtual machines with the following IPs.

10.4.7.221
10.4.7.222
10.4.7.223
10.4.7.224

The cluster layout is as follows: three clusters are deployed across 221, 222 and 223, and 224 serves as the clone target. Master and replica ports follow a fixed rule, e.g. 8001 is a master and 8011 is its replica. This makes it easy to script instance startup and master/replica pairing, and also makes it easy to read the node tables once the clusters become entangled.

Cluster A: 3 masters, 3 replicas

10.4.7.221:8001
10.4.7.222:8011

10.4.7.222:8002
10.4.7.223:8012

10.4.7.223:8003
10.4.7.221:8013

Cluster B: 4 masters, 4 replicas

10.4.7.221:9001
10.4.7.222:9011

10.4.7.222:9002
10.4.7.223:9012

10.4.7.223:9003
10.4.7.221:9013

10.4.7.221:9004
10.4.7.222:9014

Cluster C: 5 masters, 5 replicas

10.4.7.221:7001
10.4.7.222:7011

10.4.7.222:7002
10.4.7.223:7012

10.4.7.223:7003
10.4.7.221:7013

10.4.7.221:7004
10.4.7.222:7014

10.4.7.222:7005
10.4.7.223:7015

Directory layout

Log directory:
/app/cachecloud/logs
Config directory:
/app/cachecloud/conf
Data directory:
/app/cachecloud/data

Configuration template: create a config file for every planned instance under /app/cachecloud/conf (compiling and installing Redis itself is omitted here).

port 8001
cluster-enabled yes
cluster-node-timeout 15000
cluster-config-file "nodes-8001.conf"
bind 0.0.0.0
daemonize yes
logfile "/app/cachecloud/logs/redis-a-8001.log"
dir /app/cachecloud/data
dbfilename dump-8001.rdb
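
Creating two dozen nearly identical config files by hand is tedious, so they can be rendered from the template above. The following is a minimal sketch of that idea, not the platform's actual tooling; it assumes the redis-<cluster>-<port>.conf naming used later in this write-up, and the INSTANCES list must be adjusted to whichever server it runs on (223's planned instances are shown as the example).

import os

CONF_DIR = "/app/cachecloud/conf"

# Same template as above, with the port and cluster letter parameterized.
TEMPLATE = """port {port}
cluster-enabled yes
cluster-node-timeout 15000
cluster-config-file "nodes-{port}.conf"
bind 0.0.0.0
daemonize yes
logfile "/app/cachecloud/logs/redis-{cluster}-{port}.log"
dir /app/cachecloud/data
dbfilename dump-{port}.rdb
"""

# Instances planned for *this* server (example: 10.4.7.223); adjust per host.
INSTANCES = [("a", 8012), ("a", 8003),
             ("b", 9012), ("b", 9003),
             ("c", 7012), ("c", 7003), ("c", 7015)]

os.makedirs(CONF_DIR, exist_ok=True)
for cluster, port in INSTANCES:
    path = os.path.join(CONF_DIR, "redis-{}-{}.conf".format(cluster, port))
    with open(path, "w") as f:
        f.write(TEMPLATE.format(cluster=cluster, port=port))
    print("wrote", path)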

Apply the following kernel and firewall settings on each server:

# Disable transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Allow the kernel to overcommit memory regardless of the current memory state
echo 1 > /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory=1
# Stop the firewall (important, otherwise the cluster bus traffic will fail)
systemctl stop firewalld

To rebuild the clusters from scratch, wipe the old state first (kill the redis-server processes, then delete the cached cluster nodes-*.conf files, log files and data files):

cd /app/cachecloud/conf;ps -ef | grep redis-server | grep -v grep | awk '{print $2}' | xargs kill;rm -f dump.rdb nodes-*.conf;rm -f /app/cachecloud/logs/*.log;rm -f /app/cachecloud/data/*.rdb

On 221, 222 and 223, start all the instances:

cd /app/cachecloud/conf;for i in `ls -l | grep redis | awk '{print $9}'`;do echo `redis-server $i`; done;

On 221, run the CLUSTER MEET commands to form each cluster:

# Cluster A
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.221 8001
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.222 8011
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.222 8002
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.223 8012
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.223 8003
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.221 8013

# Cluster B
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9001
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9011
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9002
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.223 9012
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.223 9003
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9013
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9004
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9014

# Cluster C
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7001
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7011
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7002
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7012
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7003
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7013
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7004
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7014
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7005
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7015

Assign slots to the three clusters; the ranges are deliberately split a little differently per cluster so they are easy to tell apart later:

redis-cli -h 10.4.7.221 -p 8001 cluster addslots {0..5555}
redis-cli -h 10.4.7.222 -p 8002 cluster addslots {5556..11112}
redis-cli -h 10.4.7.223 -p 8003 cluster addslots {11113..16383}

redis-cli -h 10.4.7.221 -p 9001 cluster addslots {0..4096}
redis-cli -h 10.4.7.222 -p 9002 cluster addslots {4097..8192}
redis-cli -h 10.4.7.223 -p 9003 cluster addslots {8193..12288}
redis-cli -h 10.4.7.221 -p 9004 cluster addslots {12289..16383}

redis-cli -h 10.4.7.221 -p 7001 cluster addslots {0..3278}
redis-cli -h 10.4.7.222 -p 7002 cluster addslots {3279..6556}
redis-cli -h 10.4.7.223 -p 7003 cluster addslots {6557..9834}
redis-cli -h 10.4.7.221 -p 7004 cluster addslots {9835..13112}
redis-cli -h 10.4.7.222 -p 7005 cluster addslots {13113..16383}
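
Before writing data it is worth checking that every cluster has all 16384 slots assigned and reports cluster_state:ok. A minimal convenience sketch of such a check, shelling out to redis-cli from Python (the seed nodes are the ones used above; this is not part of the original procedure):

import subprocess

# One reachable seed node per cluster.
SEEDS = {"A": ("10.4.7.221", "8001"),
         "B": ("10.4.7.221", "9001"),
         "C": ("10.4.7.221", "7001")}

for name, (host, port) in SEEDS.items():
    out = subprocess.check_output(
        ["redis-cli", "-h", host, "-p", port, "cluster", "info"]).decode()
    # CLUSTER INFO returns "field:value" lines.
    fields = dict(line.strip().split(":", 1) for line in out.splitlines() if ":" in line)
    print(name, fields.get("cluster_state"),
          "slots_assigned=" + fields.get("cluster_slots_assigned", "?"))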

On 221, set up the master/replica pairings:

masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 8011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 8012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 8013 cluster replicate $masternodeid;

masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 9011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 9012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 9013 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9004 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 9014 cluster replicate $masternodeid;

masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 7011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 7012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 7013 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7004 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 7014 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7005 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 7015 cluster replicate $masternodeid;
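
The one-liners above lean on the port convention from the planning section (replica port = master port + 10). The same convention can drive a small script that resolves each master's node id from CLUSTER NODES and issues CLUSTER REPLICATE for every replica. This is a hedged sketch derived from the plan above, not the platform's actual pairing script:

import subprocess

def cli(host, port, *args):
    """Run a redis-cli command and return its stdout as text."""
    return subprocess.check_output(["redis-cli", "-h", host, "-p", str(port)] + list(args)).decode()

# Seed node of each cluster and the replicas to attach, taken from the plan above.
PAIRS = {
    ("10.4.7.221", 8001): [("10.4.7.222", 8011), ("10.4.7.223", 8012), ("10.4.7.221", 8013)],
    ("10.4.7.221", 9001): [("10.4.7.222", 9011), ("10.4.7.223", 9012), ("10.4.7.221", 9013), ("10.4.7.222", 9014)],
    ("10.4.7.221", 7001): [("10.4.7.222", 7011), ("10.4.7.223", 7012), ("10.4.7.221", 7013), ("10.4.7.222", 7014), ("10.4.7.223", 7015)],
}

for (seed_host, seed_port), replicas in PAIRS.items():
    # CLUSTER NODES prints one line per node: "<node-id> <ip:port> <flags> ...".
    id_by_port = {}
    for line in cli(seed_host, seed_port, "cluster", "nodes").splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        addr = parts[1].split("@")[0]              # strip the cluster-bus suffix if present
        id_by_port[int(addr.rsplit(":", 1)[1])] = parts[0]
    for rep_host, rep_port in replicas:
        master_id = id_by_port[rep_port - 10]      # port rule: replica port = master port + 10
        print(rep_host, rep_port, cli(rep_host, rep_port, "cluster", "replicate", master_id).strip())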

Once the clusters are ready, check their state. Record the node tables after every operation so that instance changes can be compared later.

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes


[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648189621790 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648189623797 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 slave e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 0 1648189624800 5 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 0 1648189623295 4 connected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 10.4.7.223:8003 master - 0 1648189622794 5 connected 11113-16383

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave 5f201e00106a512ab3a3d73455ee1269b367b204 0 1648189622893 5 connected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648189619884 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648189620888 4 connected 4097-8192
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 slave ca7e61ba84d005dc32570377a7e57cdc2b1ea395 0 1648189617878 3 connected
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648189623898 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648189621891 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master - 0 1648189622392 2 connected 8193-12288
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648189621892 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 0 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648189623900 9 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master - 0 1648189624904 6 connected 6557-9834
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 slave 94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 0 1648189620887 6 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave 3df4258d28c93acb80444e56a330c5331681d413 0 1648189620387 7 connected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648189622393 5 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648189622894 9 connected 9835-13112
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648189622394 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648189619886 2 connected

Run the following Python scripts (using the redis-py-cluster package's StrictRedisCluster client) to insert 1,000,000 keys into each of the three clusters. Each cluster uses its own key prefix so the data distribution after the chaos is easy to inspect.

import redis
import time
import traceback
import random
from time import ctime, sleep
from rediscluster import StrictRedisCluster

# Generate a random string of the given length
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIGKLMNOPQRSTUVWXYZabcdefghigklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str

def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)

# Cluster A startup nodes
startup_nodes = [
    {"host": "10.4.7.221", "port": 8001},
    {"host": "10.4.7.222", "port": 8011},
    {"host": "10.4.7.222", "port": 8002},
    {"host": "10.4.7.223", "port": 8012},
    {"host": "10.4.7.223", "port": 8003},
    {"host": "10.4.7.221", "port": 8013}
]

redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')
p = redis_conn.pipeline()
for i in range(0, 1000000):
    p.set('cluster:A:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
p.execute()  # flush the last partial batch of commands

import redis
import time
import traceback
import random
from time import ctime, sleep
from rediscluster import StrictRedisCluster

# Generate a random string of the given length
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIGKLMNOPQRSTUVWXYZabcdefghigklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str

def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)

# Cluster B startup nodes
startup_nodes = [
    {"host": "10.4.7.221","port": 9001},
    {"host": "10.4.7.222","port": 9011},
    {"host": "10.4.7.222","port": 9002},
    {"host": "10.4.7.223","port": 9012},
    {"host": "10.4.7.223","port": 9003},
    {"host": "10.4.7.221","port": 9013},
    {"host": "10.4.7.221","port": 9004},
    {"host": "10.4.7.222","port": 9014}
]

redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')

p = redis_conn.pipeline()
for i in range(0, 1000000):
    # String keys for cluster B
    p.set('cluster:B:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
p.execute()  # flush the last partial batch of commands

import redis
import time
import traceback
import random
from time import ctime, sleep
from rediscluster import StrictRedisCluster

# Generate a random string of the given length
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIGKLMNOPQRSTUVWXYZabcdefghigklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str

def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)

# Cluster C startup nodes
startup_nodes = [
    {"host": "10.4.7.221","port": 7001},
    {"host": "10.4.7.222","port": 7011},
    {"host": "10.4.7.222","port": 7002},
    {"host": "10.4.7.223","port": 7012},
    {"host": "10.4.7.223","port": 7003},
    {"host": "10.4.7.221","port": 7013},
    {"host": "10.4.7.221","port": 7004},
    {"host": "10.4.7.222","port": 7014},
    {"host": "10.4.7.222","port": 7005},
    {"host": "10.4.7.223","port": 7015}
]


redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')

p = redis_conn.pipeline()
for i in range(0, 1000000):
    # String keys for cluster C
    p.set('cluster:C:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
p.execute()  # flush the last partial batch of commands

After loading the data, record each node's key count and memory usage so they can be compared against the data distribution after the chaos.

redis-cli -h 10.4.7.221 -p 8001 dbsize
redis-cli -h 10.4.7.222 -p 8002 dbsize
redis-cli -h 10.4.7.223 -p 8003 dbsize
redis-cli -h 10.4.7.221 -p 9001 dbsize
redis-cli -h 10.4.7.222 -p 9002 dbsize
redis-cli -h 10.4.7.223 -p 9003 dbsize
redis-cli -h 10.4.7.221 -p 9004 dbsize
redis-cli -h 10.4.7.221 -p 7001 dbsize
redis-cli -h 10.4.7.222 -p 7002 dbsize
redis-cli -h 10.4.7.223 -p 7003 dbsize
redis-cli -h 10.4.7.221 -p 7004 dbsize
redis-cli -h 10.4.7.222 -p 7005 dbsize

redis-cli -h 10.4.7.222 -p 8011 dbsize
redis-cli -h 10.4.7.223 -p 8012 dbsize
redis-cli -h 10.4.7.221 -p 8013 dbsize
redis-cli -h 10.4.7.222 -p 9011 dbsize
redis-cli -h 10.4.7.223 -p 9012 dbsize
redis-cli -h 10.4.7.221 -p 9013 dbsize
redis-cli -h 10.4.7.222 -p 9014 dbsize
redis-cli -h 10.4.7.222 -p 7011 dbsize
redis-cli -h 10.4.7.223 -p 7012 dbsize
redis-cli -h 10.4.7.221 -p 7013 dbsize
redis-cli -h 10.4.7.222 -p 7014 dbsize
redis-cli -h 10.4.7.223 -p 7015 dbsize

redis-cli -h 10.4.7.221 -p 8001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 8002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 8003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 9003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9004 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 7003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7004 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7005 info memory | grep used_memory_human

redis-cli -h 10.4.7.222 -p 8011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 8012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 8013 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 9012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9013 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9014 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 7012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7013 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7014 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 7015 info memory | grep used_memory_human
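
Since the same two checks (dbsize and used_memory_human) are run against every instance, a small loop can collect both in one pass and print a table that is easy to diff before and after the chaos. A hedged convenience sketch; the instance list is simply the one planned above:

import subprocess

def cli(host, port, *args):
    return subprocess.check_output(["redis-cli", "-h", host, "-p", str(port)] + list(args)).decode()

# (host, port) of every instance in the three clusters.
INSTANCES = [
    ("10.4.7.221", 8001), ("10.4.7.222", 8002), ("10.4.7.223", 8003),
    ("10.4.7.222", 8011), ("10.4.7.223", 8012), ("10.4.7.221", 8013),
    ("10.4.7.221", 9001), ("10.4.7.222", 9002), ("10.4.7.223", 9003), ("10.4.7.221", 9004),
    ("10.4.7.222", 9011), ("10.4.7.223", 9012), ("10.4.7.221", 9013), ("10.4.7.222", 9014),
    ("10.4.7.221", 7001), ("10.4.7.222", 7002), ("10.4.7.223", 7003), ("10.4.7.221", 7004), ("10.4.7.222", 7005),
    ("10.4.7.222", 7011), ("10.4.7.223", 7012), ("10.4.7.221", 7013), ("10.4.7.222", 7014), ("10.4.7.223", 7015),
]

for host, port in INSTANCES:
    dbsize = cli(host, port, "dbsize").strip()
    memory = [l for l in cli(host, port, "info", "memory").splitlines()
              if l.startswith("used_memory_human")]
    print("{}:{} dbsize={} {}".format(host, port, dbsize, memory[0].strip() if memory else "n/a"))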



[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 8001 dbsize
(integer) 338674
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 8002 dbsize
(integer) 338920
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 8003 dbsize
(integer) 321407
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9001 dbsize
(integer) 249806
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9002 dbsize
(integer) 249755
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 9003 dbsize
(integer) 249754
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9004 dbsize
(integer) 249686
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7001 dbsize
(integer) 199887
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7002 dbsize
(integer) 199862
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7003 dbsize
(integer) 200014
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7004 dbsize
(integer) 199857
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7005 dbsize
(integer) 199381
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 8011 dbsize
(integer) 338674
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 8012 dbsize
(integer) 338920
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 8013 dbsize
(integer) 321407
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9011 dbsize
(integer) 249806
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 9012 dbsize
(integer) 249755
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9013 dbsize
(integer) 249754
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9014 dbsize
(integer) 249686
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7011 dbsize
(integer) 199887
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7012 dbsize
(integer) 199862
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7013 dbsize
(integer) 200014
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7014 dbsize
(integer) 199857
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7015 dbsize
(integer) 199381


[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 8001 info memory | grep used_memory_human
used_memory_human:65.04M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 8002 info memory | grep used_memory_human
used_memory_human:65.09M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 8003 info memory | grep used_memory_human
used_memory_human:62.05M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9001 info memory | grep used_memory_human
used_memory_human:47.64M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9002 info memory | grep used_memory_human
used_memory_human:47.64M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 9003 info memory | grep used_memory_human
used_memory_human:47.65M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9004 info memory | grep used_memory_human
used_memory_human:47.63M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7001 info memory | grep used_memory_human
used_memory_human:39.01M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7002 info memory | grep used_memory_human
used_memory_human:39.01M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7003 info memory | grep used_memory_human
used_memory_human:39.03M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7004 info memory | grep used_memory_human
used_memory_human:39.01M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7005 info memory | grep used_memory_human
used_memory_human:38.97M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 8011 info memory | grep used_memory_human
used_memory_human:64.03M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 8012 info memory | grep used_memory_human
used_memory_human:64.07M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 8013 info memory | grep used_memory_human
used_memory_human:61.08M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9011 info memory | grep used_memory_human
used_memory_human:46.64M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 9012 info memory | grep used_memory_human
used_memory_human:46.64M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 9013 info memory | grep used_memory_human
used_memory_human:46.65M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 9014 info memory | grep used_memory_human
used_memory_human:46.63M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7011 info memory | grep used_memory_human
used_memory_human:38.02M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7012 info memory | grep used_memory_human
used_memory_human:38.01M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.221 -p 7013 info memory | grep used_memory_human
used_memory_human:38.04M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.222 -p 7014 info memory | grep used_memory_human
used_memory_human:38.02M
[root@redis-7-221 conf]# redis-cli -h 10.4.7.223 -p 7015 info memory | grep used_memory_human
used_memory_human:37.93M

II. Shut a server down and restart its instances on another machine

Copy 223's config files to the same directory on 224, then power 223 off to simulate a server crash.

On 224, run the same startup script to bring up the instances that used to run on 223:

cd /app/cachecloud/conf;for i in `ls -l | grep redis | awk '{print $9}'`;do echo `redis-server $i`; done;

MEET the newly started 224 instances into their original clusters:

redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.224 8012
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.224 8003
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.224 9012
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.224 9003
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.224 7012
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.224 7003
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.224 7015

Re-attach the 224 instances as replicas of the surviving masters:

masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep -v 223:| grep :8013 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 8003 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep -v 223:| grep :8002 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 8012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep -v 223:| grep :9013 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 9003 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep -v 223:| grep :9002 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 9012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep -v 223:| grep :7013 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 7003 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep -v 223:| grep :7002 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 7012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep -v 223:| grep :7005 | awk '{print $1}'`;redis-cli -h 10.4.7.224 -p 7015 cluster replicate $masternodeid;

Record the cluster node tables:

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes


[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648196595409 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648196597418 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648196596412 6 connected 11113-16383
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 0 1648196593902 7 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave 3e42b6080162017b33ae87caac67daad9eb77046 0 1648196591390 6 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 connected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 10.4.7.223:8003 master,fail - 1648196457116 1648196456616 5 connected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 connected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648196597418 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648196592393 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave 577e5ea9c7f56c3767cfcfa23905742d97448b02 0 1648196593903 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648196597917 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648196595408 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648196596413 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master,fail - 1648196457015 1648196454609 2 connected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave 5f201e00106a512ab3a3d73455ee1269b367b204 0 1648196594405 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave 3df4258d28c93acb80444e56a330c5331681d413 0 1648196594406 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648196597418 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 0 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648196596915 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648196595409 12 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 connected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648196592393 10 connected 6557-9834
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave 955014596d1929366f655ad6a020195432da917e 0 1648196595209 13 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 connected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648196598420 9 connected 9835-13112
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648196593400 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648196596414 2 connected
[root@redis-7-221 conf]#

Wait for the A, B and C replicas on 224 to finish syncing from their masters, then power 224 off to simulate another server crash.

Record the cluster node tables:

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes

[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648196853711 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648196851701 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648196852704 6 connected 11113-16383
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196696080 1648196690858 7 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave,fail 3e42b6080162017b33ae87caac67daad9eb77046 1648196696081 1648196694878 6 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 connected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 10.4.7.223:8003 master,fail - 1648196457116 1648196456616 5 connected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 connected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648196853611 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648196853106 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave,fail 577e5ea9c7f56c3767cfcfa23905742d97448b02 1648196696058 1648196694353 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648196850595 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648196851599 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648196852605 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master,fail - 1648196457015 1648196454609 2 connected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196696059 1648196695859 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196696163 1648196695859 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648196850597 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 0 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648196855119 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196696056 1648196689348 12 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 connected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648196854614 10 connected 6557-9834
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave,fail 955014596d1929366f655ad6a020195432da917e 1648196696056 1648196691350 13 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 connected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648196854112 9 connected 9835-13112
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648196852606 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648196849590 2 connected
[root@redis-7-221 conf]#

III. Starting instances against the wrong clusters

Boot 223 again; disable transparent huge pages, set the memory overcommit policy and stop the firewall:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 1 > /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory=1
systemctl stop firewalld

1) First cross-cluster startup test

On 223, start instances on "mismatched" ports for testing (the rdb files have not been deleted at this point). The ports to use are those of instances that originally ran on 223 and are still marked fail in their clusters after the shutdown, listed below. The plan is to start these instances and MEET them into a different cluster, to try to reproduce the chaos.

# Cluster A
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 10.4.7.223:8003 master,fail - 1648196457116 1648196456616 5 connected
# Cluster B
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master,fail - 1648196457015 1648196454609 2 connected
# Cluster C
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 connected

Start the 8003 instance, watch the log, and once the original cluster starts syncing data to it, MEET it into another cluster.

redis-server redis-a-8003.conf

After waiting a while and checking the startup log /app/cachecloud/logs/redis-a-8003.log, no data sync had taken place.

Startup log of the 8003 instance on 223:
1415:M 25 Mar 16:30:56.808 * Increased maximum number of open files to 10032 (it was originally set to 1024).
1415:M 25 Mar 16:30:56.811 * No cluster configuration found, I'm f91f4853ed6b725e5aff15377e887b25bb53bee3
1415:M 25 Mar 16:30:56.813 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1415:M 25 Mar 16:30:56.813 # Server started, Redis version 3.2.3
1415:M 25 Mar 16:30:56.814 * DB loaded from disk: 0.000 seconds
1415:M 25 Mar 16:30:56.814 * The server is now ready to accept connections on port 8003
1415:M 25 Mar 16:30:56.893 # IP address for this node updated to 10.4.7.223

After waiting a while with still no data sync, MEET the instance into a different cluster (it originally belonged to cluster A on the 800x ports; now cluster C, via its 7001 node, MEETs it in):

redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 8003

The former cluster A instance was MEETed into cluster C without incident; the chaos did not reproduce.
PS: the MEET must be issued from a node of the other cluster towards the newly started instance, not the other way around. The normal provisioning flow is: start the instance -> MEET it into the cluster -> set up replication. What this experiment tries to reproduce is: the instance starts, its original cluster discovers it and begins syncing data to it, and only then does another cluster MEET it in. The two steps must not follow each other too quickly, otherwise the scenario cannot be reproduced.

Record the current state of clusters A, B and C:

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes


[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648197599294 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648197593238 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648197600302 6 connected 11113-16383
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196696080 1648196690858 7 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave,fail 3e42b6080162017b33ae87caac67daad9eb77046 1648196696081 1648196694878 6 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 disconnected
# In the original cluster, the fail entry for this ip:port has changed to ":0 master,fail,noaddr"
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed :0 master,fail,noaddr - 1648196457116 1648196456616 5 disconnected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 disconnected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648197599901 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648197599399 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave,fail 577e5ea9c7f56c3767cfcfa23905742d97448b02 1648196696058 1648196694353 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648197600405 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648197598390 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648197596371 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master,fail - 1648196457015 1648196454609 2 disconnected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196696059 1648196695859 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196696163 1648196695859 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648197600409 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 14 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648197599907 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196696056 1648196689348 12 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 disconnected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648197596875 10 connected 6557-9834
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave,fail 955014596d1929366f655ad6a020195432da917e 1648196696056 1648196691350 13 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 disconnected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 disconnected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648197600910 9 connected 9835-13112
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648197599907 0 connected
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648197595867 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648197597887 14 connected

2) Second cross-cluster startup test

Continue on 223 with the next mismatched port:

# Cluster B
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master,fail - 1648196457015 1648196454609 2 connected

Start the 9003 instance and watch the log, intending to MEET it into another cluster once data sync begins; again, no data sync took place.

redis-server redis-b-9003.conf

After the instance had been up for a while with no data sync, another cluster MEETed it in:

redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 9003

It joined the other cluster normally; the chaos did not reproduce.

Record the current state of the instances in clusters A, B and C.


redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes


[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648198584773 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648198581751 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648198583764 6 connected 11113-16383
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196696080 1648196690858 7 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave,fail 3e42b6080162017b33ae87caac67daad9eb77046 1648196696081 1648196694878 6 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 disconnected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed :0 master,fail,noaddr - 1648196457116 1648196456616 5 disconnected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 disconnected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648198582961 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648198580441 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave,fail 577e5ea9c7f56c3767cfcfa23905742d97448b02 1648196696058 1648196694353 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648198581951 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648198584973 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648198583968 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 :0 master,fail,noaddr - 1648196457015 1648196454609 2 disconnected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196696059 1648196695859 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196696163 1648196695859 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648198586039 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 14 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648198580979 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196696056 1648196689348 12 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 disconnected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648198585532 10 connected 6557-9834
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave,fail 955014596d1929366f655ad6a020195432da917e 1648196696056 1648196691350 13 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 disconnected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 disconnected
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648198582499 15 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648198584519 9 connected 9835-13112
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648198581992 0 connected
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648198583509 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648198581484 14 connected

3) Third cross-cluster startup test

Continue on 223 with the next mismatched port:

# Cluster C
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 connected

Start the 7003 instance and watch the log, planning to MEET it into another cluster once data sync begins.

redis-server redis-c-7003.conf

No data sync took place. Record the state of each cluster's instances.

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes

[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648198837001 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648198839017 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648198838007 6 connected 11113-16383
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196696080 1648196690858 7 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave,fail 3e42b6080162017b33ae87caac67daad9eb77046 1648196696081 1648196694878 6 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 disconnected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed :0 master,fail,noaddr - 1648196457116 1648196456616 5 disconnected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 disconnected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648198837202 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648198838211 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave,fail 577e5ea9c7f56c3767cfcfa23905742d97448b02 1648196696058 1648196694353 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648198839215 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648198836697 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648198837706 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 :0 master,fail,noaddr - 1648196457015 1648196454609 2 disconnected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196696059 1648196695859 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196696163 1648196695859 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648198837708 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 14 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648198836701 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196696056 1648196689348 12 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 :0 master,fail,noaddr - 1648196457015 1648196454609 6 disconnected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648198837203 10 connected 6557-9834
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave,fail 955014596d1929366f655ad6a020195432da917e 1648196696056 1648196691350 13 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 disconnected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 disconnected
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648198834178 15 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648198838211 9 connected 9835-13112
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648198836198 0 connected
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648198834178 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648198839221 14 connected
c905ac47b6a74a6be2ae6ef2acb102b86463a192 10.4.7.223:7003 master - 0 1648198835187 16 connected

After the instance had been up for a while with no data sync, MEET it into a different cluster:

redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.223 7003

Checking the cluster state now shows the chaos: clusters A and C have become entangled. Both "clusters" report exactly the same set of instances; nodes that used to belong to one cluster or the other now appear in both.

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes

[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648199003766 4 connected
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648199007813 14 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 0 1 connected
c905ac47b6a74a6be2ae6ef2acb102b86463a192 10.4.7.223:7003 master - 0 1648199004272 16 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648199001749 4 connected 3279-6556
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648199003261 14 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648199007812 9 connected 9835-13112
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648199005282 10 connected 6557-9834
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648199006798 9 connected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed :0 master,fail,noaddr - 1648196457116 1648196456616 5 disconnected
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 slave 3e42b6080162017b33ae87caac67daad9eb77046 0 1648199006798 6 connected
25c3ef545c93e7f55b9a86d3841ae6216eb66582 10.4.7.223:7015 handshake - 1648198996399 0 0 disconnected
cef209f7ad1684972b3519df39884daf7e3b33a3 10.4.7.224:7003 handshake - 1648198997712 0 0 connected
367718bbbc764cfa68d965fa680daa9eaa761917 10.4.7.224:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196696080 1648196690858 7 connected
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648199001245 15 connected
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648199002249 6 connected 13113-16383
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 master - 0 1648199007302 14 connected 0-3278
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave,fail b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 1648196457015 1648196453603 4 disconnected
9177fbf6d76a8fbf7fc615b7d3346f99fa1cc621 10.4.7.224:7015 handshake - 0 0 0 connected
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648199006291 0 connected
9c7ad087511eef9e0ce3707dc15b53870f34ecb4 10.4.7.224:7012 handshake - 1648198997712 0 0 connected
d2bf2df82f454d7a8ffc3be86a7b1d6f63f41df8 10.4.7.224:8003 slave,fail 3e42b6080162017b33ae87caac67daad9eb77046 1648196696081 1648196694878 6 connected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196457015 1648196454106 5 disconnected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648199004633 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648199005637 4 connected 4097-8192
bd4d895a634d586d645a3b2fe3f0d5217e9d1f32 10.4.7.224:9003 slave,fail 577e5ea9c7f56c3767cfcfa23905742d97448b02 1648196696058 1648196694353 9 connected
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648199006639 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648199007644 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648199002616 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 :0 master,fail,noaddr - 1648196457015 1648196454609 2 disconnected
d8ed793b07efaad87a03315808f9e92b462f2fff 10.4.7.224:9012 slave,fail 5f201e00106a512ab3a3d73455ee1269b367b204 1648196696059 1648196695859 4 connected
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 14 connected 0-3278
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648199004779 4 connected
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648199005283 9 connected
1fca4edaf4d453ac3daddbae0ed23f64370a5bf6 10.4.7.224:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196696056 1648196689348 12 connected
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648199002755 10 connected 6557-9834
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648199003765 6 connected 13113-16383
6312347acd744c21a57eea929316681ca5c89f3d 10.4.7.224:8003 handshake - 0 0 0 connected
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648199007813 14 connected
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 slave 3e42b6080162017b33ae87caac67daad9eb77046 0 1648199003264 6 connected
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648199007811 0 connected
ba48e3d56b1405af07e34e6fbc34264fe4b893a6 10.4.7.224:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196696163 1648196695859 11 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648199005788 4 connected 3279-6556
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 :0 master,fail,noaddr - 1648196457015 1648196454609 6 disconnected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave,fail 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 1648196457015 1648196449588 5 disconnected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave,fail 3df4258d28c93acb80444e56a330c5331681d413 1648196457015 1648196454106 7 disconnected
7cf3322b4c2da99138b4daf121c336087ee2fd05 10.4.7.224:7003 slave,fail 955014596d1929366f655ad6a020195432da917e 1648196696056 1648196691350 13 connected
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648199008822 15 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648199006798 4 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648199008819 9 connected 9835-13112
33ce47dc369d22aebfae29e9e8c39d48ad53e1aa 10.4.7.224:8012 handshake - 1648198997711 0 0 connected
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648199008315 14 connected
c905ac47b6a74a6be2ae6ef2acb102b86463a192 10.4.7.223:7003 master - 0 1648199001747 16 connected
[root@redis-7-221 conf]#

Record the effective (non-fail, non-handshake) master nodes of each cluster in the entangled state.

redis-cli -p 8001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave
redis-cli -p 9001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave
redis-cli -p 7001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave

[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave
c905ac47b6a74a6be2ae6ef2acb102b86463a192 10.4.7.223:7003 master - 0 1648199296535 16 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648199300058 4 connected 3279-6556
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648199294516 9 connected 9835-13112
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648199298045 10 connected 6557-9834
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648199299555 15 connected
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648199293000 6 connected 13113-16383
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 master - 0 1648199295021 14 connected 0-3278
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648199299055 0 connected

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648199296628 4 connected 4097-8192
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 master - 0 1648199296122 8 connected 8193-12288
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648199300160 6 connected 12289-16383
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes | grep -v fail | grep -v handshake | grep -v slave
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 14 connected 0-3278
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 master - 0 1648199298551 10 connected 6557-9834
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 master - 0 1648199297542 6 connected 13113-16383
f91f4853ed6b725e5aff15377e887b25bb53bee3 10.4.7.223:8003 master - 0 1648199299557 0 connected
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648199299056 4 connected 3279-6556
41b5990df1d6b1f41409c3fbfc98e696a37a3cac 10.4.7.223:9003 master - 0 1648199298049 15 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648199298551 9 connected 9835-13112
c905ac47b6a74a6be2ae6ef2acb102b86463a192 10.4.7.223:7003 master - 0 1648199300568 16 connected

Slot distribution before the chaos

Cluster A
10.4.7.221 8001 {0..5555}
10.4.7.222 8002 {5556..11112}
10.4.7.223 8003 {11113..16383}
Cluster B
10.4.7.221 9001 {0..4096}
10.4.7.222 9002 {4097..8192}
10.4.7.223 9003 {8193..12288}
10.4.7.221 9004 {12289..16383}
Cluster C
10.4.7.221 7001 {0..3278}
10.4.7.222 7002 {3279..6556}
10.4.7.223 7003 {6557..9834}
10.4.7.221 7004 {9835..13112}
10.4.7.222 7005 {13113..16383}

Slot distribution after the chaos

10.4.7.221:7001 0-3278
10.4.7.222:7002 3279-6556
10.4.7.221:7013 6557-9834
10.4.7.221:7004 9835-13112
10.4.7.221:8013 13113-16383

The slot range boundaries match cluster C's original layout. Inspecting the data on each instance shows how the data itself is now distributed:

10.4.7.221:7001 0-3278
10.4.7.222:7002 3279-6556
10.4.7.221:7013 6557-9834
10.4.7.221:7004 9835-13112
# The four nodes above (slots 0-13112) hold cluster C's original data
10.4.7.221:8013 13113-16383
# This node (slots 13113-16383) holds cluster A's original data

The epoch and slot information below was recorded before the chaos; it matches the number of masters and the slot allocation observed afterwards.

Cluster             Epoch  Slots
Cluster A
10.4.7.222:8002 3	 5556-11112
10.4.7.222:8011 1	 
10.4.7.221:8001 1	 0-5555
10.4.7.221:8013 6	 11113-16383
Cluster C
10.4.7.222:7002 4	 3279-6556
10.4.7.221:7001 14	 0-3278
10.4.7.222:7014 9	
10.4.7.221:7013 10	 6557-9834
10.4.7.223:9003 15	
10.4.7.221:7004 9	 9835-13112
10.4.7.223:8003 0	
10.4.7.222:7005 3	 13113-16383
10.4.7.222:7011 14	
10.4.7.223:7003 16	

Comparing key counts and memory usage before and after: for the instances whose slot ranges did not change, the data is identical and memory usage barely differs.

10.4.7.221:7001 0-3278
10.4.7.222:7002 3279-6556
10.4.7.221:7013 6557-9834
10.4.7.221:7004 9835-13112

For the instance whose slot range did change, both the key count and memory changed significantly:

10.4.7.221:8013 11113-16383 13113-16383
dbsize comparison
                after the chaos                  before the chaos
10.4.7.221:8001 dbsize (integer) 199862		338674
10.4.7.222:8002 dbsize (integer) 199862		338920
10.4.7.223:8003 dbsize (integer) 0			321407
10.4.7.223:7003 dbsize (integer) 0			200014
*10.4.7.221:7004 dbsize (integer) 199857		199857
10.4.7.222:7005 dbsize (integer) 199462		199381
10.4.7.222:8011 dbsize (integer) 199887		338674
10.4.7.223:8012 dbsize Connection refused	338920
*10.4.7.221:8013 dbsize (integer) 199462		321407
10.4.7.222:7011 dbsize (integer) 199887		199887
10.4.7.223:7012 dbsize Connection refused	199862
*10.4.7.221:7013 dbsize (integer) 200014		200014
10.4.7.222:7014 dbsize (integer) 199857		199857
10.4.7.223:7015 dbsize Connection refused	199381

Inspecting the keys on each instance: every instance holds only data from the cluster it now serves; there is no merging or mixing of data within the same slot. For conflicting slots, the data ends up following whichever cluster wins ownership of the slot.

Memory comparison
                before the chaos  after the chaos
10.4.7.221:8001 65.04M 39.11M
10.4.7.222:8002 65.09M 38.11M
10.4.7.223:8003 62.05M 1.42M
*10.4.7.221:7001 39.01M 39.13M
*10.4.7.222:7002 39.01M 39.13M
10.4.7.223:7003 39.03M 1.43M
*10.4.7.221:7004 39.01M 39.10M
10.4.7.222:7005 38.97M 38.03M
10.4.7.222:8011 64.03M 38.10M
10.4.7.223:8012 64.07M Connection refused
*10.4.7.221:8013 61.08M 41.05M
10.4.7.222:7011 38.02M 38.11M
10.4.7.223:7012 38.01M Connection refused
*10.4.7.221:7013 38.04M 39.11M
10.4.7.222:7014 38.02M 38.12M
10.4.7.223:7015 37.93M Connection refused

From the results above we can sketch out how the cluster chaos unfolds.

  1. After the crash, one of cluster C's instances remained in cluster C's node table in the fail state:
    94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master,fail - 1648196457015 1648196454609 6 disconnected

  2. When a new instance was started on that same ip:port, the original cluster (C) automatically reclaimed the address: the old node id became fail,noaddr and the new node id was added to its node table.
  3. Cluster A then issued CLUSTER MEET for the same instance, adding it to cluster A as well.
  4. The node thus belonged to cluster C (which had reclaimed it) and, as a newly added empty master, to cluster A at the same time; it exchanged gossip with both clusters, and from that point on they treated themselves as a single cluster.
  5. All nodes in the merged cluster gossip with each other, and the cluster metadata carried by the node with the highest epoch is propagated to the others. That is why merging cluster A's 3 masters/3 replicas with cluster C's 5 masters/5 replicas ends up with 5 slot-owning masters: the winning metadata came from the 10.4.7.223:7003 instance, whose epoch of 16 was the highest in the merged cluster.
  6. A master that owns slots broadcasts its slot claims, stamped with its config epoch, to every node in the cluster. A node that receives claims for the same slot from more than one master compares the epochs and records the claim with the higher epoch in its own slot table. The comparison below shows how the final slot ownership falls out (a small sketch of this rule follows after this list).
    Cluster             Epoch  Slots
    Cluster A
    10.4.7.222:8002 3	 5556-11112
    10.4.7.222:8011 1	 
    10.4.7.221:8001 1	 0-5555
    10.4.7.221:8013 6	 11113-16383
    Cluster C
    10.4.7.222:7002 4	 3279-6556
    10.4.7.221:7001 14	 0-3278
    10.4.7.222:7014 9	
    10.4.7.221:7013 10	 6557-9834
    10.4.7.223:9003 15	
    10.4.7.221:7004 9	 9835-13112
    10.4.7.223:8003 0	
    10.4.7.222:7005 3	 13113-16383
    10.4.7.222:7011 14	
    10.4.7.223:7003 16	
    
    Cluster C 10.4.7.221:7001 epoch 14  0-3278
    Cluster A 10.4.7.221:8001 epoch 1   0-5555       => highest epoch for 0-3278: Cluster C 10.4.7.221:7001

    Cluster C 10.4.7.222:7002 epoch 4   3279-6556
    Cluster A 10.4.7.221:8001 epoch 1   0-5555
    Cluster A 10.4.7.222:8002 epoch 3   5556-11112   => highest epoch for 3279-6556: Cluster C 10.4.7.222:7002

    Cluster C 10.4.7.221:7013 epoch 10  6557-9834
    Cluster A 10.4.7.222:8002 epoch 3   5556-11112   => highest epoch for 6557-9834: Cluster C 10.4.7.221:7013

    Cluster C 10.4.7.221:7004 epoch 9   9835-13112
    Cluster A 10.4.7.221:8013 epoch 6   11113-16383
    Cluster A 10.4.7.222:8002 epoch 3   5556-11112   => highest epoch for 9835-13112: Cluster C 10.4.7.221:7004

    Cluster C 10.4.7.222:7005 epoch 3   13113-16383
    Cluster A 10.4.7.221:8013 epoch 6   11113-16383  => highest epoch for 13113-16383: Cluster A 10.4.7.221:8013

    The final result matches the slot distribution observed after the chaos:
    10.4.7.221:7001 0-3278
    10.4.7.222:7002 3279-6556
    10.4.7.221:7013 6557-9834
    10.4.7.221:7004 9835-13112
    10.4.7.221:8013 13113-16383
    
    The data confirms this as well: slots 0-13112 hold cluster C's original data, while slots 13113-16383 hold cluster A's original data.
    
    
  7. Judging from the data, a slot claim is simply overwritten by the claim with the higher epoch, so dirty data (keys from two clusters mixed within one slot) should not occur.
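
The slot-ownership rule from step 6 boils down to: for each slot, the claim carrying the highest config epoch wins. The following is a conceptual sketch of that rule only (it is not the Redis source); the claims are taken from the pre-chaos epoch/slot table recorded above, and running it reproduces the post-chaos slot layout:

# Each entry: (node, config epoch, slot range it claims), from the table above.
claims = [
    ("C 10.4.7.221:7001", 14, range(0, 3279)),
    ("C 10.4.7.222:7002",  4, range(3279, 6557)),
    ("C 10.4.7.221:7013", 10, range(6557, 9835)),
    ("C 10.4.7.221:7004",  9, range(9835, 13113)),
    ("C 10.4.7.222:7005",  3, range(13113, 16384)),
    ("A 10.4.7.221:8001",  1, range(0, 5556)),
    ("A 10.4.7.222:8002",  3, range(5556, 11113)),
    ("A 10.4.7.221:8013",  6, range(11113, 16384)),
]

# A node that receives several claims for a slot keeps the owner with the highest epoch.
owner = {}   # slot -> (epoch, node)
for node, epoch, slots in claims:
    for slot in slots:
        if slot not in owner or epoch > owner[slot][0]:
            owner[slot] = (epoch, node)

# Collapse the per-slot result back into contiguous ranges for readability.
start = 0
for slot in range(1, 16384 + 1):
    if slot == 16384 or owner[slot][1] != owner[start][1]:
        print("{}-{} -> {}".format(start, slot - 1, owner[start][1]))
        start = slot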

The cluster chaos does not reproduce every time. At first I assumed it could only be reproduced in the shutdown-and-migrate scenario, but later even that did not reproduce it reliably. My current guess is that only an instance whose epoch in the original cluster is high enough gets reclaimed after being stopped and restarted, since the epoch values are the one uncontrolled variable between test runs.

The results above are only what this particular run reproduced. In earlier attempts I also saw slots being lost and slot ranges being truncated; once those symptoms are related to the epoch values of the clusters and instances and to the slot-ownership rule above, they are easy to explain.

A few points to note when reproducing:

  1. Remember to stop the firewall after a machine boots.
  2. Do not run MEET and REPLICATE too quickly after starting an instance; otherwise the new cluster may finish its setup before the original cluster has even discovered the instance (or perhaps the instance's epoch in the original cluster is too low for it to be reclaimed? not yet certain).
  3. Shutting the server down may not actually be a necessary condition for triggering the chaos.

Analysis of why the problem occurred in production:

  1. In the first version of the whole-machine clone feature we did not run CLUSTER FORGET for the dead instances across the whole cluster; the second version did. Yet even after the second version ran FORGET on every cluster, some clusters still kept the fail entries for the dead instances. The CLUSTER FORGET documentation explains why: the command must be executed on all nodes within 60 seconds, otherwise the dead node's entry may never be removed from the cluster state. In a crash scenario, the API calls that try to connect to the dead instances time out before failing, which pushes the total time needed to run the command on every node past 60 seconds, so the dead instance can never be purged from the cluster state.

    How the command behaves (paraphrasing the documentation; the node names A, B, C, D here are the documentation's own, unrelated to the clusters above):
    Suppose we have four nodes A, B, C and D. To end up with a three-node cluster of A, B and C we could:

    1. Reshard node D's hash slots onto nodes A, B and C.
    2. Node D is now empty, but A, B and C still carry D in their node tables.
    3. We connect to node A and send CLUSTER FORGET D.
    4. Node B sends a heartbeat packet to node A that contains gossip about node D.
    5. Node A no longer knows node D (see step 3), so it starts a handshake with D.
    6. Node D ends up in node A's node table again.

    Removing a node this way is unreliable, so CLUSTER FORGET must be sent to all nodes as quickly as possible, before any gossip sections about the node are processed again. For this reason CLUSTER FORGET implements, per node, a ban list with a timeout.
    What the command actually does is therefore:

    1. Remove the node to be deleted from the node table of the node that received the command.
    2. Add the removed node's ID to a ban list, where it is kept for one minute.
    3. While processing gossip sections from other nodes, the receiving node skips every node that is on the ban list.

    This gives us a 60-second window to tell all nodes in the cluster that we want a node removed.

  2. The production scenario was: server A crashed and its instances were migrated to server B; after A was repaired it went back into the resource pool as a spare. Later server C crashed, and when its instances were migrated onto server A, one of the clusters running on server C happened to still contain a never-forgotten fail entry whose ip:port matched an instance that had originally run on server A. The clone feature had been live for a long time without this ever happening, because clones had always targeted brand-new servers; this time the clone targeted a machine that had previously crashed and been reclaimed, which is why the problem finally surfaced.

Mitigations:

  1. Periodically scan the cluster state and detect and clean up entries stuck in the fail state.
  2. Raise the minimum value of the port-allocation rule above the highest port currently used by any cluster, so that new instances created on new machines can never collide with an existing ip:port.
  3. Change the IP address of a repaired machine before returning it to the resource pool.
  4. When issuing CLUSTER FORGET, do not run it node by node in a sequential for loop; use multiple threads so the command reaches every instance of the cluster within the 60-second window (see the sketch below).
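
For point 4, the essential property is that CLUSTER FORGET reaches every surviving node inside the 60-second ban-list window, so the calls must run in parallel and a dead or slow target must not stall the rest. Below is a minimal sketch of that idea using redis-cli through a thread pool; the node list and node id are placeholders taken from this experiment, not the platform's real code:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def forget(host, port, node_id, timeout=5):
    """Ask one node to forget node_id; never block longer than `timeout` seconds."""
    try:
        out = subprocess.run(
            ["redis-cli", "-h", host, "-p", str(port), "cluster", "forget", node_id],
            capture_output=True, timeout=timeout)
        return "{}:{} {}".format(host, port, out.stdout.decode().strip() or out.stderr.decode().strip())
    except subprocess.TimeoutExpired:
        return "{}:{} timed out".format(host, port)

# Placeholders: every *remaining* node of the cluster and the id of the dead node.
remaining_nodes = [("10.4.7.221", 8001), ("10.4.7.222", 8002), ("10.4.7.221", 8013),
                   ("10.4.7.222", 8011), ("10.4.7.224", 8012), ("10.4.7.224", 8003)]
dead_node_id = "e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed"

with ThreadPoolExecutor(max_workers=len(remaining_nodes)) as pool:
    results = list(pool.map(lambda n: forget(n[0], n[1], dead_node_id), remaining_nodes))
for line in results:
    print(line)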

References:

  1. Redis 集群中的纪元(epoch) (epochs in Redis Cluster)
  2. CLUSTER FORGET node-id