A Redis cluster restart that nearly caused a major outage
First, the normal restart procedure; then a post-mortem of why things went wrong.
1. Redis cluster restart
Check the current state of the cluster with cluster nodes:
172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614741062129 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 1614741058122 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614741061128 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614741060125 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614741058000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 :0@0 slave,fail,noaddr 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected
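The cluster nodes output above is line-oriented and easy to script against. As a sketch, this awk one-liner pulls out each master and its slot range (shown here against a single sample line copied from the output above; in practice pipe the live ./redis-cli ... cluster nodes output in):

```shell
# One sample line in the same format as the `cluster nodes` output above.
nodes='3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614741062129 2 connected 10923-16383'

# Fields: <id> <addr> <flags> <master-id> <ping> <pong> <epoch> <state> [slots...]
# Print the address and slot range of every master.
echo "$nodes" | awk '$3 ~ /master/ { print $2, $9 }'
```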
Shut down every node: connect to each one in turn and run shutdown.
./redis-cli -c -h 172.16.135.1 -p 7005
172.16.135.1:7005> shutdown
not connected>
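Connecting to each of the six nodes by hand is error-prone; a loop over the ports covers them all. A minimal sketch, assuming the host and port list from the outputs above (the echo prefix makes this a dry run that only prints the commands; drop it to actually execute):

```shell
HOST=172.16.135.1

for port in 7001 7002 7003 7004 7005 7006; do
  # Dry run: print the command instead of running it.
  # SHUTDOWN NOSAVE skips the final RDB save, which is fine here
  # because the data files are deleted in the next step anyway.
  echo ./redis-cli -h "$HOST" -p "$port" shutdown nosave
done
```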
In each node's data directory, delete the persistence and cluster-state files:
rm -rf appendonly.aof dump.rdb nodes.conf
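The same deletion can be looped across nodes. A sketch that assumes one data directory per node named after its port (./7001, ./7002, ...) — adjust to however `dir` is set in your redis.conf; the echo prefix keeps it a dry run:

```shell
# Assumed layout: one data directory per node, e.g. ./7001, ./7002, ...
for port in 7001 7002 7003 7004 7005 7006; do
  # Dry run: print what would be removed; drop `echo` to execute.
  echo rm -f "./$port/appendonly.aof" "./$port/dump.rdb" "./$port/nodes.conf"
done
```

Note that nodes.conf holds the node's cluster identity: once it is deleted, the node comes back as a brand-new node with a fresh ID, which is why a fresh cluster create is needed afterwards.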
Start each Redis node:
redis-server redis_172.16.135.1_7001_master.conf
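Starting all six can also be looped. This sketch assumes the config filenames follow the redis_172.16.135.1_7001_master.conf pattern shown above for every node (in a real setup the _master/_slave suffix may differ per node); echo keeps it a dry run:

```shell
HOST=172.16.135.1

for port in 7001 7002 7003 7004 7005 7006; do
  # Filename pattern assumed from the single example above.
  echo redis-server "redis_${HOST}_${port}_master.conf"
done
```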
Re-create the cluster from the nodes and set up the master-replica pairs:
[root@izbp19ujl2isnidre4orygz redis_cluster]# ./redis-trib.rb create --replicas 1 172.16.135.1:7001 172.16.135.1:7002 172.16.135.1:7003 172.16.135.1:7004 172.16.135.1:7005 172.16.135.1:7006
>>> Creating cluster
>>> Performing hash slots allocation on 6 nodes...
Using 3 masters:
172.16.135.1:7001
172.16.135.1:7004
172.16.135.1:7002
Adding replica 172.16.135.1:7005 to 172.16.135.1:7001
Adding replica 172.16.135.1:7003 to 172.16.135.1:7004
Adding replica 172.16.135.1:7006 to 172.16.135.1:7002
M: f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd 172.16.135.1:7001
slots:0-5460 (5461 slots) master
M: 7ef7d03402e5fd5c62b40df824d6cfd744731a35 172.16.135.1:7002
slots:10923-16383 (5461 slots) master
S: a9ccfefc428ab332c7385bdcb068d5c47e37883f 172.16.135.1:7003
replicates 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223
M: 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223 172.16.135.1:7004
slots:5461-10922 (5462 slots) master
S: 309e053c490f755b31b6782705bec22158c3d191 172.16.135.1:7005
replicates f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd
S: 601922f40deb116ad970b8797f775145511f5848 172.16.135.1:7006
replicates 7ef7d03402e5fd5c62b40df824d6cfd744731a35
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join...
>>> Performing Cluster Check (using node 172.16.135.1:7001)
M: f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd 172.16.135.1:7001
slots:0-5460 (5461 slots) master
1 additional replica(s)
M: 7ef7d03402e5fd5c62b40df824d6cfd744731a35 172.16.135.1:7002
slots:10923-16383 (5461 slots) master
1 additional replica(s)
S: 601922f40deb116ad970b8797f775145511f5848 172.16.135.1:7006
slots: (0 slots) slave
replicates 7ef7d03402e5fd5c62b40df824d6cfd744731a35
M: 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223 172.16.135.1:7004
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: a9ccfefc428ab332c7385bdcb068d5c47e37883f 172.16.135.1:7003
slots: (0 slots) slave
replicates 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223
S: 309e053c490f755b31b6782705bec22158c3d191 172.16.135.1:7005
slots: (0 slots) slave
replicates f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered
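After creation it is worth confirming that the cluster itself reports a healthy state, not just that redis-trib printed [OK]. A sketch that checks the cluster_state and cluster_slots_assigned fields; shown here against a sample fragment of `cluster info` output, since the real command needs live nodes:

```shell
# Sample `cluster info` fragment; against a live cluster, replace with:
#   info=$(./redis-cli -h 172.16.135.1 -p 7001 cluster info | tr -d '\r')
# (real redis-cli output carries trailing \r characters, hence the tr).
info='cluster_state:ok
cluster_slots_assigned:16384
cluster_known_nodes:6'

state=$(echo "$info" | awk -F: '/^cluster_state/ { print $2 }')
slots=$(echo "$info" | awk -F: '/^cluster_slots_assigned/ { print $2 }')

if [ "$state" = "ok" ] && [ "$slots" -eq 16384 ]; then
  echo "cluster healthy: $slots slots assigned"
else
  echo "cluster unhealthy: state=$state slots=$slots" >&2
fi
```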
2. Problems encountered
If any node was still running, or had been shut down but its nodes.conf left in place, cluster creation tended to hang at "Waiting for the cluster to join":
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join................................................ (hangs indefinitely)
If a node had been shut down and its files deleted, but was never started again, cluster creation failed because redis-trib could not connect to it:
>>> Creating cluster
[ERR] Sorry, can't connect to node 172.16.135.1:7001
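A quick pre-flight loop that PINGs every node before running redis-trib avoids this error up front. A sketch: the PING itself needs live nodes, so that line is left commented, while the host/port splitting is plain POSIX shell:

```shell
NODES="172.16.135.1:7001 172.16.135.1:7002 172.16.135.1:7003 \
172.16.135.1:7004 172.16.135.1:7005 172.16.135.1:7006"

for addr in $NODES; do
  host=${addr%:*}   # strip the :port suffix
  port=${addr#*:}   # strip the host: prefix
  # Uncomment against live nodes:
  # ./redis-cli -h "$host" -p "$port" ping | grep -q PONG || echo "DOWN: $addr"
  echo "would ping $host $port"
done
```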
3. Restarting while keeping the data
Shutting nodes down and restarting them one at a time should, in principle, let the cluster recover.
This is the healthy state:
172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614740771520 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 1614740770016 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614740772000 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614740772523 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614740768000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 172.16.135.1:7005@17005 slave,fail 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected
And this is after restarting a single node; once 7005 was restarted, it looked like this:
172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614740929000 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 161474092430 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614740930000 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614740930902 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614740920000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 :0@0 slave,fail,noaddr 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected
At the time I could not work out why. By the configuration, once 7005 came back up it should have rejoined the cluster automatically as a replica of 7001 (the master its old entry points at in the output above).
Perhaps it does not rejoin automatically here and has to be added back manually?
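If the restarted node really does not rejoin on its own (which is expected when its nodes.conf was deleted, since it comes back with a brand-new node ID), it can be re-attached by hand with CLUSTER MEET and CLUSTER REPLICATE. A dry-run sketch using the node IDs from the output above; drop the echo prefixes to execute:

```shell
HOST=172.16.135.1
MASTER_ID=9a1eb8a0cc7aa97d0e437d0d420b483be727e927   # 7001's ID from the output above

# 1. Introduce the restarted node to an existing cluster member.
echo ./redis-cli -h "$HOST" -p 7005 cluster meet "$HOST" 7001
# 2. Make it replicate its old master again.
echo ./redis-cli -h "$HOST" -p 7005 cluster replicate "$MASTER_ID"
# 3. Optionally drop the stale noaddr entry (run against every remaining node).
echo ./redis-cli -h "$HOST" -p 7004 cluster forget f66f51c4b09404994d2bd18fe1a1cf0d16c97411
```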
In the panic of the moment, I simply shut every node down, deleted all the files, and re-created the cluster from scratch.