A real scare: restarting a Redis cluster

Restarting a Redis cluster almost turned into a major outage.

First, the normal restart procedure; then a post-mortem of what went wrong.

1. Redis Cluster restart

Check the current state of the cluster with cluster nodes:

172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614741062129 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 1614741058122 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614741061128 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614741060125 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614741058000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 :0@0 slave,fail,noaddr 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected

Shut down every node: connect to each instance individually and issue shutdown.

./redis-cli -c -h 172.16.135.1 -p 7005

172.16.135.1:7005> shutdown
not connected> 
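With six nodes it is easier to script the shutdowns than to connect to each one by hand; a minimal sketch, assuming all instances run on 172.16.135.1 on ports 7001-7006:

for port in 7001 7002 7003 7004 7005 7006; do
  ./redis-cli -h 172.16.135.1 -p "$port" shutdown
done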

Delete the persistence and cluster config files under each node's directory:

rm -rf appendonly.aof  dump.rdb nodes.conf
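If every instance keeps its files in its own directory, the cleanup can also be looped. The per-port directories below are purely an assumed layout; adjust the path to wherever each instance's dir and appendfilename settings actually point:

for port in 7001 7002 7003 7004 7005 7006; do
  rm -f ./$port/appendonly.aof ./$port/dump.rdb ./$port/nodes.conf
done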

Start each Redis node:

 redis-server redis_172.16.135.1_7001_master.conf 
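The other instances start the same way. Assuming every node's config file follows the naming pattern above (replica configs may be named differently in your setup), they can be started in a loop:

for port in 7001 7002 7003 7004 7005 7006; do
  redis-server redis_172.16.135.1_${port}_master.conf
done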

Join the nodes into a cluster and set up the master/replica relationships:

 ./redis-trib.rb create --replicas 1 172.16.135.1:7001 172.16.135.1:7002 172.16.135.1:7003 172.16.135.1:7004 172.16.135.1:7005 172.16.135.1:7006
[root@izbp19ujl2isnidre4orygz redis_cluster]# ./redis-trib.rb create --replicas 1 172.16.135.1:7001 172.16.135.1:7002 172.16.135.1:7003 172.16.135.1:7004 172.16.135.1:7005 172.16.135.1:7006
>>> Creating cluster
>>> Performing hash slots allocation on 6 nodes...
Using 3 masters:
172.16.135.1:7001
172.16.135.1:7004
172.16.135.1:7002
Adding replica 172.16.135.1:7005 to 172.16.135.1:7001
Adding replica 172.16.135.1:7003 to 172.16.135.1:7004
Adding replica 172.16.135.1:7006 to 172.16.135.1:7002
M: f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd 172.16.135.1:7001
   slots:0-5460 (5461 slots) master
M: 7ef7d03402e5fd5c62b40df824d6cfd744731a35 172.16.135.1:7002
   slots:10923-16383 (5461 slots) master
S: a9ccfefc428ab332c7385bdcb068d5c47e37883f 172.16.135.1:7003
   replicates 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223
M: 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223 172.16.135.1:7004
   slots:5461-10922 (5462 slots) master
S: 309e053c490f755b31b6782705bec22158c3d191 172.16.135.1:7005
   replicates f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd
S: 601922f40deb116ad970b8797f775145511f5848 172.16.135.1:7006
   replicates 7ef7d03402e5fd5c62b40df824d6cfd744731a35
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join...
>>> Performing Cluster Check (using node 172.16.135.1:7001)
M: f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd 172.16.135.1:7001
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
M: 7ef7d03402e5fd5c62b40df824d6cfd744731a35 172.16.135.1:7002
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: 601922f40deb116ad970b8797f775145511f5848 172.16.135.1:7006
   slots: (0 slots) slave
   replicates 7ef7d03402e5fd5c62b40df824d6cfd744731a35
M: 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223 172.16.135.1:7004
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: a9ccfefc428ab332c7385bdcb068d5c47e37883f 172.16.135.1:7003
   slots: (0 slots) slave
   replicates 9c65fc9ccd93bf9b9d404e484930cc6e7ad21223
S: 309e053c490f755b31b6782705bec22158c3d191 172.16.135.1:7005
   slots: (0 slots) slave
   replicates f35d7df55f7ceeff0a8cc1b39a726f6d34b5c6dd
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered
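After creation it is worth confirming the cluster state from any node; cluster_state should be ok and all 16384 slots assigned:

./redis-cli -c -h 172.16.135.1 -p 7001 cluster info | grep -E 'cluster_state|cluster_slots_assigned'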

2. Problems encountered

If any node was not shut down, or was shut down but its old config files were not deleted, the create step tends to hang at "Waiting for the cluster to join":

>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join...............................................................................................................................................................

If a node was shut down and its config files deleted, but it was never started again, cluster creation reports that it cannot connect:

>>> Creating cluster
[ERR] Sorry, can't connect to node 172.16.135.1:7001
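Both failure modes can be ruled out before re-running the create step with a quick check that every node is up, reachable, and empty (host and ports assumed as above). A freshly started, clean node should answer PONG and report cluster_known_nodes:1:

for port in 7001 7002 7003 7004 7005 7006; do
  echo "== $port =="
  ./redis-cli -h 172.16.135.1 -p $port ping
  ./redis-cli -h 172.16.135.1 -p $port cluster info | grep cluster_known_nodes
done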

3. Restarting while keeping the data

Shutting the nodes down one at a time and restarting them one at a time should, in theory, let the cluster recover on its own.

This is the expected state (7005 had just been shut down, so it is marked as fail):

172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614740771520 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 1614740770016 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614740772000 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614740772523 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614740768000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 172.16.135.1:7005@17005 slave,fail 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected

This is after restarting one node: once the 7005 instance was brought back up, the output became this:

172.16.135.1:7004> cluster nodes
3cf4888935ddf1ed758872fa4996dff57533d88c 172.16.135.1:7002@17002 master - 0 1614740929000 2 connected 10923-16383
34e4591b5684f4a91eaf6e23359d3572a9d3374d 172.16.135.1:7006@17006 slave 3cf4888935ddf1ed758872fa4996dff57533d88c 0 161474092430 2 connected
9a1eb8a0cc7aa97d0e437d0d420b483be727e927 172.16.135.1:7001@17001 master - 0 1614740930000 1 connected 0-5460
b38824033aa9599f57969d5cc7078eacca721bee 172.16.135.1:7003@17003 master - 0 1614740930902 8 connected 5461-10922
a8010691b4cdc166a211f114364562447ea89e98 172.16.135.1:7004@17004 myself,slave b38824033aa9599f57969d5cc7078eacca721bee 0 1614740920000 4 connected
f66f51c4b09404994d2bd18fe1a1cf0d16c97411 :0@0 slave,fail,noaddr 9a1eb8a0cc7aa97d0e437d0d420b483be727e927 1614740749268 1614740746466 5 disconnected

At the time I could not work out why. Based on the configuration, once 7005 came back up it should have rejoined the cluster automatically and become a replica of its old master (7001) again.

Maybe it does not rejoin the cluster automatically in this situation and has to be reconfigured manually?
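In hindsight, the ":0@0 ... noaddr" entry suggests the restarted 7005 came back with a new node ID (for example because its nodes.conf had been removed), so the cluster could no longer match it to the old entry. A possible manual recovery, sketched here but not verified against this incident, is to introduce the restarted instance to the cluster again and attach it to its intended master:

# from any healthy node, introduce the restarted instance to the cluster
./redis-cli -h 172.16.135.1 -p 7004 cluster meet 172.16.135.1 7005

# on the restarted node, make it a replica of the intended master (7001's node ID, taken from cluster nodes)
./redis-cli -h 172.16.135.1 -p 7005 cluster replicate 9a1eb8a0cc7aa97d0e437d0d420b483be727e927

# drop the stale failed entry; CLUSTER FORGET should be sent to every node to make it stick
./redis-cli -h 172.16.135.1 -p 7004 cluster forget f66f51c4b09404994d2bd18fe1a1cf0d16c97411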

At the time I was too panicked for any of that, so I simply shut everything down, deleted all the files, and recreated the cluster from scratch.

 
