Some Redis Cluster problems: slots, forget, reset, meet

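This post records the problems I ran into with a Redis Cluster and how I worked through them: a Lua script failing to connect, the cluster state showing fail, missing slots, confused node information, and data loss. I repaired the cluster with cluster forget, cluster reset, and cluster meet, and it eventually returned to normal operation.
1. Below are the errors in the Nginx log when the Lua script connects to the Redis Cluster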
2020/02/29 10:12:35 [notice] 621#0: signal process started
2020/02/29 10:12:42 [error] 32231#0: *438 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
2020/02/29 10:12:42 [error] 32231#0: *438 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
2020/02/29 10:12:42 [error] 622#0: *453 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "slave1:8080", referrer: "http://slave1:8080/"
2020/02/29 10:12:42 [error] 622#0: *453 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "slave1:8080", referrer: "http://slave1:8080/"

Sure enough, the connection to the cluster failed.

2. Check the Redis processes: they are all running
[hadoop@master redis-cluster]$ ps -ef | grep redis
hadoop   12538  9166  0 10:14 pts/2    00:00:00 grep redis
hadoop   16086     1  0 09:44 ?        00:00:03 ./redis-server *:7001 [cluster]
hadoop   16088     1  0 09:44 ?        00:00:03 ./redis-server *:7002 [cluster]
hadoop   16094     1  0 09:44 ?        00:00:03 ./redis-server *:7003 [cluster]
hadoop   16098     1  0 09:44 ?        00:00:03 ./redis-server *:7004 [cluster]
hadoop   16102     1  0 09:44 ?        00:00:03 ./redis-server *:7005 [cluster]
hadoop   16106     1  0 09:44 ?        00:00:03 ./redis-server *:7006 [cluster]

3. Run the command below to view the cluster information. It shows cluster_state:fail, so the cluster is not usable
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster info
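4. Check the node information: some entries show fail and noaddr. I must have left the cluster in a messy state while experimenting with it earlier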
127.0.0.1:7001> cluster nodes
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582944656654 1582944656654 7 disconnected 0-165 5461-5627 10923-11088
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582944656655 1582944656654 7 disconnected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582945142217 4 connected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945144229 3 connected 11089-16383
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945143223 6 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582945145236 2 connected 5628-10922
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582945144229 5 connected
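
The flags column is what gives the broken entries away. As a rough sketch (the helper name `broken_nodes` is made up for illustration, and the sample text is copied from the output above), the IDs of entries flagged fail or noaddr can be pulled out of the cluster nodes output like this:

```python
def broken_nodes(cluster_nodes_text, bad_flags=("fail", "noaddr")):
    """Return the IDs of nodes whose flags column contains any of bad_flags."""
    ids = []
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a complete cluster nodes line
        node_id, flags = fields[0], fields[2].split(",")
        # Exact match on each flag, so "fail?" (PFAIL) is not treated as "fail".
        if any(flag in flags for flag in bad_flags):
            ids.append(node_id)
    return ids


sample = """\
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582944656654 1582944656654 7 disconnected 0-165 5461-5627 10923-11088
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582944656655 1582944656654 7 disconnected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945144229 3 connected 11089-16383
"""

for node_id in broken_nodes(sample):
    print(node_id)
```

These two IDs are the candidates for cluster forget in the next step.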
5. Remove the broken nodes with cluster forget
127.0.0.1:7001> cluster forget cd1d5f2d601ffc6df5f01adb413f8646b1a501bf
...
...
  • After removing them, the node list looks like this
127.0.0.1:7001> cluster nodes
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582945467959 4 connected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945466954 3 connected 11089-16383
6f6b90e530270e4e70a908c4cf7957a5d16f9c16 127.0.0.1:7007 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945465950 3 connected
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945468965 6 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582945464942 2 connected 5628-10922
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582945466451 5 connected

6. Restart the Redis Cluster; Nginx still reports the error
2020/02/27 08:57:25 [error] 14159#0: *119 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
7. Check the Redis Cluster information again

[Screenshot not preserved]

8. All 16384 slots have to be assigned; mine were missing a lot of them

A healthy cluster shows this:

cluster_slots_assigned:16384
cluster_slots_ok:16384
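
To see exactly which slots are missing, the ranges owned by the connected masters can be checked for gaps. The following is a minimal sketch, not part of any Redis tool; `missing_ranges` is a hypothetical helper and the three ranges are the ones held by 7001, 7002 and 7003 in the step 4 output:

```python
TOTAL_SLOTS = 16384


def missing_ranges(assigned):
    """Given inclusive (start, end) slot ranges, return the uncovered ranges."""
    gaps = []
    next_slot = 0  # lowest slot not yet known to be covered
    for start, end in sorted(assigned):
        if start > next_slot:
            gaps.append((next_slot, start - 1))
        next_slot = max(next_slot, end + 1)
    if next_slot < TOTAL_SLOTS:
        gaps.append((next_slot, TOTAL_SLOTS - 1))
    return gaps


healthy = [(166, 5460), (5628, 10922), (11089, 16383)]
gaps = missing_ranges(healthy)
print(gaps)
print(sum(end - start + 1 for start, end in gaps), "slots unassigned")
```

The gaps are 0-165, 5461-5627 and 10923-11088, which are the ranges that belonged to the failed master removed in step 5.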
9. Next, add the missing slots
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 
127.0.0.1:7001> cluster addslots {5461..5627}
(error) ERR Invalid or out of range slot

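That does not work: inside the redis-cli prompt, {5461..5627} is sent to the server as one literal argument.
Run from the shell, however, the same command works, because bash expands the braces into individual slot numbers before redis-cli sees them: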
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster addslots {5461..5627}
OK
Added this way, it works.
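
To make it concrete what the server receives in the working case, here is a small sketch that builds the same argument list that brace expansion produces. The function name `addslots_argv` and its defaults are assumptions made for this example:

```python
def addslots_argv(first, last, host="127.0.0.1", port=7001, cli="./redis-cli"):
    """Build the argument list for: redis-cli cluster addslots <first> ... <last>."""
    slots = [str(slot) for slot in range(first, last + 1)]  # one argument per slot
    return [cli, "-h", host, "-p", str(port), "cluster", "addslots"] + slots


argv = addslots_argv(5461, 5627)
slot_args = argv[7:]
print(len(slot_args), "slot arguments, from", slot_args[0], "to", slot_args[-1])
```

The list could be handed to `subprocess.run(argv)` to run the command without relying on shell expansion.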
10. Flush the node you modified. Be careful: flushall deletes every key on that node
[hadoop@master redis01]$ ./redis-cli -h 10.211.55.200 -p 7001 flushall
OK
11. Check again: OK, the slots were added and the cluster is up

[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_sent:4494
cluster_stats_messages_received:6943

12. Test by adding some data

[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001
127.0.0.1:7001> keys *
(empty list or set)
127.0.0.1:7001> set a 1
(error) MOVED 15495 127.0.0.1:7003

Solution
Start redis-cli with the -c option to enable cluster mode, so that it follows MOVED redirections automatically:

[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -c -p 7001 
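
The MOVED reply is not really an error: it tells the client which slot the key hashes to and which node owns that slot. Redis Cluster maps a key to a slot with CRC16 modulo 16384. The sketch below reproduces that calculation with Python's standard library (`binascii.crc_hqx` with an initial value of 0 is the same CRC16 variant) and parses the redirection; `key_slot` and `parse_moved` are names invented for this example:

```python
import binascii


def key_slot(key: bytes) -> int:
    """Hash slot of a key: CRC16(key) mod 16384, honouring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]  # only the non-empty tag is hashed
    return binascii.crc_hqx(key, 0) % 16384


def parse_moved(error: str):
    """Split 'MOVED <slot> <host>:<port>' into (slot, host, port)."""
    kind, slot, address = error.split()
    if kind != "MOVED":
        raise ValueError("not a MOVED redirection: " + error)
    host, _, port = address.rpartition(":")
    return int(slot), host, int(port)


print(key_slot(b"a"))
print(parse_moved("MOVED 15495 127.0.0.1:7003"))
```

Key a hashes to slot 15495, which lies in the 11089-16383 range served by 7003, so the redirection above is exactly what the cluster should answer. With -c, redis-cli performs this hop itself.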
13. Adding data again fails; the cluster has gone down
  • Run a check; the errors are as follows
    [hadoop@master redis-cluster]$ ./redis-trib.rb check 10.211.55.200:7001
    Connecting to node 10.211.55.200:7001: OK
    Connecting to node 127.0.0.1:7003: OK
    Connecting to node 127.0.0.1:7005: OK
    Connecting to node 127.0.0.1:7002: OK
    Connecting to node 127.0.0.1:7006: OK
    Connecting to node 127.0.0.1:7004: OK

Performing Cluster Check (using node 10.211.55.200:7001)
M: ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 10.211.55.200:7001
slots:0-5627,10923-11088 (5794 slots) master
1 additional replica(s)
M: 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003
slots:11089-16383 (5295 slots) master
1 additional replica(s)
S: ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005
slots: (0 slots) slave
replicates 7dfde834ea1ef108c081f953116fb35e30efcfa7
M: 7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002
slots:5628-10922 (5295 slots) master
1 additional replica(s)
S: 3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006
slots: (0 slots) slave
replicates 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15
S: 923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004
slots: (0 slots) slave
replicates ff6b964ab4ec4419ab55fce1df7ed403ec1268ff
[ERR] Nodes don't agree about configuration!

Check for open slots...
Check slots coverage...
[OK] All 16384 slots covered.

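14. It turned out that the earlier cluster forget was run only on the redis01 node (7001); the cluster nodes output of the other nodes still contains the noaddr entries. Compare the two views below: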
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster nodes
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582951896674 3 connected 11089-16383
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582951898687 5 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582951897681 2 connected 5628-10922
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582951895668 6 connected
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 0-5627 10923-11088
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582951893659 4 connected



[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 master - 0 1582951903711 1 connected 166-5460
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582951906730 6 connected
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582951488414 1582951488414 7 disconnected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582951906730 3 connected 11089-16383
b29d13206033606487e51a96243691b709cf45b4 :0 master,noaddr - 1582951488414 1582951488414 0 disconnected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582951905725 4 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 myself,master - 0 0 2 connected 5628-10922
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582951488414 1582951488414 7 disconnected 0-165 5461-5627 10923-11088
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582951907738 5 connected
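
This kind of disagreement can be spotted by comparing the node IDs each node knows about. A minimal sketch, assuming the cluster nodes output of each node has already been collected as text; `stale_entries` is a hypothetical helper and the sample lines are abbreviated to the first three columns:

```python
def known_ids(cluster_nodes_text):
    """Set of node IDs (first column) listed in one node's cluster nodes output."""
    return {line.split()[0] for line in cluster_nodes_text.splitlines() if line.strip()}


def stale_entries(views):
    """Map each node label to the IDs it lists that are not known to every node."""
    ids = {label: known_ids(text) for label, text in views.items()}
    agreed = set.intersection(*ids.values())
    return {label: sorted(node_ids - agreed) for label, node_ids in ids.items()}


views = {
    "7001": """\
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master
""",
    "7002": """\
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 master
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 myself,master
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr
b29d13206033606487e51a96243691b709cf45b4 :0 master,noaddr
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr
""",
}

for label, leftovers in stale_entries(views).items():
    print(label, leftovers)
```

Each leftover ID needs its own cluster forget on the node that still lists it. Redis keeps a forgotten node on a ban list for only 60 seconds, so the command should reach all nodes within that window, otherwise gossip from the nodes that still remember the entry can bring it back.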
15. Clear the cluster state on the other nodes
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7003 cluster reset
OK
...
  • After the reset, it looks like this
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7002 myself,master - 0 0 3 connected
...
16. Handshake again. redis01 (7001) still holds the cluster information, so every other node is introduced to it with cluster meet
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster meet 10.211.55.200 7002
OK
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 master - 0 1582952153197 6 connected
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 10.211.55.200:7001 master - 0 1582952155209 1 connected 0-5627 10923-11088
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582952152292 3 connected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 master - 0 1582952154203 4 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 10.211.55.200:7002 myself,master - 0 0 2 connected
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 master - 0 1582952155208 5 connected
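
Only the command for 7002 is shown above; the same handshake has to be repeated for the remaining nodes. As a sketch, the full set of commands could be generated like this, assuming all nodes run on one host as they do here (`meet_commands` is a name invented for the example):

```python
def meet_commands(seed_host, seed_port, other_ports, cli="./redis-cli"):
    """One `cluster meet` per remaining node, each sent to the seed node."""
    return [
        [cli, "-h", seed_host, "-p", str(seed_port),
         "cluster", "meet", seed_host, str(port)]
        for port in other_ports
    ]


commands = meet_commands("10.211.55.200", 7001, range(7002, 7007))
for argv in commands:
    print(" ".join(argv))
```

Note that in the output above every node is listed as master. Turning the former replicas back into replicas would take a `cluster replicate <node-id>` on each of them, a step this post does not show.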

17. Check the cluster information
[hadoop@master redis01]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_sent:1577
cluster_stats_messages_received:2691

Success: the cluster is healthy again. Setting keys works, and the Lua script returns results too.

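Further notes

  • Each line of the cluster nodes output is made up of the following fields.

<id> <ip:port> <flags> <master> <ping-sent> <pong-recv> <config-epoch> <link-state> <slot> <slot> ... <slot>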

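  • The fields have the following meanings:
  1. id: The node ID, a 40-character random string generated when the node is created. It never changes (unless CLUSTER RESET HARD is used).

  2. ip:port: The address clients use to reach the node.

  3. flags: A comma-separated list of flags. Possible values: myself, master, slave, fail?, fail, handshake, noaddr, noflags.

  4. master: If the node is a slave of a known master, the node ID of that master; otherwise "-".

  5. ping-sent: Unix timestamp in milliseconds at which the currently pending ping was sent; 0 means there is no pending ping.

  6. pong-recv: Unix timestamp in milliseconds at which the last pong was received.

  7. config-epoch: The epoch of this node, or of its master if it is a slave. Every failover creates a new, unique, increasing epoch. If several nodes claim the same slot, the one with the higher epoch wins.

  8. link-state: The state of the link between the node and the cluster bus, either connected or disconnected.

  9. slot: A slot or slot range served by this node.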

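  • The flags have the following meanings:

    myself: The node you are currently connected to.

    master: The node is a master.

    slave: The node is a slave.

    fail?: The node is in PFAIL state. The current node cannot reach it, but it has not yet been confirmed as failed by the other nodes.

    fail: The node is in FAIL state. Most nodes could not reach it, so it was promoted from PFAIL to FAIL.

    handshake: The node has not fully joined the cluster yet and is still in the handshake phase.

    noaddr: The address of the node is unknown.

    noflags: No flags at all.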
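
To tie the field list together, here is a small parser for one line of cluster nodes output. It is only a sketch for the format shown in this post: the `NodeLine` class and `parse_node_line` function are made up for illustration, and newer Redis versions print the address as ip:port@cport, which this code simply keeps as a raw string.

```python
from dataclasses import dataclass, field


@dataclass
class NodeLine:
    node_id: str
    address: str
    flags: list
    master_id: str          # "-" when the node is not a slave of a known master
    ping_sent: int
    pong_recv: int
    config_epoch: int
    link_state: str
    slots: list = field(default_factory=list)  # inclusive (start, end) tuples


def parse_node_line(line):
    """Parse one cluster nodes line into a NodeLine record."""
    parts = line.split()
    slots = []
    for token in parts[8:]:
        if token.startswith("["):
            continue  # migrating/importing markers are skipped in this sketch
        start, _, end = token.partition("-")
        slots.append((int(start), int(end or start)))
    return NodeLine(parts[0], parts[1], parts[2].split(","), parts[3],
                    int(parts[4]), int(parts[5]), int(parts[6]), parts[7], slots)


node = parse_node_line(
    "cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - "
    "1582944656654 1582944656654 7 disconnected 0-165 5461-5627 10923-11088"
)
print(node.flags, node.link_state, node.slots)
```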
