1. Below are the errors in the Nginx log when a Lua script connects to the Redis cluster
2020/02/29 10:12:35 [notice] 621#0: signal process started
2020/02/29 10:12:42 [error] 32231#0: *438 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
2020/02/29 10:12:42 [error] 32231#0: *438 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
2020/02/29 10:12:42 [error] 622#0: *453 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "slave1:8080", referrer: "http://slave1:8080/"
2020/02/29 10:12:42 [error] 622#0: *453 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "slave1:8080", referrer: "http://slave1:8080/"
Sure enough, the connection to the cluster is failing.
2. Check that the Redis services are all running
[hadoop@master redis-cluster]$ ps -ef | grep redis
hadoop 12538 9166 0 10:14 pts/2 00:00:00 grep redis
hadoop 16086 1 0 09:44 ? 00:00:03 ./redis-server *:7001 [cluster]
hadoop 16088 1 0 09:44 ? 00:00:03 ./redis-server *:7002 [cluster]
hadoop 16094 1 0 09:44 ? 00:00:03 ./redis-server *:7003 [cluster]
hadoop 16098 1 0 09:44 ? 00:00:03 ./redis-server *:7004 [cluster]
hadoop 16102 1 0 09:44 ? 00:00:03 ./redis-server *:7005 [cluster]
hadoop 16106 1 0 09:44 ? 00:00:03 ./redis-server *:7006 [cluster]
3. Query the cluster info: it shows cluster_state:fail, so the cluster is not usable
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster info
4. Check the node info: fail and noaddr. I must have messed the cluster up while experimenting with it earlier
127.0.0.1:7001> cluster nodes
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582944656654 1582944656654 7 disconnected 0-165 5461-5627 10923-11088
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582944656655 1582944656654 7 disconnected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582945142217 4 connected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945144229 3 connected 11089-16383
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945143223 6 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582945145236 2 connected 5628-10922
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582945144229 5 connected
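With more nodes it can be tedious to pick out the broken entries by eye. The following is a minimal sketch, not part of the original session, that prints the IDs of nodes whose flags contain `fail` or `noaddr`. The function name `stale_node_ids` is hypothetical, and the sample input is three lines copied from the output above.

```shell
# Print the IDs of nodes flagged "fail" or "noaddr".
# Reads `cluster nodes` output on stdin; field 3 holds the comma-separated flags.
# "fail?" (PFAIL) is deliberately not matched.
stale_node_ids() {
  awk '$3 ~ /(^|,)(noaddr|fail)(,|$)/ { print $1 }'
}

# Sample input: three lines taken from the output above.
stale_node_ids <<'EOF'
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582944656654 1582944656654 7 disconnected 0-165 5461-5627 10923-11088
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582944656655 1582944656654 7 disconnected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945144229 3 connected 11089-16383
EOF
```

Against a live node the same function could be fed directly, for example `./redis-cli -h 127.0.0.1 -p 7001 cluster nodes | stale_node_ids`.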
5. Remove them with cluster forget
127.0.0.1:7001> cluster forget cd1d5f2d601ffc6df5f01adb413f8646b1a501bf
...
...
- After the removal the node list looks like this
127.0.0.1:7001> cluster nodes
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582945467959 4 connected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945466954 3 connected 11089-16383
6f6b90e530270e4e70a908c4cf7957a5d16f9c16 127.0.0.1:7007 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945465950 3 connected
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582945468965 6 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582945464942 2 connected 5628-10922
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582945466451 5 connected
6. Restart the Redis cluster; Nginx still reports the error
2020/02/27 08:57:25 [error] 14159#0: *119 connect() failed (111: Connection refused), client: 10.211.55.2, server: localhost, request: "GET / HTTP/1.1", host: "slave1:8080"
7. Check the Redis cluster info again

8. All 16384 slots must be assigned, and mine is missing quite a few
A healthy cluster reports:
cluster_slots_assigned:16384
cluster_slots_ok:16384
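To see how many slots are actually missing, the slot ranges in `cluster nodes` output can be summed. This is a rough sketch under the assumption that slot fields start at column 9 and are either a single number or a `start-end` range; the function name `count_assigned_slots` is hypothetical. The sample input is the three master lines from the node list in step 5.

```shell
# Sum the slot ranges found in `cluster nodes` output (stdin) and
# report how many of the 16384 slots are unassigned.
count_assigned_slots() {
  awk '
    {
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[/) continue            # skip migrating/importing entries
        n = split($i, r, "-")
        total += ((n == 2) ? r[2] - r[1] + 1 : 1)
      }
    }
    END { printf "assigned=%d missing=%d\n", total, 16384 - total }
  '
}

# Sample input: the three masters from the node list in step 5.
count_assigned_slots <<'EOF'
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582945466954 3 connected 11089-16383
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582945464942 2 connected 5628-10922
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
EOF
# prints: assigned=15885 missing=499
```

The 499 missing slots are exactly the ranges 0-165, 5461-5627 and 10923-11088 that belonged to the forgotten master.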
9. Next, add the missing slots
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001
127.0.0.1:7001> cluster addslots {5461..5627}
(error) ERR Invalid or out of range slot
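This does not work: inside the redis-cli prompt, {5461..5627} is sent to the server as one literal argument.
Curiously, the same range passed on the shell command line does work, because bash expands the braces into individual slot numbers before redis-cli runs: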
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster addslots {5461..5627}
OK
Adding the slots this way works.
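The difference between the two attempts can be checked without touching Redis at all. The sketch below assumes bash (brace expansion is not available in plain POSIX sh); the variable names `slots` and `literal` are only for illustration.

```shell
#!/usr/bin/env bash
# Unquoted: bash expands the braces into one word per slot number,
# which is what redis-cli receives when run from the shell prompt.
slots=( {5461..5627} )
echo "${#slots[@]} arguments, first=${slots[0]}, last=${slots[${#slots[@]}-1]}"
# prints: 167 arguments, first=5461, last=5627

# Quoted: the text stays a single literal word, which is effectively what the
# interactive redis-cli prompt sends and what the server rejects as a slot.
literal='{5461..5627}'
echo "literal argument: $literal"
```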
10. Remember to flush whichever node you modified (be aware that FLUSHALL deletes every key on that node)
[hadoop@master redis01]$ ./redis-cli -h 10.211.55.200 -p 7001 flushall
OK
11. Check again: OK, the slots were added and the cluster is up
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_sent:4494
cluster_stats_messages_received:6943
12. Test it by adding some data
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -p 7001
127.0.0.1:7001> keys *
(empty list or set)
127.0.0.1:7001> set a 1
(error) MOVED 15495 127.0.0.1:7003
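Solution
The MOVED reply means the key's slot lives on another node. Start redis-cli with the -c option so that it runs in cluster mode and follows such redirections automatically: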
[hadoop@master redis01]$ ./redis-cli -h 127.0.0.1 -c -p 7001
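The slot number in the MOVED reply is not arbitrary: Redis Cluster maps a key to CRC16(key) mod 16384, using the XMODEM variant of CRC16. The following bash sketch reproduces that calculation for simple ASCII keys. It is an illustration only: the function name `key_slot` is hypothetical, and hash tags (the `{...}` part of a key) are deliberately not handled.

```shell
#!/usr/bin/env bash
# Compute the Redis Cluster hash slot of an ASCII key:
# CRC16/XMODEM (polynomial 0x1021, initial value 0) modulo 16384.
# Assumption: no hash-tag handling, single-byte characters only.
key_slot() {
  local key=$1 crc=0 i j byte
  for ((i = 0; i < ${#key}; i++)); do
    byte=$(printf '%d' "'${key:i:1}")      # numeric value of the character
    crc=$(( crc ^ (byte << 8) ))
    for ((j = 0; j < 8; j++)); do
      if (( crc & 0x8000 )); then
        crc=$(( ((crc << 1) ^ 0x1021) & 0xFFFF ))
      else
        crc=$(( (crc << 1) & 0xFFFF ))
      fi
    done
  done
  echo $(( crc % 16384 ))
}

key_slot a    # prints 15495, the slot named in the MOVED reply above
```

Slot 15495 falls in the range 11089-16383 owned by 127.0.0.1:7003, which is why the node on port 7001 redirected the write there.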
13. Adding data again fails: the cluster has gone down
- Run a check; the errors are as follows
[hadoop@master redis-cluster]$ ./redis-trib.rb check 10.211.55.200:7001
Connecting to node 10.211.55.200:7001: OK
Connecting to node 127.0.0.1:7003: OK
Connecting to node 127.0.0.1:7005: OK
Connecting to node 127.0.0.1:7002: OK
Connecting to node 127.0.0.1:7006: OK
Connecting to node 127.0.0.1:7004: OK
Performing Cluster Check (using node 10.211.55.200:7001)
M: ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 10.211.55.200:7001
slots:0-5627,10923-11088 (5794 slots) master
1 additional replica(s)
M: 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003
slots:11089-16383 (5295 slots) master
1 additional replica(s)
S: ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005
slots: (0 slots) slave
replicates 7dfde834ea1ef108c081f953116fb35e30efcfa7
M: 7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002
slots:5628-10922 (5295 slots) master
1 additional replica(s)
S: 3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006
slots: (0 slots) slave
replicates 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15
S: 923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004
slots: (0 slots) slave
replicates ff6b964ab4ec4419ab55fce1df7ed403ec1268ff
[ERR] Nodes don't agree about configuration!
Check for open slots...
Check slots coverage...
[OK] All 16384 slots covered.
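14. It turns out the earlier cluster forget was only run against the redis01 node; the other nodes still carry the noaddr entries in their node tables, as the comparison below shows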
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster nodes
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582951896674 3 connected 11089-16383
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582951898687 5 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 master - 0 1582951897681 2 connected 5628-10922
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582951895668 6 connected
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 0-5627 10923-11088
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582951893659 4 connected
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 master - 0 1582951903711 1 connected 166-5460
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 slave 3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 0 1582951906730 6 connected
a84f12f0990bcc419bcf16685fa60a26244d289b :0 slave,noaddr cd1d5f2d601ffc6df5f01adb413f8646b1a501bf 1582951488414 1582951488414 7 disconnected
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582951906730 3 connected 11089-16383
b29d13206033606487e51a96243691b709cf45b4 :0 master,noaddr - 1582951488414 1582951488414 0 disconnected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 slave ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 0 1582951905725 4 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 127.0.0.1:7002 myself,master - 0 0 2 connected 5628-10922
cd1d5f2d601ffc6df5f01adb413f8646b1a501bf :0 master,fail,noaddr - 1582951488414 1582951488414 7 disconnected 0-165 5461-5627 10923-11088
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 slave 7dfde834ea1ef108c081f953116fb35e30efcfa7 0 1582951907738 5 connected
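Comparing two views by eye is error-prone, so here is a small sketch that lists the node IDs one node knows about and another does not. The function name `ids_only_in_second` and the file names in the usage comment are hypothetical; the inputs are saved `cluster nodes` outputs.

```shell
# Print node IDs that appear in the second `cluster nodes` dump
# but not in the first. Arguments: two files holding saved output.
ids_only_in_second() {
  a_ids=$(mktemp)
  b_ids=$(mktemp)
  awk '{ print $1 }' "$1" | sort > "$a_ids"
  awk '{ print $1 }' "$2" | sort > "$b_ids"
  comm -13 "$a_ids" "$b_ids"      # keep lines unique to the second file
  rm -f "$a_ids" "$b_ids"
}

# Possible usage against the two nodes compared above:
#   ./redis-cli -h 10.211.55.200 -p 7001 cluster nodes > view7001.txt
#   ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes > view7002.txt
#   ids_only_in_second view7001.txt view7002.txt
```

For the two views shown above this should report the three stale IDs starting with a84f12f0, b29d1320 and cd1d5f2d, which 7002 still remembers and 7001 has forgotten.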
15. Clear the cluster state on the other nodes
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7003 cluster reset
OK
...
- After the reset it looks like this
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7002 myself,master - 0 0 3 connected
...
16. Handshake again. Redis01 (port 7001) still holds the node and slot information, so every other node is introduced to it with cluster meet
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster meet 10.211.55.200 7002
OK
[hadoop@master bin]$ ./redis-cli -h 10.211.55.200 -p 7002 cluster nodes
3d6862ccb6247007d2fe5c46fb54df8facf734bd 127.0.0.1:7006 master - 0 1582952153197 6 connected
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 10.211.55.200:7001 master - 0 1582952155209 1 connected 0-5627 10923-11088
3a236c23b07b704dd8cd8f5cffdf5baf8546ff15 127.0.0.1:7003 master - 0 1582952152292 3 connected
923f20f487942dd1e92e4a2a32418ac74ab0afcb 127.0.0.1:7004 master - 0 1582952154203 4 connected
7dfde834ea1ef108c081f953116fb35e30efcfa7 10.211.55.200:7002 myself,master - 0 0 2 connected
ceba363458444c82175944290aa18aca641631a3 127.0.0.1:7005 master - 0 1582952155208 5 connected
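Only the meet for port 7002 is shown above. A dry-run sketch like the following can print the full set of commands for review before running them. The function name `print_meet_commands` is hypothetical, and it assumes all nodes share one host, as in this setup.

```shell
# Print (not execute) one `cluster meet` command per remaining node.
# Arguments: host, seed port, then the ports of the nodes to introduce.
print_meet_commands() {
  seed_host=$1
  seed_port=$2
  shift 2
  for port in "$@"; do
    echo "./redis-cli -h $seed_host -p $seed_port cluster meet $seed_host $port"
  done
}

print_meet_commands 10.211.55.200 7001 7002 7003 7004 7005 7006
```

Note that in the output above every node is listed as master after the reset. Restoring the former replica roles would take an additional CLUSTER REPLICATE on each replica, a step the original session does not show.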
17. Check the cluster info
[hadoop@master redis01]$ ./redis-cli -h 10.211.55.200 -p 7001 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_sent:1577
cluster_stats_messages_received:2691
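Success: the cluster is healthy again, keys can be set, and the Lua script returns its result.
Further notes
- Each line of cluster nodes output is made up of the following fields.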
id ip:port flags master ping-sent pong-recv config-epoch link-state slot …
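- The fields have the following meaning:
- id: the node ID, a 40-character random string generated when the node is created. It never changes (unless CLUSTER RESET HARD is used).
- ip:port: the address clients use to reach the node.
- flags: comma-separated flags. Possible values: myself, master, slave, fail?, fail, handshake, noaddr, noflags.
- master: if the node is a slave of a known master, the master's node ID; otherwise "-".
- ping-sent: Unix time in milliseconds at which the currently pending ping was sent; 0 means no ping is pending.
- pong-recv: Unix time in milliseconds at which the last pong was received.
- config-epoch: the configuration epoch of this node (or of its master, if the node is a slave). Every failover creates a new, unique, increasing epoch. When several nodes claim the same slot, the one with the higher epoch wins.
- link-state: state of the link used for the cluster bus between nodes, either connected or disconnected.
- slot: the slots served by this node.
- The individual flags mean:
  - myself: the node you are currently connected to.
  - master: the node is a master.
  - slave: the node is a slave.
  - fail?: the node is in PFAIL state. The current node cannot reach it, but it has not been marked as failed by the majority.
  - fail: the node is in FAIL state. A majority of nodes could not reach it, so it was promoted from PFAIL to FAIL.
  - handshake: the node has not fully joined the cluster and is still handshaking.
  - noaddr: the node's address is unknown.
  - noflags: no flags are set.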
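As a reading aid, the field layout described above can be applied mechanically to a line of output. This is a small sketch, with the hypothetical function name `describe_node_line`, that labels each field; the sample line is the 7001 entry from step 4.

```shell
# Label the fields of each `cluster nodes` line read from stdin.
# Everything from column 9 onward is treated as slot information.
describe_node_line() {
  awk '{
    printf "id=%s\naddr=%s\nflags=%s\nmaster=%s\n", $1, $2, $3, $4
    printf "ping-sent=%s\npong-recv=%s\nconfig-epoch=%s\nlink-state=%s\n", $5, $6, $7, $8
    slots = ""
    for (i = 9; i <= NF; i++) slots = slots (slots == "" ? "" : " ") $i
    printf "slots=%s\n", slots
  }'
}

describe_node_line <<'EOF'
ff6b964ab4ec4419ab55fce1df7ed403ec1268ff 127.0.0.1:7001 myself,master - 0 0 1 connected 166-5460
EOF
```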
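This post records the problems encountered with a Redis cluster and how they were resolved: a Lua script failing to connect, cluster state fail, missing slots, inconsistent node information and data loss. They were repaired with the cluster forget, reset and meet commands until the cluster ran normally again.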