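Incident title
Handling a Redis cluster failure
Time of incident
April 1, 2020, 15:00
Incident description
1. Customer service staff reported that users could not access the relevant APIs.
2. Developers reported the following error in the application log: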
redis.clients.jedis.exceptions.JedisException: Could not get a resource from the pool
at redis.clients.util.Pool.getResource(Pool.java:51)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:66)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:116)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:31)
at redis.clients.jedis.JedisCluster.llen(JedisCluster.java:544)
at cn.com.dhc.service.impl.JedisClusterServiceImpl.redisToDb(JedisClusterServiceImpl.java:64)
at cn.com.dhc.service.impl.TimeServiceImpl.getRedisToDB(TimeServiceImpl.java:212)
at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Unable to validate object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:502)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
at redis.clients.util.Pool.getResource(Pool.java:49)
… 20 more
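The application could not borrow a usable connection from the Jedis pool because connection validation failed, which points to a problem on the Redis side.
Fault analysis
1. First, check the state of the Redis cluster.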
192.168.1.16:6383> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:4
cluster_stats_messages_ping_sent:40113208
cluster_stats_messages_pong_sent:20808718
cluster_stats_messages_meet_sent:4
cluster_stats_messages_fail_sent:4
cluster_stats_messages_sent:60921934
cluster_stats_messages_ping_received:20808716
cluster_stats_messages_pong_received:21350562
cluster_stats_messages_meet_received:2
cluster_stats_messages_fail_received:1
cluster_stats_messages_received:42159281
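The cluster state is reported as healthy: cluster_state is ok and all 16384 slots are assigned.
2. Next, check the state of each node in the cluster.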
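The check above can be automated. The following is a minimal sketch, assuming the `cluster info` text has already been captured from redis-cli; the function names `parse_cluster_info` and `cluster_info_problems` are hypothetical helpers, not part of any Redis client. Keep in mind that `cluster_state:ok` only says that every hash slot is being served, so a failed replica does not show up in this summary.

```python
REDIS_CLUSTER_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def parse_cluster_info(text):
    """Parse `cluster info` output ("key:value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        info[key] = int(value) if value.isdigit() else value
    return info


def cluster_info_problems(info):
    """Return a list of problems; an empty list means the summary looks healthy."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append("cluster_state is %r" % info.get("cluster_state"))
    if info.get("cluster_slots_assigned") != REDIS_CLUSTER_SLOTS:
        problems.append("not all slots are assigned")
    for key in ("cluster_slots_pfail", "cluster_slots_fail"):
        if info.get(key, 0) != 0:
            problems.append("%s = %s" % (key, info[key]))
    return problems


# Excerpt of the output shown above.
sample = """cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3"""

print(cluster_info_problems(parse_cluster_info(sample)))  # prints []
```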
192.168.1.16:6383> cluster nodes
3a96d36afc530e96dd461221ca4cb29ff1ab8fd1 192.168.1.19:6381@16381 master - 0 1585726150005 2 connected 10923-16383
017696247b87dfe42f3fb6f8ba0529beede46bf2 192.168.1.19:6380@16380 master - 0 1585726150000 1 connected 0-5460
e909012b4346d46cc0d5c92a4f339ad2f24440e4 :0@0 slave,fail,noaddr 6ba5bd60cc50e7591a5105f489c20dca6c35a169 1575874315684 1575874313000 4 disconnected
bae5852f451b62dbb7af3f3f87013757ed3c86c0 192.168.1.16:6384@16384 slave 3a96d36afc530e96dd461221ca4cb29ff1ab8fd1 0 1585726151006 5 connected
6ba5bd60cc50e7591a5105f489c20dca6c35a169 192.168.1.16:6383@16383 myself,master - 0 1585726148000 4 connected 5461-10922
0ae0ffea6fb35f9a6711d4a0eb9a9fe34b5476d6 192.168.1.16:6385@16385 slave 017696247b87dfe42f3fb6f8ba0529beede46bf2 0 1585726152009 6 connected
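One of the replica nodes is in the fail state (flags slave,fail,noaddr), so that replica has a problem.
3. At this point we concluded that the Redis cluster itself was at fault.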
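Spotting the failed node by eye gets harder as the cluster grows. Below is a minimal sketch that scans `cluster nodes` output for nodes carrying a failure flag. It relies only on the first four fields of each line (node id, address, flags, master id); the helper names and the `FAILURE_FLAGS` set are assumptions made for illustration.

```python
# Flags that indicate trouble; "fail?" is how a suspected failure (PFAIL) is displayed.
FAILURE_FLAGS = {"fail", "fail?", "noaddr"}


def parse_cluster_nodes(text):
    """Parse `cluster nodes` output into a list of dicts, one per node."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        nodes.append({
            "id": fields[0],
            "addr": fields[1],
            "flags": fields[2].split(","),
            # Masters show "-" in the master-id column.
            "master_id": None if fields[3] == "-" else fields[3],
        })
    return nodes


def failed_nodes(nodes):
    """Return the nodes that carry at least one failure flag."""
    return [n for n in nodes if FAILURE_FLAGS & set(n["flags"])]


# Two lines taken from the output above: the failed replica and its master.
sample = """\
e909012b4346d46cc0d5c92a4f339ad2f24440e4 :0@0 slave,fail,noaddr 6ba5bd60cc50e7591a5105f489c20dca6c35a169 1575874315684 1575874313000 4 disconnected
6ba5bd60cc50e7591a5105f489c20dca6c35a169 192.168.1.16:6383@16383 myself,master - 0 1585726148000 4 connected 5461-10922
"""

for node in failed_nodes(parse_cluster_nodes(sample)):
    print(node["id"], ",".join(node["flags"]), "replica of", node["master_id"])
```

In this incident the failed replica belongs to master 6ba5bd60cc50e7591a5105f489c20dca6c35a169 (192.168.1.16:6383), so slots 5461-10922 had no working replica even though the cluster state was ok.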
Resolution
1. Stop all Redis nodes.
2. On each node, delete the cached files, including files such as node-6380.conf and dump.rdb.
3. Restart each Redis node.
4. Re-create the Redis cluster.
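The source does not say which tool was used for step 4. As one possibility, the sketch below builds the command line for `redis-cli --cluster create`, which is available in Redis 5.0 and later (older releases shipped `redis-trib.rb` instead). The helper name and the 127.0.0.1:7000-7005 addresses are placeholders for illustration, not the addresses from this incident. Note also that deleting dump.rdb in step 2 discards the data stored in Redis, so this procedure only suits data that can be rebuilt.

```python
def build_cluster_create_command(nodes, replicas=1):
    """Build the argv list for `redis-cli --cluster create` (Redis 5.0+).

    nodes    -- list of "host:port" strings for every node in the new cluster
    replicas -- number of replicas per master
    """
    # Redis Cluster needs at least 3 masters, each with `replicas` replicas.
    minimum = 3 * (replicas + 1)
    if len(nodes) < minimum:
        raise ValueError("need at least %d nodes, got %d" % (minimum, len(nodes)))
    return (["redis-cli", "--cluster", "create"]
            + list(nodes)
            + ["--cluster-replicas", str(replicas)])


# Placeholder addresses; replace with the real host:port of each restarted node.
example_nodes = ["127.0.0.1:%d" % port for port in range(7000, 7006)]
print(" ".join(build_cluster_create_command(example_nodes)))
```

The resulting list can be passed to `subprocess.run` once every node has been restarted with an empty data directory.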
Lessons learned
Add Redis monitoring through Zabbix, for example monitoring of the Redis ports.
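A port check of this kind can be sketched in a few lines. The script below is an illustration only: the function name is hypothetical, and it assumes the script would be wired into Zabbix through a `UserParameter` entry in the agent configuration. Zabbix also provides a built-in `net.tcp.port` item that may make a custom script unnecessary.

```python
import socket
import sys


def redis_port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_redis_port.py <host> <port>
    # Prints 1 if the port accepts connections, 0 otherwise.
    print(1 if redis_port_is_open(sys.argv[1], int(sys.argv[2])) else 0)
```

A plain port check only detects a node that is down or unreachable. It would be worth pairing it with a check on the `cluster nodes` flags, since a node can be marked fail from the cluster's point of view while other ports keep answering.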