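Incident handling: resolving a Redis cluster failure

Incident name

Resolving a Redis cluster failure

Time of incident

April 1, 2020, 15:00

Incident description

1. Customer service staff reported that clients could not access the relevant APIs.
2. Developers reported the following error in the application log: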
redis.clients.jedis.exceptions.JedisException: Could not get a resource from the pool
at redis.clients.util.Pool.getResource(Pool.java:51)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:66)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:116)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:31)
at redis.clients.jedis.JedisCluster.llen(JedisCluster.java:544)
at cn.com.dhc.service.impl.JedisClusterServiceImpl.redisToDb(JedisClusterServiceImpl.java:64)
at cn.com.dhc.service.impl.TimeServiceImpl.getRedisToDB(TimeServiceImpl.java:212)
at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Unable to validate object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:502)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
at redis.clients.util.Pool.getResource(Pool.java:49)
… 20 more
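The trace shows that the application could not borrow a validated Redis connection from the Jedis pool, which points to a problem on the Redis side.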

Fault analysis

1. First, check the state of the Redis cluster.
192.168.1.16:6383> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:4
cluster_stats_messages_ping_sent:40113208
cluster_stats_messages_pong_sent:20808718
cluster_stats_messages_meet_sent:4
cluster_stats_messages_fail_sent:4
cluster_stats_messages_sent:60921934
cluster_stats_messages_ping_received:20808716
cluster_stats_messages_pong_received:21350562
cluster_stats_messages_meet_received:2
cluster_stats_messages_fail_received:1
cluster_stats_messages_received:42159281
The cluster state is ok, and all 16384 slots are assigned and healthy.
2. Next, check the state of the individual cluster nodes.
192.168.1.16:6383> cluster nodes
3a96d36afc530e96dd461221ca4cb29ff1ab8fd1 192.168.1.19:6381@16381 master - 0 1585726150005 2 connected 10923-16383
017696247b87dfe42f3fb6f8ba0529beede46bf2 192.168.1.19:6380@16380 master - 0 1585726150000 1 connected 0-5460
e909012b4346d46cc0d5c92a4f339ad2f24440e4 :0@0 slave,fail,noaddr 6ba5bd60cc50e7591a5105f489c20dca6c35a169 1575874315684 1575874313000 4 disconnected
bae5852f451b62dbb7af3f3f87013757ed3c86c0 192.168.1.16:6384@16384 slave 3a96d36afc530e96dd461221ca4cb29ff1ab8fd1 0 1585726151006 5 connected
6ba5bd60cc50e7591a5105f489c20dca6c35a169 192.168.1.16:6383@16383 myself,master - 0 1585726148000 4 connected 5461-10922
0ae0ffea6fb35f9a6711d4a0eb9a9fe34b5476d6 192.168.1.16:6385@16385 slave 017696247b87dfe42f3fb6f8ba0529beede46bf2 0 1585726152009 6 connected
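One replica node is in the fail state (flags slave,fail,noaddr), so that replica is faulty. It is a replica of the master 6ba5bd60cc50e7591a5105f489c20dca6c35a169 at 192.168.1.16:6383, which serves slots 5461-10922.
3. At this point we concluded that the problem lay in the Redis cluster itself.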
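Because cluster info reported cluster_state:ok while a replica was already marked fail, it can help to scan the cluster nodes output for unhealthy flags automatically. The following is a minimal sketch of such a check, not the method used during the incident. The function name find_unhealthy_nodes and the choice of flags to treat as unhealthy are assumptions made for illustration. It reads only the first four fields of each line (node ID, address, flags, master ID).

```python
from typing import Dict, List

# Flags treated as unhealthy in this sketch (an assumption, adjust as needed).
# "fail?" is how CLUSTER NODES prints the PFAIL state.
UNHEALTHY_FLAGS = {"fail", "fail?", "noaddr"}

# Two sample lines modelled on the cluster nodes output above.
SAMPLE = """\
e909012b4346d46cc0d5c92a4f339ad2f24440e4 :0@0 slave,fail,noaddr 6ba5bd60cc50e7591a5105f489c20dca6c35a169 1575874315684 1575874313000 4 disconnected
6ba5bd60cc50e7591a5105f489c20dca6c35a169 192.168.1.16:6383@16383 myself,master - 0 1585726148000 4 connected 5461-10922
"""


def find_unhealthy_nodes(cluster_nodes_output: str) -> List[Dict[str, str]]:
    """Return one record per node whose flags include an unhealthy marker."""
    unhealthy = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        node_id, address, flags, master_id = fields[:4]
        if set(flags.split(",")) & UNHEALTHY_FLAGS:
            unhealthy.append(
                {
                    "node_id": node_id,
                    "address": address,
                    "flags": flags,
                    "master_id": master_id,
                }
            )
    return unhealthy


if __name__ == "__main__":
    for node in find_unhealthy_nodes(SAMPLE):
        print(node["node_id"], node["flags"], "replica of", node["master_id"])
```

In practice the input would be the text returned by running cluster nodes against any reachable node. A non-empty result could then raise an alert even while cluster_state still reads ok.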

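Resolution steps

1. Stop all Redis nodes.
2. On every node, delete the cached files, including files such as node-6380.conf and dump.rdb (the RDB snapshot).
3. Restart every Redis node.
4. Recreate the Redis cluster.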
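The source does not say which tool was used to recreate the cluster in step 4. As a hedged illustration, the sketch below assembles the argument list for redis-cli --cluster create, which is available in Redis 5 and later (earlier versions shipped the redis-trib.rb script instead). The function name build_cluster_create_command is hypothetical. The address of the failed replica is not shown in the output above (it appears as :0@0), so the sixth address in the example is a placeholder assumption, not a value from the incident.

```python
from typing import List


def build_cluster_create_command(nodes: List[str], replicas: int = 1) -> List[str]:
    """Build the argv for `redis-cli --cluster create` (Redis 5 or later)."""
    # A Redis cluster needs at least three masters, each with `replicas` replicas.
    min_nodes = 3 * (replicas + 1)
    if len(nodes) < min_nodes:
        raise ValueError(
            f"need at least {min_nodes} nodes for {replicas} replica(s) per master, "
            f"got {len(nodes)}"
        )
    for node in nodes:
        host, sep, port = node.rpartition(":")
        if not (sep and host and port.isdigit()):
            raise ValueError(f"expected host:port, got {node!r}")
    return ["redis-cli", "--cluster", "create", *nodes, "--cluster-replicas", str(replicas)]


if __name__ == "__main__":
    nodes = [
        "192.168.1.19:6380",
        "192.168.1.19:6381",
        "192.168.1.19:6382",  # hypothetical: the failed node's real address is unknown
        "192.168.1.16:6383",
        "192.168.1.16:6384",
        "192.168.1.16:6385",
    ]
    print(" ".join(build_cluster_create_command(nodes)))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting. The sketch only builds the command and does not execute it.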

Lessons learned

Add Redis monitoring through Zabbix, for example monitoring of each Redis node's port.
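A port monitor boils down to a periodic TCP connection attempt. The sketch below shows that check in plain Python, as an illustration of what such a monitor tests rather than of how Zabbix implements it. The function names and the 2-second timeout are assumptions.

```python
import socket
from typing import Dict, List, Tuple


def redis_port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_nodes(nodes: List[Tuple[str, int]], timeout: float = 2.0) -> Dict[str, bool]:
    """Check every (host, port) pair and map "host:port" to its reachability."""
    return {f"{host}:{port}": redis_port_is_open(host, port, timeout) for host, port in nodes}
```

For the cluster in this incident, one could call check_nodes([("192.168.1.19", 6380), ("192.168.1.19", 6381), ("192.168.1.16", 6383), ("192.168.1.16", 6384), ("192.168.1.16", 6385)]) and alert on any False value. An open port only shows that the process accepts connections, so it complements rather than replaces checking the node flags.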
