Redis Cluster goes down because of AOF

1. Business background

2. Symptoms:

The following lines appear in the Redis log:

3963:S 28 Jul 12:26:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:40:25.080 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 13:11:51.146 * FAIL message received from ae3213272c3bf10556a4798d73b2414cb4e2e78f about 3c1067d381a504dacc86766b349739c8c9e0ae5a
3963:S 28 Jul 13:11:52.306 * Clear FAIL state for node 3c1067d381a504dacc86766b349739c8c9e0ae5a: slave is reachable again.

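The message "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis." (for example at 28 Jul 12:37:18.048) is logged when the background fsync is still running and Redis goes ahead with the AOF write anyway. On a busy disk that write can block. Because the write happens in the Redis main thread, the main thread stalls and stops handling all command requests, including cluster-related communication. Meanwhile, every Redis Cluster node broadcasts failure reports through the gossip protocol. Once the other nodes receive these reports they mark the stalled node as FAIL, which leads to voting and to the cluster reconfiguring itself (failover).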
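To see how often the slow-fsync warning precedes a FAIL event, the log can be summarized with a short script. The following is a minimal sketch, assuming the line layout shown in the excerpt above (pid, role, date, time, level marker, message); the names `LOG_LINE`, `classify`, and `summarize` are illustrative and not part of Redis.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "<pid>:<role> <day> <month> <time> <level> <message>"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<stamp>\d{1,2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<msg>.*)$"
)


def classify(message: str) -> str:
    """Map a log message to one of the event kinds discussed in this post."""
    if message.startswith("Asynchronous AOF fsync is taking too long"):
        return "aof_fsync_slow"
    if message.startswith("FAIL message received"):
        return "fail_received"
    if message.startswith("Clear FAIL state"):
        return "fail_cleared"
    return "other"


def summarize(lines):
    """Count event kinds over an iterable of raw log lines; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[classify(match.group("msg"))] += 1
    return counts


SAMPLE = [
    "3963:S 28 Jul 12:26:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.",
    "3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.",
    "3963:S 28 Jul 12:40:25.080 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.",
    "3963:S 28 Jul 13:11:51.146 * FAIL message received from ae3213272c3bf10556a4798d73b2414cb4e2e78f about 3c1067d381a504dacc86766b349739c8c9e0ae5a",
    "3963:S 28 Jul 13:11:52.306 * Clear FAIL state for node 3c1067d381a504dacc86766b349739c8c9e0ae5a: slave is reachable again.",
]

if __name__ == "__main__":
    print(dict(summarize(SAMPLE)))
```

Running the same function over a full log file (for example `summarize(open(path))`, with `path` pointing at your own log) gives a quick view of whether FAIL events cluster around periods of slow fsync.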

4. How it works:

Fu Lei's write-up: https://carlosfu.iteye.com/blog/2259482

AOF design principles: https://redisbook.readthedocs.io/en/latest/internal/aof.html
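The decision that produces the warning can be modeled in a few lines. This is a simplified sketch of the appendfsync everysec behavior as I understand it, not a copy of the Redis source: if a background fsync is still in progress, the flush is postponed for up to about 2 seconds, after which Redis writes anyway and logs the message. The 2-second limit, the class `AofState`, and the function `flush_decision` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a pending background fsync is given roughly 2 seconds before Redis writes anyway.
POSTPONE_LIMIT_SECONDS = 2.0


@dataclass
class AofState:
    postponed_since: Optional[float] = None  # when the current postponement started
    delayed_fsync_count: int = 0             # how many times the write went ahead regardless


def flush_decision(state: AofState, now: float, fsync_in_progress: bool) -> str:
    """Return 'write', 'postpone', or 'write_anyway' for one flush attempt."""
    if not fsync_in_progress:
        # Disk is keeping up: write immediately and clear any postponement.
        state.postponed_since = None
        return "write"
    if state.postponed_since is None:
        # First attempt while fsync is busy: start waiting.
        state.postponed_since = now
        return "postpone"
    if now - state.postponed_since < POSTPONE_LIMIT_SECONDS:
        # Still within the grace period.
        return "postpone"
    # Grace period exhausted: this is where the warning would be logged.
    # The write that follows may block the main thread while the disk is busy.
    state.delayed_fsync_count += 1
    state.postponed_since = None
    return "write_anyway"


if __name__ == "__main__":
    state = AofState()
    for t in (0.0, 1.0, 2.5):
        print(t, flush_decision(state, now=t, fsync_in_progress=True))
```

The "write_anyway" branch is the interesting one: it corresponds to the log line in the symptoms section, and the blocking write that follows it is what keeps the node from answering cluster messages.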

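5. Solutions

  1. Set cluster-node-timeout to 15 s (the value in redis.conf is in milliseconds, i.e. 15000) so that short node stalls and network delays are tolerated.
  2. Disable AOF (if the business system has another database that persists the data), or set the AOF mode to AOF_FSYNC_ALWAYS, i.e. set the appendfsync parameter to always. Note that always performs the fsync synchronously in the main thread, so expect lower write throughput.
  3. Set the kernel parameter vm.dirty_background_ratio=10 (not yet fully understood; needs a deeper look at the Redis source).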
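The three suggestions above can be collected into concrete settings. The following is a minimal sketch that only builds and prints the values; it does not touch a running server. The helper names `mitigation_settings` and `render` are hypothetical, and the values are the ones proposed in this post, not general recommendations.

```python
def mitigation_settings(keep_aof: bool) -> dict:
    """Collect the settings suggested above, grouped by where they are applied."""
    redis_conf = {
        # cluster-node-timeout is expressed in milliseconds: 15 s -> 15000
        "cluster-node-timeout": "15000",
    }
    if keep_aof:
        # Option from item 2: keep AOF but fsync synchronously on writes.
        redis_conf["appendonly"] = "yes"
        redis_conf["appendfsync"] = "always"
    else:
        # Option from item 2: turn AOF off when another database persists the data.
        redis_conf["appendonly"] = "no"
    sysctl = {"vm.dirty_background_ratio": "10"}
    return {"redis.conf": redis_conf, "sysctl": sysctl}


def render(settings: dict) -> str:
    """Render the settings as redis.conf lines followed by sysctl.conf lines."""
    lines = ["# redis.conf"]
    lines += [f"{key} {value}" for key, value in settings["redis.conf"].items()]
    lines.append("# /etc/sysctl.conf")
    lines += [f"{key}={value}" for key, value in settings["sysctl"].items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(mitigation_settings(keep_aof=False)))
```

Which branch of item 2 to take depends on whether losing the AOF is acceptable for the workload; the sketch keeps both so the trade-off stays explicit.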
