1. Business background
2. Symptoms:
The following entries appear in the Redis log:
3963:S 28 Jul 12:26:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:40:25.080 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 13:11:51.146 * FAIL message received from ae3213272c3bf10556a4798d73b2414cb4e2e78f about 3c1067d381a504dacc86766b349739c8c9e0ae5a
3963:S 28 Jul 13:11:52.306 * Clear FAIL state for node 3c1067d381a504dacc86766b349739c8c9e0ae5a: slave is reachable again.
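The "Asynchronous AOF fsync is taking too long (disk is busy?)" entries mean that a background fsync of the AOF file was still running after about 2 seconds, so Redis went ahead and wrote the AOF buffer without waiting for it. That write can block while the disk is busy, and because it runs in the Redis main thread, the node stops processing command requests until the write returns, including cluster bus traffic. The other nodes then mark the unresponsive node as failed and broadcast that state through the gossip protocol (the "FAIL message received" entry), which can lead to a vote and a failover in the Redis cluster. In the log above the node reported as FAIL is a slave, and its FAIL state is cleared about one second later once it is reachable again.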
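To make the mechanism concrete, here is a minimal Python sketch of the decision Redis appears to make before each AOF flush when appendfsync is everysec. It is a simplified model written for this note, not the Redis source: AofState, its field names, and flush_decision are illustrative, and the 2-second postponement window is an assumption based on my reading of the behaviour behind the log message.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed postponement window: how long a flush waits for a pending background fsync.
POSTPONE_LIMIT_SECONDS = 2


@dataclass
class AofState:
    """Illustrative stand-in for the few server fields involved."""
    flush_postponed_start: Optional[int] = None  # None: no flush is postponed
    delayed_fsync: int = 0  # times the buffer was written despite a pending fsync


def flush_decision(state: AofState, now: int, fsync_in_progress: bool) -> str:
    """Return 'postpone', 'write', or 'write_with_warning' for one flush attempt."""
    if fsync_in_progress:
        if state.flush_postponed_start is None:
            # First attempt that finds an fsync pending: remember when we started waiting.
            state.flush_postponed_start = now
            return "postpone"
        if now - state.flush_postponed_start < POSTPONE_LIMIT_SECONDS:
            # Still inside the window: keep waiting.
            return "postpone"
        # Waited long enough: write anyway. This is where the warning is logged,
        # and the write itself may block the main thread if the disk is busy.
        state.delayed_fsync += 1
        state.flush_postponed_start = None
        return "write_with_warning"
    # No fsync pending: the normal, fast path.
    state.flush_postponed_start = None
    return "write"


if __name__ == "__main__":
    state = AofState()
    for second in (100, 101, 102, 103):
        pending = second < 103  # pretend the fsync finishes at t=103
        print(second, flush_decision(state, second, pending))
```

The delayed_fsync counter in this model plays the same role as the aof_delayed_fsync field reported by INFO persistence, which is a convenient way to check how often this has happened on a node.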
4. How it works:
Write-up by Fu Lei (付磊): https://carlosfu.iteye.com/blog/2259482
AOF design: https://redisbook.readthedocs.io/en/latest/internal/aof.html
5. Solutions
- Set cluster-node-timeout to 15 s (the directive takes milliseconds, so 15000) to tolerate network latency between nodes.
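- Disable AOF (if the business system has another database that persists the data), or switch the AOF fsync mode to AOF_FSYNC_ALWAYS by setting appendfsync to always (this fsyncs synchronously after every write, so it trades throughput for durability). See the appendfsync setting for details.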
- Set the kernel parameter vm.dirty_background_ratio=10 (not fully understood yet; this needs a deeper look at the Redis source code).
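As a rough aid for checking the two Redis directives mentioned above, the following Python sketch parses redis.conf-style text and flags the values discussed in this note. The function names parse_redis_conf and review_settings, the sample values, and the wording of the notes are all made up for illustration; the sketch does not touch vm.dirty_background_ratio, which is a kernel setting rather than a Redis directive.

```python
def parse_redis_conf(text: str) -> dict:
    """Parse 'directive value' lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


def review_settings(settings: dict) -> list:
    """Return human-readable notes about the directives discussed in this note."""
    notes = []
    timeout = settings.get("cluster-node-timeout")
    # cluster-node-timeout is expressed in milliseconds.
    if timeout is not None and int(timeout) < 15000:
        notes.append("cluster-node-timeout is below 15000 ms")
    aof_on = settings.get("appendonly", "no") == "yes"
    # everysec is assumed here when appendfsync is not set explicitly.
    if aof_on and settings.get("appendfsync", "everysec") == "everysec":
        notes.append("AOF is enabled with appendfsync everysec")
    return notes


# Illustrative values only, not taken from the affected cluster.
sample = """
# sample configuration
cluster-node-timeout 5000
appendonly yes
appendfsync everysec
"""

if __name__ == "__main__":
    for note in review_settings(parse_redis_conf(sample)):
        print(note)
```

Keeping the parser separate from the review step makes it easy to feed in text from a file or from another source without changing the checks.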