故障表象:
业务层面显示提示查询redis失败
集群组成:
3主3从,每个节点的数据有8GB
机器分布:
在同一个机架中,
xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201
redis-server进程状态:
通过命令ps -eo pid,lstart | grep $pid,
发现进程已经持续运行了3个月
发生故障前集群的节点状态:
xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave
---------------------------------下面是日志分析--------------------------------------
步1:
主节点8371失去和从节点8373的连接:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.
步2:
主节点8370/8375判定8371失联:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).
步3:
从节点8372/8373/8374收到主节点8375说8371失联:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65
步4:
主节点8370/8375授权8373升级为主节点转移:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 f