Redis Cluster data loss: no automatic failover after a machine crash, and after a manual takeover the old master grabbed the master role back on restart

1. Background

A machine went down, and on inspection we found that one node had not failed over automatically. To keep the cluster available to clients, we decided to force the failover manually.

2. Incident Description

We first ran cluster failover force to force the failover, but three attempts in a row failed, so we escalated to the more forceful cluster failover takeover. After that the cluster was able to serve traffic again.

However, after the crashed machine was rebooted and we began bringing its Redis instances back up, we found that on the shard that had failed to fail over automatically, the old master had become the master again, and the data was gone...

3. Problem Analysis

3.1 Log Records

Log of the replica that failed to fail over normally (node 13808):

# The machine is down; FAIL (objectively down) notices about other nodes arrive one after another
6836:S 08 May 11:37:12.204 * Marking node f4edfc38749148be941481e3d17fb14ecdc62947 as failing (quorum reached).
6836:S 08 May 11:37:12.399 * Marking node 6a82a479431cab234bb4b8330ca4dea2d409871e as failing (quorum reached).
...
6836:S 08 May 11:40:15.287 * Marking node d84c1798cf3470cdbd3bd8a2261d59c117ff918e as failing (quorum reached).

# After the crash the replica kept trying to start a failover election, but 11 of this cluster's 30 masters lived on that machine, so the voting traffic between masters was disrupted and the election never succeeded
6836:S 08 May 11:40:15.384 # Start of election delayed for 786 milliseconds (rank #0, offset 2989077250425).
6836:S 08 May 11:40:16.254 # Starting a failover election for epoch 141.
# Communication problems: the master votes never reach a majority, so the failover cannot proceed
6836:S 08 May 11:40:35.345 # Currently unable to failover: Waiting for votes, but majority still not reached.
6836:S 08 May 11:40:46.209 # Currently unable to failover: Failover attempt expired.
6836:S 08 May 11:41:16.242 # Start of election delayed for 751 milliseconds (rank #0, offset 2989077250425).
6836:S 08 May 11:41:16.338 # Currently unable to failover: Waiting the delay before I can start a new failover.
6836:S 08 May 11:41:17.014 # Starting a failover election for epoch 142.
6836:S 08 May 11:41:17.110 # Currently unable to failover: Waiting for votes, but majority still not reached.
6836:S 08 May 11:41:19.977 # MASTER timeout: no data nor PING received...
6836:S 08 May 11:41:19.977 # Connection with master lost.
6836:S 08 May 11:41:19.977 * Caching the disconnected master state.
6836:S 08 May 11:41:19.977 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:41:19.977 * MASTER <-> SLAVE sync started
6836:S 08 May 11:41:28.976 # Error condition on socket for SYNC: Connection timed out
...
6836:S 08 May 11:42:16.096 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:42:16.096 * MASTER <-> SLAVE sync started
6836:S 08 May 11:42:17.033 # Start of election delayed for 754 milliseconds (rank #0, offset 2989077250425).
6836:S 08 May 11:42:17.131 # Currently unable to failover: Waiting the delay before I can start a new failover.
6836:S 08 May 11:42:17.806 # Starting a failover election for epoch 143.
6836:S 08 May 11:42:17.904 # Currently unable to failover: Waiting for votes, but majority still not reached.
6836:S 08 May 11:42:25.096 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:42:25.108 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:42:25.108 * MASTER <-> SLAVE sync started
6836:S 08 May 11:42:34.108 # Error condition on socket for SYNC: Connection timed out
...
6836:S 08 May 11:43:10.162 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:43:10.162 * MASTER <-> SLAVE sync started
6836:S 08 May 11:43:17.883 # Start of election delayed for 886 milliseconds (rank #0, offset 2989077250425).
6836:S 08 May 11:43:17.979 # Currently unable to failover: Waiting the delay before I can start a new failover.
6836:S 08 May 11:43:18.855 # Starting a failover election for epoch 144.
6836:S 08 May 11:43:18.951 # Currently unable to failover: Waiting for votes, but majority still not reached.
6836:S 08 May 11:43:19.162 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:43:19.176 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:43:19.176 * MASTER <-> SLAVE sync started
6836:S 08 May 11:43:28.176 # Error condition on socket for SYNC: Connection timed out
...
6836:S 08 May 11:44:14.260 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:44:14.260 * MASTER <-> SLAVE sync started
6836:S 08 May 11:44:18.837 # Start of election delayed for 532 milliseconds (rank #0, offset 2989077250425).
6836:S 08 May 11:44:18.935 # Currently unable to failover: Waiting the delay before I can start a new failover.
6836:S 08 May 11:44:19.425 # Starting a failover election for epoch 145.
6836:S 08 May 11:44:19.522 # Currently unable to failover: Waiting for votes, but majority still not reached.
6836:S 08 May 11:44:23.260 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:44:23.273 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:44:23.273 * MASTER <-> SLAVE sync started

# First cluster failover force attempt
6836:S 08 May 11:44:28.307 # Forced failover user request accepted.
6836:S 08 May 11:44:32.273 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:44:32.278 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:44:32.278 * MASTER <-> SLAVE sync started
# The first forced failover timed out
6836:S 08 May 11:44:33.384 # Manual failover timed out.

# Second cluster failover force attempt
6836:S 08 May 11:44:37.955 # Forced failover user request accepted.
6836:S 08 May 11:44:41.280 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:44:41.292 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:44:41.292 * MASTER <-> SLAVE sync started
6836:S 08 May 11:44:42.976 # Manual failover timed out.

# Third cluster failover force attempt
6836:S 08 May 11:44:47.440 # Forced failover user request accepted.
6836:S 08 May 11:44:49.371 # Currently unable to failover: Failover attempt expired.
6836:S 08 May 11:44:50.293 # Error condition on socket for SYNC: Connection timed out
6836:S 08 May 11:44:50.312 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 11:44:50.312 * MASTER <-> SLAVE sync started
6836:S 08 May 11:44:52.473 # Manual failover timed out.

# cluster failover takeover is executed
6836:S 08 May 11:44:55.699 # Taking over the master (user request).
# configEpoch is set to 146
6836:S 08 May 11:44:55.699 # New configEpoch set to 146
# Discards the cached master state and starts serving as a master
6836:M 08 May 11:44:55.699 * Discarding previously cached master state.

# Other old masters from the crashed machine (whose shards failed over normally) are brought back up; after reconnecting we see:
6836:M 08 May 12:05:53.445 * Clear FAIL state for node 71f8ab3c35baceea3415ecc91797b5e6fe9b4136: master without slots is reachable again.
6836:M 08 May 12:13:30.911 * Clear FAIL state for node 6a82a479431cab234bb4b8330ca4dea2d409871e: master without slots is reachable again.

# This node's old master is brought back up; logs after they reconnect:
6836:M 08 May 13:09:30.309 * Clear FAIL state for node d84c1798cf3470cdbd3bd8a2261d59c117ff918e: master without slots is reachable again.
# The old master wins an election with a higher epoch and takes the master role back; this node becomes its replica again
6836:M 08 May 13:09:37.532 # Configuration change detected. Reconfiguring myself as a replica of d84c1798cf3470cdbd3bd8a2261d59c117ff918e
6836:S 08 May 13:09:38.051 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 13:09:38.051 * MASTER <-> SLAVE sync started
6836:S 08 May 13:09:38.051 * Non blocking connect for SYNC fired the event.
6836:S 08 May 13:09:38.053 * Master replied to PING, replication can continue...
6836:S 08 May 13:09:38.053 * Partial resynchronization not possible (no cached master)
6836:S 08 May 13:09:38.053 * Full resync from master: 731c4a47d04cf0eb2af608bd5ff442850086dbc9:1
6836:S 08 May 13:09:38.353 * MASTER <-> SLAVE sync: receiving 77105 bytes from master
6836:S 08 May 13:09:39.338 * MASTER <-> SLAVE sync: Flushing old data
6836:S 08 May 13:10:23.333 * MASTER <-> SLAVE sync: Loading DB in memory
6836:S 08 May 13:10:23.335 * MASTER <-> SLAVE sync: Finished with success

Log of the master on the crashed machine (node 13778):

168063:M 08 May 13:09:04.491 * Node configuration loaded, I'm d84c1798cf3470cdbd3bd8a2261d59c117ff918e
168063:M 08 May 13:09:04.493 # Server started, Redis version 3.0.7
# The start command was run repeatedly; warning logs, can probably be ignored?
168096:M 08 May 13:09:04.900 # Creating Server TCP listening socket *:13778: bind: Address already in use
168545:M 08 May 13:09:24.045 # Creating Server TCP listening socket *:13778: bind: Address already in use
# The instance starts successfully and serves on port 13778
168063:M 08 May 13:09:30.299 * DB loaded from disk: 25.807 seconds
168063:M 08 May 13:09:30.300 * The server is now ready to accept connections on port 13778
# Prepares to become a replica of node 13808
168063:M 08 May 13:09:30.303 # Configuration change detected. Reconfiguring myself as a replica of 56613f3183ce8349dc5f502084776d7c4cf8bb81
# This kind of log line is uncommon
168063:S 08 May 13:09:30.306 # Cluster state changed: ok

# This line is odd !!
168063:S 08 May 13:09:30.399 # Start of election delayed for 769 milliseconds (rank #0, offset 0).
168063:S 08 May 13:09:30.495 # Currently unable to failover: Waiting the delay before I can start a new failover.

# Why can it start an election at epoch 147, and without any disconnection having occurred ???
168063:S 08 May 13:09:31.263 # Starting a failover election for epoch 147.
168063:S 08 May 13:09:31.267 # Currently unable to failover: Waiting for votes, but majority still not reached.
168063:S 08 May 13:09:31.267 # Failover election won: I'm the new master.
168063:S 08 May 13:09:31.267 # configEpoch set to 147 after successful failover
168063:M 08 May 13:09:38.053 * Slave 10.142.1.15:13808 asks for synchronization
168063:M 08 May 13:09:38.053 * Full resync requested by slave 10.142.1.15:13808
168063:M 08 May 13:09:38.053 * Starting BGSAVE for SYNC with target: disk
168063:M 08 May 13:09:38.053 * Background saving started by pid 168563
168563:C 08 May 13:09:38.340 * DB saved on disk
168563:C 08 May 13:09:38.340 * RDB: 2 MB of memory used by copy-on-write
168063:M 08 May 13:09:38.353 * Background saving terminated with success
168063:M 08 May 13:09:38.353 * Synchronization with slave 10.142.1.15:13808 succeeded

3.2 Key Log Lines

Log of the replica that failed to fail over normally (node 13808):

# cluster failover takeover is executed
6836:S 08 May 11:44:55.699 # Taking over the master (user request).
# configEpoch is set to 146
6836:S 08 May 11:44:55.699 # New configEpoch set to 146
# Discards the cached master state and starts serving as a master
6836:M 08 May 11:44:55.699 * Discarding previously cached master state.
...

# Reconnected with this node's old master, but from this line it looks as if the replica does not recognize it as its former master. Is that caused by the takeover? (Answer: even without a takeover, the same line appears when the old master comes back; this is normal)
6836:M 08 May 13:09:30.309 * Clear FAIL state for node d84c1798cf3470cdbd3bd8a2261d59c117ff918e: master without slots is reachable again.
# The old master wins an election with a higher epoch and takes the master role back; this node becomes its replica again
6836:M 08 May 13:09:37.532 # Configuration change detected. Reconfiguring myself as a replica of d84c1798cf3470cdbd3bd8a2261d59c117ff918e
6836:S 08 May 13:09:38.051 * Connecting to MASTER 10.142.1.13:13778
6836:S 08 May 13:09:38.051 * MASTER <-> SLAVE sync started

宕机机器上主库(13778节点)的日志:

# Prepares to become a replica of node 13808
168063:M 08 May 13:09:30.303 # Configuration change detected. Reconfiguring myself as a replica of 56613f3183ce8349dc5f502084776d7c4cf8bb81
# Successfully switches itself into replica state
168063:S 08 May 13:09:30.306 # Cluster state changed: ok

# These lines are odd: normally the node should now start syncing data from its master
# Without receiving any failure reports about its master from the other cluster nodes, it goes straight into a delayed election
168063:S 08 May 13:09:30.399 # Start of election delayed for 769 milliseconds (rank #0, offset 0).
168063:S 08 May 13:09:30.495 # Currently unable to failover: Waiting the delay before I can start a new failover.

# Why can it start an election at epoch 147, and without any disconnection having occurred ???
168063:S 08 May 13:09:31.263 # Starting a failover election for epoch 147.
168063:S 08 May 13:09:31.267 # Currently unable to failover: Waiting for votes, but majority still not reached.
168063:S 08 May 13:09:31.267 # Failover election won: I'm the new master.
168063:S 08 May 13:09:31.267 # configEpoch set to 147 after successful failover
168063:M 08 May 13:09:38.053 * Slave 10.142.1.15:13808 asks for synchronization

Based on the log analysis above, we had the following questions and guesses.

3.3 Questions and Guesses

  1. Why did the old master 13778, once brought back up, start its election with the larger epoch 147 rather than the earlier, smaller epoch 146?? Was it because of the operation below on the old replica 13808, and could that operation even have been propagated to this node?
6836:S 08 May 11:44:55.699 # New configEpoch set to 146
  2. Was it because the old replica could not communicate with the other cluster nodes, so the failover could not happen, and the communication was still broken when the old master came back up?
  3. Is it because a node that has been taken over skips the normal election and initiates one directly, so that after exchanging messages with the other nodes it learned the previous epoch was 146 and simply used 146+1 for its election?
  4. Is it because too many of this cluster's masters sat on the same machine, so that leader election was disrupted during the outage and the replicas could not fail over in time?
  5. A takeover lost data once before as well; is the cause the same?

We investigated these questions one by one.

4. Root-Cause Investigation

4.1 Reviewing the Documentation

4.1.1 Inter-node Communication

Redis Cluster nodes communicate with a gossip protocol: in each round a node only exchanges messages with part of the cluster, and the cluster converges to eventual consistency.

How nodes choose whom to communicate with, and how much data each message carries

From the node-selection procedure, the cost of message exchange mainly comes down to:

  • the number of nodes each node sends messages to per unit of time
  • the amount of data carried by each message

Each node in the cluster runs a timer task, 10 times per second by default. Every second it randomly samples 5 nodes and pings the one it has not communicated with for the longest time, which keeps the gossip exchange random. Every 100 milliseconds it also scans its local node list, and if the last pong received from a node is older than cluster_node_timeout/2 it pings that node immediately, so that the node's information does not stay stale for too long.
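The timeout that drives this ping threshold can be read at runtime; a minimal sketch, assuming we ask node 13808 from this incident:

# cluster-node-timeout controls failure detection and the ping threshold described above
redis-cli -p 13808 config get cluster-node-timeout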

4.1.2 Manual Failover
  • cluster failover force: used when the master is down and automatic failover cannot complete. A replica that receives cluster failover force starts an election immediately, without first confirming the replication offset with its master (any writes the replica has not yet replicated are lost). Once the replica wins the election, it replaces its master and broadcasts the new cluster configuration.
  • cluster failover takeover: used when more than half of the cluster's masters are down, so a replica can never collect votes from a majority of masters and an election cannot succeed. cluster failover takeover forces the switch: the replica that receives it skips the election entirely, bumps its local config epoch, and replaces its master. Because a takeover is not driven by a leader election, config-epoch conflicts are possible; when a conflict happens, the configuration of the node with the lexicographically larger nodeId wins. So be careful about the conflicts a manual takeover can cause after a cluster partition (example invocations below).
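Both variants are issued on the replica that should be promoted; a minimal sketch, using 13808, the replica from this incident:

# force a failover without confirming the replication offset with the master
redis-cli -p 13808 cluster failover force

# if a majority of masters is unreachable and force keeps timing out, take over without an election
redis-cli -p 13808 cluster failover takeover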
4.1.3 Main roles of the config epoch:
  • It identifies the configuration version of each master and the highest version currently known in the cluster.
  • Every important cluster event increments the cluster-wide config epoch and assigns it to the master involved, recording the event; "important event" here means a new master appearing (newly added or promoted from a replica) or a replica competing in a failover election.
  • A master with a larger config epoch represents newer cluster state, so when nodes exchange ping/pong messages and key information such as slot ownership disagrees, the side with the larger config epoch wins; this keeps stale state from polluting the cluster (a quick way to inspect these epochs is shown below).
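A sketch, again assuming node 13808:

# cluster_current_epoch is the cluster-wide maximum; cluster_my_epoch is this node's own config epoch
redis-cli -p 13808 cluster info | grep epoch
# in each cluster nodes line, the number right before connected/disconnected is that node's config epoch
redis-cli -p 13808 cluster nodes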

4.2 Testing on a Same-Version Cluster

4.2.1 Setup

We deployed a three-master/three-replica cluster across 3 machines, deliberately placing two masters on the same machine, raised the Redis log level to debug, and then shut down the machine holding the two masters to run the test.
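A minimal sketch of that preparation, using the test instances listed below; persisting the level with config rewrite is an assumption (it could equally be set in the config files):

# raise the log level on the two nodes whose logs are analysed below
redis-cli -h 10.172.18.24 -p 40090 config set loglevel debug
redis-cli -h 10.172.18.24 -p 40090 config rewrite
redis-cli -h 10.172.18.24 -p 40093 config set loglevel debug
redis-cli -h 10.172.18.24 -p 40093 config rewrite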

4.2.2 Test Procedure

Initial master/replica layout of the cluster:

[01][SLAVE=>MASTER]    10.172.18.25:40089  =>  10.172.18.26:40092
[02][SLAVE=>MASTER]    10.172.18.25:40091  =>  10.172.18.24:40090
[03][SLAVE=>MASTER]    10.172.18.26:40088  =>  10.172.18.24:40093

After shutting down machine 10.172.18.24, cluster nodes reported the following:

d6756cd48a6dc5664edda04845ad78fad7f95aec 10.172.18.25:40089 slave 78709e95cdec313e604505456111963ff1b0870b 0 1620961672775 8 connected
78709e95cdec313e604505456111963ff1b0870b 10.172.18.26:40092 myself,master - 0 0 8 connected 5461-10922
5faa42637410b273e3c327563f41566a6da5739d 10.172.18.24:40090 master,fail? - 1620961583314 1620961583024 3 connected 10923-16383
1dd07f03e17cb9ba4bec21e58b788c56bc80b1da 10.172.18.24:40093 master,fail? - 1620961583314 1620961578202 9 connected 0-5460
abd99dbc0fdf98276f73db0ba02936cc05be01c4 10.172.18.26:40088 slave 1dd07f03e17cb9ba4bec21e58b788c56bc80b1da 0 1620961674703 9 connected
63a2b428432082f06907b68b5e47ba3488bfa64f 10.172.18.25:40091 slave 5faa42637410b273e3c327563f41566a6da5739d 0 1620961673739 9 connected

At this point the two masters on that machine were both marked fail, no replica could win a majority vote to take over, and the cluster state itself was fail.

We then ran cluster failover takeover on node 40091; a moment later node 40088 failed over automatically as well, and the cluster state returned to ok.

4.2.3 Log Analysis

Note: during the test only nodes 40090 and 40093 had their log level raised to debug, so node 40091's log feels a bit thin, but that does not really affect the analysis.

Log of node 40090, the master that was taken over:

39710:M 14 May 11:27:26.188 * Node configuration loaded, I'm 5faa42637410b273e3c327563f41566a6da5739d
39710:M 14 May 11:27:26.189 # Server started, Redis version 3.0.7
39710:M 14 May 11:27:26.190 * DB loaded from disk: 0.000 seconds
# RDB loaded; ready to serve on port 40090
39710:M 14 May 11:27:26.190 * The server is now ready to accept connections on port 40090
39710:M 14 May 11:27:26.206 - Accepted cluster node 10.172.18.26:56711
39710:M 14 May 11:27:26.206 - 0 clients connected (0 slaves), 1171904 bytes in use
# Starts connecting to the other cluster nodes
39710:M 14 May 11:27:26.206 . Connecting with Node 63a2b428432082f06907b68b5e47ba3488bfa64f at 10.172.18.25:50091
39710:M 14 May 11:27:26.206 . Connecting with Node abd99dbc0fdf98276f73db0ba02936cc05be01c4 at 10.172.18.26:50088
39710:M 14 May 11:27:26.206 . Connecting with Node d6756cd48a6dc5664edda04845ad78fad7f95aec at 10.172.18.25:50089
39710:M 14 May 11:27:26.206 . Connecting with Node 78709e95cdec313e604505456111963ff1b0870b at 10.172.18.26:50092
39710:M 14 May 11:27:26.206 . Connecting with Node 1dd07f03e17cb9ba4bec21e58b788c56bc80b1da at 10.172.18.24:50093
# Prepares to become a replica of node 40091
39710:M 14 May 11:27:26.207 # Configuration change detected. Reconfiguring myself as a replica of 63a2b428432082f06907b68b5e47ba3488bfa64f
# State change succeeded
39710:S 14 May 11:27:26.207 # Cluster state changed: ok
39710:S 14 May 11:27:26.227 - Accepted cluster node 10.172.18.25:60560
39710:S 14 May 11:27:26.233 - Accepted cluster node 10.172.18.26:56712
39710:S 14 May 11:27:26.243 - Accepted cluster node 10.172.18.25:60561
39710:S 14 May 11:27:26.271 - Accepted cluster node 10.172.18.24:49756
39710:S 14 May 11:27:27.203 * Connecting to MASTER 10.172.18.25:40091
# Data synchronization starts
39710:S 14 May 11:27:27.203 * MASTER <-> SLAVE sync started
39710:S 14 May 11:27:27.203 * Non blocking connect for SYNC fired the event.
39710:S 14 May 11:27:27.203 * Master replied to PING, replication can continue...
39710:S 14 May 11:27:27.203 * Partial resynchronization not possible (no cached master)
# Receives the dataset sent by the new master 40091
39710:S 14 May 11:27:27.205 * Full resync from master: 69b222f83fdea04746388110d9ae2e6f8d2704c1:1
39710:S 14 May 11:27:27.289 * MASTER <-> SLAVE sync: receiving 2516884 bytes from master
39710:S 14 May 11:27:27.299 * MASTER <-> SLAVE sync: Flushing old data
39710:S 14 May 11:27:27.299 * MASTER <-> SLAVE sync: Loading DB in memory
# Master-replica sync succeeded
39710:S 14 May 11:27:27.473 * MASTER <-> SLAVE sync: Finished with success

Log of node 40091, which executed the takeover:

# Detects master 40090, now without slots, and clears its FAIL state (40091's stored view of 40090 used to have slots; the slot info gets updated as they exchange messages)
15634:M 14 May 11:27:26.244 * Clear FAIL state for node 5faa42637410b273e3c327563f41566a6da5739d: master without slots is reachable again.
# Once the link is up, the new replica 40090 requests a data sync
15634:M 14 May 11:27:27.204 * Slave 10.172.18.24:40090 asks for synchronization
15634:M 14 May 11:27:27.204 * Full resync requested by slave 10.172.18.24:40090
15634:M 14 May 11:27:27.204 * Starting BGSAVE for SYNC with target: disk
15634:M 14 May 11:27:27.205 * Background saving started by pid 79260
79260:C 14 May 11:27:27.280 * DB saved on disk
79260:C 14 May 11:27:27.281 * RDB: 2 MB of memory used by copy-on-write
15634:M 14 May 11:27:27.290 * Background saving terminated with success
# Master-replica data sync succeeded
15634:M 14 May 11:27:27.298 * Synchronization with slave 10.172.18.24:40090 succeeded

Node 40093 logged the following:

# 5faa is node 40090, the master that was taken over; 63a2 is node 40091, the replica that executed the takeover
# 40093 sees that the slot configuration announced by 5faa carries a smaller epoch than the one it has recorded, so it sends an UPDATE about 63a2, whose record of those slots carries the larger epoch
32704:S 14 May 11:27:26.207 - Node 5faa42637410b273e3c327563f41566a6da5739d has old slots configuration, sending an UPDATE message about 63a2b428432082f06907b68b5e47ba3488bfa64f
32704:S 14 May 11:27:26.271 . Connecting with Node 5faa42637410b273e3c327563f41566a6da5739d at 10.172.18.24:50090
32704:S 14 May 11:27:26.271 * Clear FAIL state for node 5faa42637410b273e3c327563f41566a6da5739d: master without slots is reachable again.
32704:S 14 May 11:27:26.436 - Node 63a2b428432082f06907b68b5e47ba3488bfa64f reported node 5faa42637410b273e3c327563f41566a6da5739d is back online.
32704:S 14 May 11:27:30.030 - Node 78709e95cdec313e604505456111963ff1b0870b reported node 5faa42637410b273e3c327563f41566a6da5739d is back online.
32704:S 14 May 11:27:30.319 - DB 0: 61827 keys (332 volatile) in 65536 slots HT.
32704:S 14 May 11:27:30.319 - 41 clients connected (0 slaves), 15433296 bytes in use
32704:S 14 May 11:27:35.203 - Node abd99dbc0fdf98276f73db0ba02936cc05be01c4 reported node 5faa42637410b273e3c327563f41566a6da5739d is back online.

These logs differ somewhat from the ones in our broken cluster. The entries from 40090 and 40091 are easy to follow, but the debug lines on node 40093 are rarely seen and worth a closer look.

4.2.4 Tracing the Logs Back to the Source Code
4.2.4.1 has old slots configuration

Searching the source for "has old slots configuration" leads to the following logic:

int clusterProcessPacket(clusterLink *link) {

    ...
    /* PING, PONG, MEET: process config information. */
    if (type == CLUSTERMSG_TYPE_PING || type == CLUSTERMSG_TYPE_PONG ||
        type == CLUSTERMSG_TYPE_MEET)
    {
        ...
        
        /* 2) We also check for the reverse condition, that is, the sender
         *    claims to serve slots we know are served by a master with a
         *    greater configEpoch. If this happens we inform the sender.
         *
         * This is useful because sometimes after a partition heals, a
         * reappearing master may be the last one to claim a given set of
         * hash slots, but with a configuration that other instances know to
         * be deprecated. Example:
         *
         * A and B are master and slave for slots 1,2,3.
         * A is partitioned away, B gets promoted.
         * B is partitioned away, and A returns available.
         *
         * Usually B would PING A publishing its set of served slots and its
         * configEpoch, but because of the partition B can't inform A of the
         * new configuration, so other nodes that have an updated table must
         * do it. In this way A will stop to act as a master (or can try to
         * failover if there are the conditions to win the election). */
        if (sender && dirty_slots) {
            int j;

            for (j = 0; j < REDIS_CLUSTER_SLOTS; j++) {

                // Check whether slot j is claimed in the sender's announced slot bitmap
                if (bitmapTestBit(hdr->myslots,j)) {

                    // Skip slot j if we already think it is served by the sender,
                    // or if we think it is unassigned
                    if (server.cluster->slots[j] == sender ||
                        server.cluster->slots[j] == NULL) continue;

                    // The master we record for slot j has a configEpoch greater than the sender's
                    if (server.cluster->slots[j]->configEpoch >
                        senderConfigEpoch)
                    {
                        redisLog(REDIS_VERBOSE,
                            "Node %.40s has old slots configuration, sending "
                            "an UPDATE message about %.40s",
                                sender->name, server.cluster->slots[j]->name);

                        // Send the sender an UPDATE message about the node we believe owns slot j
                        clusterSendUpdate(sender->link,
                            server.cluster->slots[j]);

                        /* TODO: instead of exiting the loop send every other
                         * UPDATE packet for other nodes that are the new owner
                         * of sender's slots. */
                        break;
                    }
                }
            }
        }
    }
}

In the code comment, A corresponds to 40090, the old master that was taken over in our scenario, and B to 40091, the new master that executed the takeover.

In the test, node 40093 noticed that the slot configuration announced by 40090 was stale, so it sent 40090 the newer slot configuration owned by 40091; on receiving it, 40090 updated its own state. Nothing wrong there.

4.2.4.2 is back online

The source code behind the "is back online" log line is as follows:

void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
    // Number of node entries carried in this gossip section
    uint16_t count = ntohs(hdr->count);

    // Point at the first node entry
    clusterMsgDataGossip *g = (clusterMsgDataGossip*) hdr->data.ping.gossip;

    // Look up the sender
    clusterNode *sender = link->node ? link->node : clusterLookupNode(hdr->sender);

    // Iterate over every node entry in the message
    while(count--) {
        ...
        // We already know this node locally
        if (node) {
            /* We already know this node.
               Handle failure reports, only when the sender is a master. */
            // Failure reports are only processed when the sender is a master
            if (sender && nodeIsMaster(sender) && node != myself) {
                // The entry flags the node as FAIL or PFAIL
                if (flags & (REDIS_NODE_FAIL|REDIS_NODE_PFAIL)) {
                    // Record the sender's failure report about this node
                    if (clusterNodeAddFailureReport(node,sender)) {
                        redisLog(REDIS_VERBOSE,
                            "Node %.40s reported node %.40s as not reachable.",
                            sender->name, node->name);
                    }

                    // Try to mark the node as FAIL if enough reports have accumulated
                    markNodeAsFailingIfNeeded(node);
                
                // The entry flags the node as healthy
                } else {

                    // If the sender previously filed a failure report about this node,
                    // clear that report
                    if (clusterNodeDelFailureReport(node,sender)) {
                        redisLog(REDIS_VERBOSE,
                            "Node %.40s reported node %.40s is back online.",
                            sender->name, node->name);
                    }
                }
            }
        }
    }
}

So the "is back online" lines simply record the other nodes clearing their failure reports and marking node 40090 as reachable again.

After testing on a same-version cluster and reading the source code, we confirmed that the takeover operation itself was not at fault and that, under normal conditions, a takeover does not lose data. That pointed the suspicion back at inter-node network communication.

4.3 Confirming the Network Problem

4.3.1 The Anomaly

Looking at info replication on node 13778, the replication info for replica 13808 was fine, with state=online, yet cluster nodes showed node 13808 in the fail state!!

Why this odd combination? Nodes 13778 and 13808 had clearly completed a data sync, so why was 13808's cluster state still fail?
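The two views can be compared directly on node 13778; a sketch with the ports from this incident:

# replication view: the replica shows up with state=online
redis-cli -p 13778 info replication
# cluster view: the very same node is still flagged fail
redis-cli -p 13778 cluster nodes | grep 13808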

4.3.2 Cross-checking on the Other Nodes

We then checked node 13808's cluster state as seen from the other nodes in the cluster.

Every one of them showed fail!!! No wonder there was no automatic failover and the master role was later snatched away: all the other nodes believed node 13808 was in the fail state, so none of them would ever vote for it.

So why was node 13808 unable to communicate normally over the cluster bus?

4.3.3 Investigating the Cluster-Bus Communication
4.3.3.1 Connections on the cluster bus port

On the machine hosting node 13808, we used ss -an to check the state of the port's connections.

Note: 13799 is another replica on the same machine.
The output showed that node 13808's cluster bus port 23808 had only a single LISTEN socket and clearly could not serve traffic, whereas node 13799's bus port 23799 had plenty of established connections and could talk to the other nodes' bus ports normally.
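For reference, the check looked roughly like this (the cluster bus port is the client port + 10000, i.e. 23808 for node 13808 and 23799 for node 13799):

# 23808 shows only a LISTEN socket, while 23799 has many established peers
ss -an | grep 23808
ss -an | grep 23799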

4.3.3.2 Confirming with a packet capture

We then captured packets with tcpdump and looked at the traffic on port 23808.

The capture contained a large number of packets with Flags [R.], i.e. [RST, ACK]; we inspected the details in wireshark as well.

Packets like these mean the connections were being reset and torn down abnormally, confirming that port 23808 really had stopped communicating for some reason.
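A capture like that can be reproduced with something along these lines (a sketch; 23808 is node 13808's cluster bus port):

# dump cluster-bus traffic for node 13808 on any interface, without name resolution
tcpdump -i any -nn port 23808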

4.4 Answers to the Earlier Questions

With the analysis above, the earlier guesses and questions can now be answered.

  1. Why did the old master 13778, once brought back up, start its election with the larger epoch 147 rather than the earlier, smaller epoch 146?? Was it because of the operation below on the old replica 13808, and could that operation even have been propagated to this node?
6836:S 08 May 11:44:55.699 # New configEpoch set to 146
  • This most likely did reach the restarted old master: the rest of the cluster had already accepted that the current epoch was 146, and the old master does exchange messages with those other nodes after it comes back up.
  2. Was it because the old replica could not communicate with the other cluster nodes, so the failover could not happen, and the communication was still broken when the old master came back up?
  • Yes. Node 13808's cluster bus port could not communicate normally, which is exactly why there was no automatic failover and why the master role was later taken away from it.
  3. Is it because a node that has been taken over skips the normal election and initiates one directly, so that after exchanging messages with the other nodes it learned the previous epoch was 146 and simply used 146+1 for its election?
  • No. Reproducing on a same-version cluster showed that normally, when a machine crashes, a takeover is performed, and the crashed instance is later brought back up, that instance does not start an election with a higher epoch; it becomes a replica of the node that executed the takeover and starts syncing data from it, so no data is lost.
  4. Is it because too many of this cluster's masters sat on the same machine, so that leader election was disrupted during the outage and the replicas could not fail over in time?
  • That is not the root cause, but packing many masters onto one machine should still be avoided; it makes the cluster hard to operate.
  5. A takeover lost data once before as well; is the cause the same?
  • No. In that earlier incident, the replica that needed to initiate the failover had never been recognized by the other nodes at all (the cluster_known_nodes metric on the other nodes was one node short, and that node's node_id never appeared in their redis.log).

5. Summary

5.1 Follow-up Actions

  1. Shut down instance 13808.
  2. Delete its nodes-13808.conf file.
  3. Restart instance 13808 (its node id changes at this point; on the other cluster nodes, the state recorded for 13808's old node id becomes ":0 slave,fail,noaddr").
  4. Meet node 13808 back into the cluster: redis-cli -p 13808 cluster meet 10.142.1.13 13778 (after this, cluster nodes no longer shows 13808 as fail, although it has rejoined the cluster under a new identity, i.e. a new node id). A consolidated sketch of steps 1-4 follows below.
[root@xxx redis_13778]# redis-cli -p 13778 cluster nodes|grep d84c1
d84c1798cf3470cdbd3bd8a2261d59c117ff918e xxx:13778 myself,master - 0 0 147 connected 3823-4368
56613f3183ce8349dc5f502084776d7c4cf8bb81 :0 slave,fail,noaddr d84c1798cf3470cdbd3bd8a2261d59c117ff918e 1620450544491 1620450544491 147 disconnected
0bb7e64bad8285febf97b966aae0124507531d0a xxx:13808 slave d84c1798cf3470cdbd3bd8a2261d59c117ff918e 0 1621238789158 147 connected

Note: before the meet, 13808 had already synced up with 13778, so the cluster replicate command was not executed. (This is also worth analysing and confirming.)
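A consolidated sketch of the recovery steps above; the data directory and config file name are assumptions, and shutdown nosave is used because the node will resync from its master anyway:

redis-cli -p 13808 shutdown nosave                    # 1. stop the instance
rm /data/redis_13808/nodes-13808.conf                 # 2. remove the old cluster state, forcing a new node id
redis-server /data/redis_13808/redis_13808.conf       # 3. restart the instance
redis-cli -p 13808 cluster meet 10.142.1.13 13778     # 4. meet it back into the cluster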

5.2 Emergency Measures

If this situation comes up again, the following can help:

  • Restore from an off-site backup
  • With dual datacenters where the clusters have the same sharding and slot layout, the RDB files from one datacenter can be used to restore the other

5.3 Lessons Learned

  • When locating a problem, survey it from several angles first and then drill down step by step; do not dive too deep right away and get stuck on a single track
  • This problem should have been fairly easy to spot, but because we were not familiar enough with Redis Cluster internals we were misled by some of the logs and took a few detours along the way
  • On the upside, it was a good opportunity to study Redis Cluster internals in more depth

5.4 Improvements

  • Do not concentrate a large cluster on just a few machines
  • Even if a failover unfortunately fails and a takeover is used, back up the data before bringing the taken-over old master back up; that way, even if data is lost, most of it can still be recovered (a sketch of this backup step follows after this list)
  • Patch the source so that before a full master-replica sync flushes the old dataset, dump.rdb is first moved aside, and dump.rdb.bak is only deleted once the new data is confirmed to be good
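A possible form of that backup, assuming the taken-over old master is node 13778 and its data directory is /data/redis_13778 (both assumptions):

# before restarting the old master on the crashed machine, keep a timestamped copy of its last RDB
cp /data/redis_13778/dump.rdb /data/redis_13778/dump.rdb.$(date +%Y%m%d%H%M%S)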