1. Introduction to Sentinel
1.1 Sentinel cluster: overview and functions
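As a rule, to keep something running reliably we set up something else to watch over it, much like the Eastern Depot and the Embroidered Uniform Guard of the Ming dynasty. So, to keep redis running stably and to recover quickly when something goes wrong, we have Sentinel. It watches the redis deployment: when the master fails, Sentinel replaces it by promoting a slave to master; once the old master has restarted, it is attached to the new master as a slave.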
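One sentinel is certainly not enough. Sentinel is itself distributed: it runs as a sentinel cluster whose members cooperate. A sentinel can be viewed as a special redis instance that does nothing except monitor and replace the master, the Eastern Depot of the Ming dynasty 🤣🤣🤣. It can be deployed on the same machine as redis, that is, as several instances on one machine listening on different ports.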
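What if a single sentinel goes down? That is why there must be several. Even if it stays up, a wrong judgment on its part would be a disaster, so running several sentinels lowers the probability of a system failure. If several of them fail at the same time, well... we accept our fate. The odds are probably similar to choking to death on a sip of water: tiny, but not zero.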
Sentinel (哨兵 in Chinese) is a very important component of the redis cluster architecture. Its main functions are:
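- (1) Cluster monitoring: checks whether the redis master and slave processes are working properly.
- (2) Notification: if a redis instance fails, Sentinel sends an alert message to the administrator.
- (3) Failover: if the master node goes down, service is automatically switched over to a slave node.
- (4) Configuration center: when a failover occurs, clients are told the address of the new master.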
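1.2 Sentinel monitoring tasks
Sentinel has three main routine tasks:
- Fetching the latest topology
- Publishing and subscribing to sentinel information and state
- Heartbeat detection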
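1.2.1 Fetching the latest topology
Every 10 seconds, each sentinel sends the info command to the master node and the slave nodes to obtain the latest topology, that is, the master-slave relationships between them. When configuring Sentinel we only need to tell it to monitor the master node: sending info to the master also returns information about its slave nodes, and newly added slaves are discovered the same way.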
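To make the discovery step concrete, here is a minimal sketch of how slave addresses could be pulled out of the master's INFO reply. It assumes the `slaveN:ip=...,port=...` line layout that a master reports in its replication section; the `replica_addr` type and `parse_replicas` function are hypothetical names for illustration, not Sentinel's own parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one slave discovered from INFO output. */
typedef struct {
    char ip[64];
    int port;
} replica_addr;

/* Scan INFO text line by line and collect "slaveN:ip=...,port=..." entries.
 * Returns the number of slaves stored in out (at most max_out). */
static int parse_replicas(const char *info, replica_addr *out, int max_out) {
    assert(info != NULL && out != NULL);
    int count = 0;
    const char *line = info;
    while (*line != '\0' && count < max_out) {
        int idx;
        replica_addr r;
        /* %63[^,] reads the ip up to the next comma without overflowing r.ip. */
        if (sscanf(line, "slave%d:ip=%63[^,],port=%d", &idx, r.ip, &r.port) == 3) {
            out[count++] = r;
        }
        /* Advance to the start of the next line, if there is one. */
        const char *next = strchr(line, '\n');
        if (next == NULL) break;
        line = next + 1;
    }
    return count;
}
```

Lines that do not match the pattern (for example `role:master` or `connected_slaves:2`) are simply skipped, so a slave that appears in a later INFO reply shows up as one more entry in the result.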
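1.2.2 Publishing and subscribing to sentinel information and state
Every two seconds, each sentinel publishes its own information, together with its judgment of each node, to a designated channel on redis. All sentinels subscribe to this channel, and through it they learn about the other sentinels and about how those sentinels currently see the redis nodes. This is implemented with publish/subscribe, which saves resources.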
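As an illustration of what such a published message might look like, the sketch below builds a comma-separated payload. The field layout (sentinel ip, port, run id, current epoch, then master name, ip, port, config epoch) follows the hello message described in the Redis Sentinel documentation for the `__sentinel__:hello` channel; the function name `format_hello` and its parameters are hypothetical, and this is not a copy of Sentinel's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a hello payload into buf. Returns the number of characters written,
 * or -1 if the buffer is too small. All names here are illustrative. */
static int format_hello(char *buf, size_t len,
                        const char *sentinel_ip, int sentinel_port,
                        const char *sentinel_runid,
                        unsigned long long current_epoch,
                        const char *master_name, const char *master_ip,
                        int master_port,
                        unsigned long long master_config_epoch) {
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%s,%d,%s,%llu,%s,%s,%d,%llu",
                     sentinel_ip, sentinel_port, sentinel_runid, current_epoch,
                     master_name, master_ip, master_port, master_config_epoch);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= len) return -1;
    return n;
}
```

A subscriber that splits the payload on commas learns both who the sender is and which master address the sender currently believes in.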
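1.2.3 Heartbeat detection
Every second, each sentinel sends a ping command to the master node, the slave nodes, and the other sentinel nodes as a heartbeat check to determine whether they are healthy.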
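2. What the Sentinel tasks do, seen from the source code
2.1 The sentinel timer
Now let's turn to the source. In server.c there is a function serverCron(); when the server is running as a sentinel, it runs a sentinel timer that performs periodic work.
The timer does the following: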
void sentinelTimer(void) {
    // Record call times and check whether the conditions for TILT mode are met
    sentinelCheckTiltCondition();
    /**
     * Main logic: run the periodic tasks
     * 1. PING, and analyze the INFO output of the master and slaves
     * 2. Publish this sentinel's information and subscribe to the other servers' information
     * 3. Perform failover, and so on
     */
    sentinelHandleDictOfRedisInstances(sentinel.masters);
    // Everything below deals with scripts
    // Run the pending scripts
    sentinelRunPendingScripts();
    // Collect the scripts that have terminated
    sentinelCollectTerminatedScripts();
    // Kill the scripts that have timed out
    sentinelKillTimedoutScripts();
    /* We continuously change the frequency of the Redis "timer interrupt"
     * in order to desynchronize every Sentinel from every other.
     * This non-determinism avoids that Sentinels started at the same time
     * exactly continue to stay synchronized asking to be voted at the
     * same time again and again (resulting in nobody likely winning the
     * election because of split brain voting). */
    server.hz = CONFIG_DEFAULT_HZ + rand() % CONFIG_DEFAULT_HZ;
}
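The last line of the timer is easy to overlook, so here is a small standalone sketch of what it does to the timer frequency. It assumes CONFIG_DEFAULT_HZ is 10 (the Redis default); `next_timer_hz` and `timer_period_ms` are helper names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define CONFIG_DEFAULT_HZ 10 /* assumed default timer frequency */

/* Same expression as the last line of sentinelTimer(): with a default of 10,
 * the result is a frequency between 10 and 19 calls per second. */
static int next_timer_hz(void) {
    return CONFIG_DEFAULT_HZ + rand() % CONFIG_DEFAULT_HZ;
}

/* Approximate delay between two timer calls, in milliseconds. */
static int timer_period_ms(int hz) {
    assert(hz > 0);
    return 1000 / hz;
}
```

Under that assumption the gap between timer calls drifts between roughly 52 ms and 100 ms, so two sentinels started at the same moment quickly fall out of step with each other.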
2.1.1 The sentinelCheckTiltCondition() function
So what does the first call, sentinelCheckTiltCondition, do? And what is TILT mode?
When in doubt, look it up: https://redis.io/topics/sentinel
TILT mode
Redis Sentinel is heavily dependent on the computer time: for instance in order to understand if an instance is available it remembers the time of the latest successful reply to the PING command, and compares it with the current time to understand how old it is.
However if the computer time changes in an unexpected way, or if the computer is very busy, or the process blocked for some reason, Sentinel may start to behave in an unexpected way.
The TILT mode is a special “protection” mode that a Sentinel can enter when something odd is detected that can lower the reliability of the system. The Sentinel timer interrupt is normally called 10 times per second, so we expect that more or less 100 milliseconds will elapse between two calls to the timer interrupt.
What a Sentinel does is to register the previous time the timer interrupt was called, and compare it with the current call: if the time difference is negative or unexpectedly big (2 seconds or more) the TILT mode is entered (or if it was already entered the exit from the TILT mode postponed).
When in TILT mode the Sentinel will continue to monitor everything, but:
- It stops acting at all.
- It starts to reply negatively to SENTINEL is-master-down-by-addr requests as the ability to detect a failure is no longer trusted.
If everything appears to be normal for 30 second, the TILT mode is exited.
Note that in some way TILT mode could be replaced using the monotonic clock API that many kernels offer. However it is not still clear if this is a good solution since the current system avoids issues in case the process is just suspended or not executed by the scheduler for a long time.
Restated in plain terms: