Redis Cluster Analysis (30)

Latest recommended article published on 2022-06-21 10:29:33

huserblog

Tags: redis nosql

Article link: https://blog.csdn.net/qq_39210987/article/details/111145814

1. Leader Election

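Part (29) covered how the subjective-down and objective-down states are detected, but the analysis of objective down skipped how Sentinels exchange state with one another. That exchange uses the same mechanism as leader election, so it is covered in this part.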

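Sentinels elect a leader with the Raft algorithm, so a brief introduction to Raft's election is needed first. Raft assigns each server one of three roles: Leader, Follower, or Candidate. A Candidate exists only while an election is in progress. An epoch numbers the elections (Raft itself calls this a term): the first election has epoch 1, the second has epoch 2, and so on. Sentinels follow the same voting rule: within one epoch a Sentinel votes for only one leader, namely the first candidate whose request it receives.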
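
The voting rule just described (one vote per epoch, given to the first candidate that asks) can be sketched in a few lines of C. This is a minimal illustration only: the toy_voter struct and toy_request_vote function are hypothetical names invented for the sketch and are not part of the Redis source.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified voter state. This is NOT the Redis struct. */
typedef struct {
    uint64_t voted_epoch;  /* epoch of the last vote cast, 0 = never voted */
    char voted_for[41];    /* id of the candidate chosen in that epoch */
} toy_voter;

/* Cast a vote, or recall the one already cast.
 * The first candidate that asks in a newer epoch gets the vote; any later
 * request in the same (or an older) epoch receives the stored vote back. */
const char *toy_request_vote(toy_voter *v, uint64_t epoch,
                             const char *candidate) {
    if (epoch > v->voted_epoch) {
        v->voted_epoch = epoch;
        strncpy(v->voted_for, candidate, sizeof(v->voted_for) - 1);
        v->voted_for[sizeof(v->voted_for) - 1] = '\0';
    }
    return v->voted_for;
}
```

With a voter that has never voted, a request from "sentinel-a" in epoch 1 is answered with "sentinel-a", and a later request from "sentinel-b" in the same epoch is also answered with "sentinel-a". Only a request carrying epoch 2 can change the vote.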

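Election flow: in Raft every node starts as a Follower, and some mechanism then triggers an election. Usually that mechanism is a heartbeat timeout, but Redis Sentinel uses objective down instead. The node that triggers the election changes its role to Candidate and sends vote requests to the other nodes. If a Leader is chosen within a certain time, the election ends; otherwise the next round begins.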

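Sentinel's implementation of Raft differs from a typical one, as the source code below shows in detail. Continuing from (29): after the objective-down check, sentinelHandleRedisInstance goes on to execute an if statement, shown in the figure below: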

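[Figure: the if statement in sentinelHandleRedisInstance that follows the objective-down check (image not available)]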

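The if condition calls sentinelStartFailoverIfNeeded, which starts a failover once the master is objectively down. The first step of a failover is to elect a leader among the Sentinels; that leader then carries out the failover of the master and its replicas. The function returns true if a failover was started. sentinelAskMasterStateToOtherSentinels is then executed; it sends requests to the other Sentinels, either to synchronize state or to run the election. Last comes sentinelFailoverStateMachine, a function that works like a state machine: it runs a different function depending on which phase the failover has reached.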
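
The order of the three calls can be sketched with stub functions that only record a trace. Everything here is hypothetical scaffolding: the toy_ names stand in for the three Sentinel functions named in the paragraph, and the start check is reduced to two of its conditions (objectively down, no failover in progress), leaving out the "attempted too recently" test.

```c
#include <string.h>

/* Trace buffer filled in by the stubs below (illustration only). */
static char toy_trace[128];

static void toy_log(const char *step) {
    size_t room = sizeof(toy_trace) - strlen(toy_trace) - 1;
    if (toy_trace[0] != '\0') {
        strncat(toy_trace, ",", room);
        room = sizeof(toy_trace) - strlen(toy_trace) - 1;
    }
    strncat(toy_trace, step, room);
}

/* Stand-in for sentinelStartFailoverIfNeeded: non-zero if a failover starts. */
static int toy_start_failover_if_needed(int odown, int in_progress) {
    toy_log("start-check");
    return odown && !in_progress;
}

/* Stand-in for sentinelAskMasterStateToOtherSentinels. */
static void toy_ask_other_sentinels(void) { toy_log("ask"); }

/* Stand-in for sentinelFailoverStateMachine. */
static void toy_failover_state_machine(void) { toy_log("state-machine"); }

/* The flow described in the text: check, ask only if a failover was
 * started, then always step the state machine. Returns the trace. */
const char *toy_handle_master(int odown, int in_progress) {
    toy_trace[0] = '\0';
    if (toy_start_failover_if_needed(odown, in_progress))
        toy_ask_other_sentinels();
    toy_failover_state_machine();
    return toy_trace;
}
```

For a master that is objectively down with no failover running, the trace is "start-check,ask,state-machine"; in every other case the "ask" step is skipped.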

These three functions are analyzed one by one, starting with sentinelStartFailoverIfNeeded:

/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;

    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];

            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }

    sentinelStartFailover(master);
    return 1;
}

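This function is simple: its three if conditions are the three cases in which no failover is started. First, the master is not objectively down. Second, a failover is already in progress. Third, the previous failover attempt started too recently.
If none of the three applies, a failover must start, and the function calls sentinelStartFailover, which sets the failover state to SENTINEL_FAILOVER_STATE_WAIT_START: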

/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);

    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}

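This function mostly assigns fields. First comes failover_state on line 5, which holds the state of the failover. A failover has several steps; Redis models each step as a distinct state and, at runtime, runs the function that matches the current state.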

Next is flags on line 6, which ties back to the second if condition in sentinelStartFailoverIfNeeded: it prevents a failover from being started repeatedly.

Then comes failover_epoch on line 7, which corresponds to the epoch in Raft.

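Finally, two timestamps are recorded: failover_start_time and failover_state_change_time. The first is the time the failover started; the second is the time the failover state last changed.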
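
How failover_start_time interacts with the "attempted too recently" check in sentinelStartFailoverIfNeeded can be worked through with a small sketch. The function names are hypothetical, TOY_MAX_DESYNC mirrors SENTINEL_MAX_DESYNC with an assumed value of 1000 ms, and the 180000 ms timeout used in the example is an assumed failover-timeout setting.

```c
#include <stdint.h>

typedef int64_t toy_mstime;   /* milliseconds, in the spirit of mstime_t */

#define TOY_MAX_DESYNC 1000   /* assumed value of SENTINEL_MAX_DESYNC */

/* Start time as recorded by the listing: now plus a random offset in
 * [0, TOY_MAX_DESYNC). The caller passes a non-negative random value,
 * which keeps the sketch deterministic. */
toy_mstime toy_record_start(toy_mstime now, int random_value) {
    return now + (random_value % TOY_MAX_DESYNC);
}

/* Returns 1 while a new failover attempt is still blocked, that is while
 * less than twice the failover timeout has passed since the last start. */
int toy_failover_blocked(toy_mstime now, toy_mstime failover_start_time,
                         toy_mstime failover_timeout) {
    return (now - failover_start_time) < failover_timeout * 2;
}
```

With a timeout of 180000 ms, a start recorded at time T blocks new attempts until T + 360000 ms: the check still returns 1 at T + 359999 and returns 0 at T + 360000.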

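Once this function returns, sentinelStartFailoverIfNeeded is finished as well. If it returns 1, sentinelAskMasterStateToOtherSentinels is executed and the Sentinel starts talking to the other Sentinels. In Raft terms, this is the point where voting begins. A Raft election needs a Candidate, and a Candidate is a converted Follower. The conversion is usually driven by heartbeats: when a Follower sees its heartbeat with the Leader time out, it turns itself into a Candidate and starts a new election round.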

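In Sentinel the trigger is objective down, and the function that turns a Follower into a Candidate is sentinelStartFailoverIfNeeded, analyzed above. Once it has become a Candidate, the Sentinel immediately sends a vote request to the other Sentinel servers. This is done by sentinelAskMasterStateToOtherSentinels: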

/* If we think the master is down, we start sending
 * SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
 * in order to get the replies that allow to reach the quorum
 * needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}

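This function is quite simple. It first obtains an iterator over the stored Sentinels (line 10), walks through them with a while loop (line 11), and calls redisAsyncCommand (line 37) to send data to each one.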
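
Before it sends anything, the loop body applies a few filters that are easy to miss in the listing: a cached opinion is cleared once it is older than five ask periods, and a peer is only queried if the master is flagged down, the link is connected, and either the call is forced or the last reply is at least one period old. A minimal sketch of that logic follows; the struct and function names are hypothetical, and TOY_ASK_PERIOD mirrors SENTINEL_ASK_PERIOD with an assumed value of 1000 ms.

```c
#include <stdint.h>

#define TOY_ASK_PERIOD 1000    /* assumed value of SENTINEL_ASK_PERIOD (ms) */
#define TOY_ASK_FORCED (1<<0)  /* same bit as SENTINEL_ASK_FORCED */

/* Hypothetical, reduced view of another Sentinel. */
typedef struct {
    int64_t last_reply_time;   /* ms timestamp of its last reply */
    int disconnected;          /* non-zero if the link is down */
    int master_down;           /* its cached opinion about the master */
} toy_peer;

/* Forget a cached opinion older than five ask periods.
 * Returns 1 if the opinion was cleared. */
int toy_expire_opinion(toy_peer *p, int64_t now) {
    if (now - p->last_reply_time > TOY_ASK_PERIOD * 5) {
        p->master_down = 0;
        return 1;
    }
    return 0;
}

/* Decide whether to query this peer on the current pass. */
int toy_should_ask(const toy_peer *p, int64_t now, int master_sdown,
                   int flags) {
    if (!master_sdown) return 0;        /* we do not think it is down */
    if (p->disconnected) return 0;      /* no usable link */
    if (!(flags & TOY_ASK_FORCED) &&
        now - p->last_reply_time < TOY_ASK_PERIOD)
        return 0;                       /* asked recently enough */
    return 1;
}
```

The forced flag only bypasses the rate limit. A peer whose link is down, or a master that is not flagged down locally, is skipped regardless.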

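redisAsyncCommand was covered in (27), so the focus here is on the arguments passed to it. Two of them matter most: sentinelReceiveIsMasterDownReply, the function that handles the reply, and the command format string "%s is-master-down-by-addr %s %s %llu %s".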

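Start with the command arguments. The first %s is the result of the call on line 40: sentinelInstanceMapCommand, mentioned in an earlier part, deals with renamed commands, and it returns SENTINEL if the command has not been renamed. The second %s is the master's IP, the third %s is the master's port, and the fourth argument (%llu) is the current election epoch. The last %s is special and has two cases: once a failover has started it carries this Sentinel's own id, sentinel.myid; otherwise it is "*". The master->failover_state field tested on line 43 was described above: sentinelStartFailover sets it to SENTINEL_FAILOVER_STATE_WAIT_START, whose value is 1 and which represents the Sentinel election phase. It is compared against SENTINEL_FAILOVER_STATE_NONE, whose value is 0 and which means no failover is in progress. The failover has further phases with their own constants, all of them greater than 0.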

Putting this together, the command the Sentinel sends at this point is:

SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid>
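
To make the two forms of the last argument concrete, here is a small sketch that builds the command text with snprintf. It is only an illustration of the argument layout: toy_format_ask is a hypothetical helper, the real code hands the format string to redisAsyncCommand instead, and the id "abc123" in the usage note is a made-up placeholder rather than a real Sentinel id.

```c
#include <stdio.h>
#include <stdint.h>

#define TOY_FAILOVER_STATE_NONE 0  /* mirrors SENTINEL_FAILOVER_STATE_NONE */

/* Build the command text. The last argument is our own id when a failover
 * is under way (failover_state > NONE), and "*" otherwise.
 * Returns the snprintf result. */
int toy_format_ask(char *buf, size_t size, const char *ip, int port,
                   uint64_t current_epoch, int failover_state,
                   const char *myid) {
    const char *runid =
        (failover_state > TOY_FAILOVER_STATE_NONE) ? myid : "*";
    return snprintf(buf, size,
                    "SENTINEL is-master-down-by-addr %s %d %llu %s",
                    ip, port, (unsigned long long)current_epoch, runid);
}
```

With failover_state 0 the text ends in "*", which only asks for the peer's opinion. With failover_state 1 (the wait-start state) the same call ends in "abc123" and therefore also asks for a vote.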

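Note that this command is sent to the other Sentinel servers. Part (26) showed that a Sentinel starts through the same startup path as a Redis database server, so it processes commands in the same way; the difference lies in the set of commands it actually accepts. As (26) mentioned, Sentinel startup calls initSentinel, shown below: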

/* Perform the Sentinel mode initialization. */
void initSentinel(void) {
    unsigned int j;

    /* Remove usual Redis commands from the command table, then just add
     * the SENTINEL command. */
    dictEmpty(server.commands,NULL);
    for (j = 0; j < sizeof(sentinelcmds)/sizeof(sentinelcmds[0]); j++) {
        int retval;
        struct redisCommand *cmd = sentinelcmds+j;

        retval = dictAdd(server.commands, sdsnew(cmd->name), cmd);
        serverAssert(retval == DICT_OK);
    }

    /* Initialize various data structures. */
    sentinel.current_epoch = 0;
    sentinel.masters = dictCreate(&instancesDictType,NULL);
    sentinel.tilt = 0;
    sentinel.tilt_start_time = 0;
    sentinel.previous_time = mstime();
    sentinel.running_scripts = 0;
    sentinel.scripts_queue = listCreate();
    sentinel.announce_ip = NULL;
    sentinel.announce_port = 0;
    sentinel.simfailure_flags = SENTINEL_SIMFAILURE_NONE;
    sentinel.deny_scripts_reconfig = SENTINEL_DEFAULT_DENY_SCRIPTS_RECONFIG;
    memset(sentinel.myid,0,sizeof(sentinel.myid));
}

Lines 7 to 14 empty the existing command table and then loop over an array named sentinelcmds, adding each entry as a new command. The contents of sentinelcmds are:

struct redisCommand sentinelcmds[] = {
    {"ping",pingCommand,1,"",0,NULL,0,0,0,0,0},
    {"sentinel",sentinelCommand,-2,"",0,NULL,0,0,0,0,0},
    {"subscribe",subscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"unsubscribe",unsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"psubscribe",psubscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"punsubscribe",punsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"publish",sentinelPublishCommand,3,"",0,NULL,0,0,0,0,0},
    {"info",sentinelInfoCommand,-1,"",0,NULL,0,0,0,0,0},
    {"role",sentinelRoleCommand,1,"l",0,NULL,0,0,0,0,0},
    {"client",clientCommand,-2,"rs",0,NULL,0,0,0,0,0},
    {"shutdown",shutdownCommand,-1,"",0,NULL,0,0,0,0,0},
    {"auth",authCommand,2,"sltF",0,NULL,0,0,0,0,0}
};
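
The effect of swapping in this table can be sketched as a lookup that only knows the Sentinel commands. This is a simplified illustration under stated assumptions: Redis stores commands in a dict, whereas the sketch uses a linear scan, and the toy_ types, the two handlers, and their return strings are all hypothetical.

```c
#include <stddef.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical handler type: returns a text reply. */
typedef const char *(*toy_handler)(void);

static const char *toy_ping(void) { return "PONG"; }
static const char *toy_sentinel(void) { return "sentinel-handler"; }

typedef struct {
    const char *name;
    toy_handler proc;
} toy_command;

/* Reduced stand-in for sentinelcmds: only two entries are modelled. */
static const toy_command toy_sentinel_cmds[] = {
    {"ping", toy_ping},
    {"sentinel", toy_sentinel},
};

/* Case-insensitive lookup. Returns NULL for any command that is not
 * in the table. */
toy_handler toy_lookup(const char *name) {
    size_t n = sizeof(toy_sentinel_cmds) / sizeof(toy_sentinel_cmds[0]);
    for (size_t j = 0; j < n; j++) {
        if (strcasecmp(toy_sentinel_cmds[j].name, name) == 0)
            return toy_sentinel_cmds[j].proc;
    }
    return NULL;
}
```

Looking up "SENTINEL" or "Ping" finds a handler, while an ordinary data command such as "get" yields NULL, which is the point of emptying the table before filling it from sentinelcmds.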

Here we can see that the SENTINEL command is handled by a function named sentinelCommand. The part of it that handles the is-master-down-by-addr subcommand is:

void sentinelCommand(client *c) {
    if (!strcasecmp(c->argv[1]->ptr,"masters")) {
    
    ...
    
    } else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         *
         * ip and port are the ip and port of the master we want to be
         * checked by Sentinel. Note that the command will not check by
         * name but just by master, in theory different Sentinels may monitor
         * differnet masters with the same name.
         *
         * current-epoch is needed in order to understand if we are allowed
         * to vote for a failover leader or not. Each Sentinel can vote just
         * one time per epoch.
         *
         * runid is "*" if we are not seeking for a vote from the Sentinel
         * in order to elect the failover leader. Otherwise it is set to the
         * runid we want the Sentinel to vote if it did not already voted.
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0;
        char *leader = NULL;
        long port;
        int isdown = 0;

        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                              != C_OK)
            return;
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);

        /* It exists? Is actually a master? Is subjectively down? It's down.
         * Note: if we are in tilt mode we always reply with "0". */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1;

        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }

        /* Reply with a three-elements multi-bulk reply:
         * down state, leader, vote epoch. */
        addReplyMultiBulkLen(c,3);
        addReply(c, isdown ? shared.cone : shared.czero);
        addReplyBulkCString(c, leader ? leader : "*");
        addReplyLongLong(c, (long long)leader_epoch);
        if (leader) sdsfree(leader);
    } else if (!strcasecmp(c->argv[1]->ptr,"reset")) {
    
    ...
}

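As explained above, the last argument of the command, runid, takes one of two values: the sender's runid during a Sentinel election, or "*" at all other times. When "*" is sent, the command's main purpose is to synchronize the subjective-down state between Sentinels. This is the data synchronization referred to in the earlier analysis of objective down. Because the sending function is called repeatedly in a loop at runtime, the state is in effect synchronized in real time.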

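In the code above, lines 24 to 37 parse the command arguments and use them to look up the corresponding server instance (ri). Line 41 then checks this Sentinel's own subjective-down verdict, simply by testing the flags on ri. Line 47 handles the election case, casting the vote through sentinelVoteLeader. Finally, lines 55 to 59 send data back to the requesting server. Three values are returned: line 56 returns the subjective-down state (0 or 1), line 57 returns the result of the vote, and line 58 returns the election epoch.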
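
The three-element reply can be modelled as a small struct to show how its fields are derived. This is a sketch, not the Redis reply API: toy_reply and toy_build_reply are hypothetical, the master lookup is collapsed into a known_master flag, and vote is the id returned by the voting step (NULL when no vote was requested or none is stored).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the three-element reply. */
typedef struct {
    int isdown;             /* 1 if we consider the master subjectively down */
    const char *leader;     /* voted leader id, or "*" when there is no vote */
    uint64_t leader_epoch;  /* epoch of that vote, 0 when there is no vote */
} toy_reply;

/* Derive the reply fields. In tilt mode the down state is always 0. */
toy_reply toy_build_reply(int tilt, int known_master, int sdown,
                          const char *vote, uint64_t vote_epoch) {
    toy_reply r;
    r.isdown = (!tilt && known_master && sdown) ? 1 : 0;
    r.leader = vote ? vote : "*";
    r.leader_epoch = vote ? vote_epoch : 0;
    return r;
}
```

A plain state query against a master this Sentinel sees as down yields (1, "*", 0). A vote request answered while the Sentinel is in tilt mode still carries the vote, for example (0, "abc123", 7) for a made-up id, but reports the down state as 0.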

The sentinelVoteLeader function that casts the vote is:

char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}

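Voting here follows Raft's rule: within one epoch a Sentinel gives its vote to the first candidate whose request it receives, and it votes only once per epoch.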

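First, the if statement on line 2 compares the epoch in the request with the Sentinel's own stored epoch. A larger request epoch means a new election round has started, so the Sentinel must update its own epoch and vote. This first if statement only updates the epoch (line 3). The if statement on line 9 is where the vote is actually cast, and casting it is simple: record the runid (line 12). Finally, the function returns the result of the vote (master->leader).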

With the receiving side covered, we return to the Sentinel that sent the command and look at how it handles the reply.

As noted above, the sender registered sentinelReceiveIsMasterDownReply to handle the command's reply. The function is:

void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;

    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;

    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}

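From the earlier analysis we know the reply has three values. The first is the replying Sentinel's subjective-down state, handled on line 19, which is mainly an assignment. The second is the result of the vote: as the earlier code showed, it is "*" outside an election and the runid of the chosen server during one. When a runid is returned, the code under line 24 runs, which again is mainly assignment (lines 33 and 34).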
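
The same handling can be sketched without hiredis types, which makes the two independent updates easier to see: the down flag is always refreshed, while the stored vote changes only when the reply names a leader. The struct, the flag constant, and the function are hypothetical stand-ins; the real callback also validates the reply shape and records the reply time.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MASTER_DOWN (1<<0)  /* stands in for SRI_MASTER_DOWN */

/* Hypothetical, reduced record of what another Sentinel told us. */
typedef struct {
    int flags;
    char leader[41];        /* last vote reported by that Sentinel */
    uint64_t leader_epoch;  /* epoch of that vote */
} toy_peer_view;

/* Apply one already-validated reply (down state, leader, epoch). */
void toy_apply_reply(toy_peer_view *p, int isdown, const char *leader,
                     uint64_t epoch) {
    if (isdown == 1)
        p->flags |= TOY_MASTER_DOWN;
    else
        p->flags &= ~TOY_MASTER_DOWN;

    /* "*" means the peer did not vote, so the stored vote is left alone. */
    if (strcmp(leader, "*") != 0) {
        strncpy(p->leader, leader, sizeof(p->leader) - 1);
        p->leader[sizeof(p->leader) - 1] = '\0';
        p->leader_epoch = epoch;
    }
}
```

A reply of (1, "*", 0) sets the down flag and leaves the vote untouched. A later reply of (0, "abc123", 5), with a made-up id, clears the flag and records the vote together with its epoch.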
