Redis Cluster Analysis (30)

1. Leader Election

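Part (29) covered how the subjective and objective down states are determined, but in discussing the objective down state it skipped how Sentinels synchronize data with one another. That mechanism is the same one used to exchange data during leader election, so it is covered here instead.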

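Sentinels elect a leader with the Raft algorithm, so a brief introduction to Raft's election procedure comes first. Raft assigns each server one of three roles: Leader, Follower, or Candidate. A Candidate exists only while an election is in progress. An epoch identifies each election round: the first election has epoch 1, the second has epoch 2, and so on. Sentinels follow the same voting rule: within one epoch a Sentinel votes for only one leader, and that vote goes to the first candidate whose request it receives.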

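The election flow is as follows. In Raft every node starts out as a Follower, and some mechanism then triggers an election. That mechanism is usually a heartbeat timeout, but Redis Sentinel uses the objective down state instead. The node that triggers the election changes its own role to Candidate and sends vote requests to the other nodes. If a Leader is elected within a certain time, the election ends; otherwise a new round begins.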
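The flow above can be sketched as a small role state machine. This is a generic Raft-style illustration, not Sentinel code: the names role_t, node_t, start_election, receive_vote and election_timeout are all hypothetical, and the simple majority rule is an assumption taken from Raft rather than from the Sentinel source.

```c
#include <assert.h>

/* Hypothetical names for illustration; Sentinel itself has no such enum. */
typedef enum { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER } role_t;

typedef struct {
    role_t role;
    unsigned long long epoch;   /* election round */
    int votes;                  /* votes received in the current epoch */
} node_t;

/* The trigger fired (heartbeat timeout in Raft, objective down in Sentinel):
 * become a Candidate in a new epoch and count our own vote. */
void start_election(node_t *n) {
    assert(n != 0);
    n->role = ROLE_CANDIDATE;
    n->epoch++;
    n->votes = 1;
}

/* Record one granted vote; become Leader once a majority is reached. */
void receive_vote(node_t *n, int cluster_size) {
    assert(n != 0 && cluster_size > 0);
    if (n->role != ROLE_CANDIDATE) return;
    n->votes++;
    if (n->votes > cluster_size / 2) n->role = ROLE_LEADER;
}

/* No winner within the allowed time: start another round. */
void election_timeout(node_t *n) {
    assert(n != 0);
    if (n->role == ROLE_CANDIDATE) start_election(n);
}
```

With three nodes, a Candidate that receives one vote besides its own already holds a majority and becomes Leader.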

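Sentinel's implementation of Raft differs from a typical one, as the source code below shows in detail. First, picking up where (29) left off: after the objective down check, sentinelHandleRedisInstance goes on to execute an if statement, shown in the figure below: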

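[Figure: the if statement in sentinelHandleRedisInstance that follows the objective down check; image not available]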

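The if condition calls sentinelStartFailoverIfNeeded, which starts a failover once the master is objectively down. The first step of a failover is to elect a leader among the Sentinels; that leader then carries out the failover of the master and replica servers. If a failover was started, the method returns true and sentinelAskMasterStateToOtherSentinels is executed next. That method sends requests to the other Sentinels, either to synchronize data or to hold the election. Finally, sentinelFailoverStateMachine implements a state-machine-like mechanism that runs different methods depending on the stage the failover has reached.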

The three methods are examined one at a time, starting with sentinelStartFailoverIfNeeded:

/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;

    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];

            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }

    sentinelStartFailover(master);
    return 1;
}

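This method is simple: its three if statements are the three cases in which a failover is not started. The first is that the master is not objectively down. The second is that a failover is already in progress. The third is that the previous failover attempt started too recently, that is, less than twice failover_timeout ago.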
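The three guards can be modeled in isolation to make the timing rule easier to check. This is a simplified sketch, not the Redis function: may_start_failover and the FLAG_* values are hypothetical stand-ins for the real method and the SRI_* constants, and the logging branch is left out.

```c
#include <assert.h>

/* Illustrative flag values; the real SRI_* constants live in sentinel.c. */
#define FLAG_O_DOWN               (1 << 0)
#define FLAG_FAILOVER_IN_PROGRESS (1 << 1)

typedef long long mstime_t;   /* milliseconds */

/* Returns 1 if a failover may start now, 0 otherwise. */
int may_start_failover(int flags, mstime_t now,
                       mstime_t failover_start_time,
                       mstime_t failover_timeout) {
    assert(failover_timeout >= 0);
    if (!(flags & FLAG_O_DOWN)) return 0;              /* not objectively down */
    if (flags & FLAG_FAILOVER_IN_PROGRESS) return 0;   /* already running */
    if (now - failover_start_time < failover_timeout * 2)
        return 0;                                      /* last attempt too recent */
    return 1;
}
```

For example, with a failover_timeout of 180000 ms and a previous attempt at time 0, the check keeps failing until 360000 ms have passed.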
If none of the three cases applies, a failover needs to start, and the method calls sentinelStartFailover, which sets the failover state to SENTINEL_FAILOVER_STATE_WAIT_START:

/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);

    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}

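This method mostly assigns fields. First is failover_state on line 5, which holds the state of the failover. A failover consists of several steps, and Redis represents each step as a distinct state; at run time the server executes the method that corresponds to the current state.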

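Next is flags on line 6. This ties in with the second if condition in sentinelStartFailoverIfNeeded above: setting SRI_FAILOVER_IN_PROGRESS prevents a failover from being started repeatedly.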

Then comes failover_epoch on line 7, which corresponds to the epoch in Raft.

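Finally, two timestamps are recorded: failover_start_time and failover_state_change_time. The first is the time the failover started, offset by a random amount below SENTINEL_MAX_DESYNC milliseconds. The second is the time the failover state last changed.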
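The random offset on failover_start_time is easy to overlook, so here is a minimal sketch of it on its own. The helper name jittered_start_time is hypothetical, and the value 1000 for the desync bound is an assumption; check the SENTINEL_MAX_DESYNC definition in the sentinel.c version you are reading. The likely purpose is to keep several Sentinels from retrying at exactly the same moment, although the source shown here does not state that.

```c
#include <assert.h>
#include <stdlib.h>

typedef long long mstime_t;   /* milliseconds */

/* Assumed bound; sentinel.c defines SENTINEL_MAX_DESYNC in milliseconds. */
#define MAX_DESYNC_MS 1000

/* Offset the recorded start time by a random 0..MAX_DESYNC_MS-1 ms,
 * mirroring mstime()+rand()%SENTINEL_MAX_DESYNC in the listing. */
mstime_t jittered_start_time(mstime_t now) {
    assert(now >= 0);
    return now + rand() % MAX_DESYNC_MS;
}
```

Because the same field feeds the "too recent" check in sentinelStartFailoverIfNeeded, the offset also shifts the earliest time of the next attempt by up to the same amount.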

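Once this method returns, sentinelStartFailoverIfNeeded is done. If it returns 1, sentinelAskMasterStateToOtherSentinels is executed and communication with the other Sentinels begins. In terms of a Raft election, this is where voting starts. A Raft election needs a Candidate, and a Candidate is a converted Follower. The conversion is normally driven by heartbeats: when a Follower's heartbeat with the Leader times out, it turns itself into a Candidate and starts a new election round.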

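In Sentinel the trigger is the objective down state, and the method that turns a Follower into a Candidate is sentinelStartFailoverIfNeeded, analyzed above. Once it is a Candidate, the Sentinel immediately sends vote requests to the other Sentinels through sentinelAskMasterStateToOtherSentinels: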

/* If we think the master is down, we start sending
 * SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
 * in order to get the replies that allow to reach the quorum
 * needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}

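This method is actually simple. It fetches the stored Sentinels (line 10), walks through all of them in a while loop (line 11), and, unless one of the skip conditions applies, calls redisAsyncCommand (line 37) to send data to each one.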
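The per-peer checks inside the loop can be modeled separately from the networking code. The sketch below is an illustration under assumptions: ask_peer_t, reply_is_stale and should_ask are hypothetical names, and 1000 ms for the ask period is an assumed value of SENTINEL_ASK_PERIOD.

```c
#include <assert.h>
#include <stddef.h>

typedef long long mstime_t;   /* milliseconds */

#define ASK_PERIOD_MS 1000      /* assumed value of SENTINEL_ASK_PERIOD */
#define ASK_FORCED    (1 << 0)  /* mirrors SENTINEL_ASK_FORCED */

/* Hypothetical, trimmed-down view of one peer Sentinel. */
typedef struct {
    int disconnected;
    mstime_t last_reply_time;
} ask_peer_t;

/* Cached answers older than five ask periods are discarded. */
int reply_is_stale(const ask_peer_t *p, mstime_t now) {
    assert(p != NULL);
    return now - p->last_reply_time > ASK_PERIOD_MS * 5;
}

/* Decide whether to send is-master-down-by-addr to this peer. */
int should_ask(const ask_peer_t *p, int master_sdown, int flags, mstime_t now) {
    assert(p != NULL);
    if (!master_sdown) return 0;          /* we do not think the master is down */
    if (p->disconnected) return 0;        /* no usable link to the peer */
    if (!(flags & ASK_FORCED) && now - p->last_reply_time < ASK_PERIOD_MS)
        return 0;                         /* answered recently, not forced */
    return 1;
}
```

The forced flag is what lets a freshly started failover ask every connected peer at once instead of waiting for the ask period to expire.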

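redisAsyncCommand was covered in (27), so the focus here is on what it is given. Two arguments matter most: sentinelReceiveIsMasterDownReply, the callback that handles the reply, and the command format string "%s is-master-down-by-addr %s %s %llu %s".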

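Start with the command itself. The first %s is the result of the call on line 40. sentinelInstanceMapCommand, mentioned earlier in this series, deals with renamed commands; if the command has not been renamed, its value is SENTINEL. The second %s is the master's IP, the third %s is the master's port, and the fourth argument (%llu) is the current election epoch. The last %s is special and has two cases: once a failover has started, the Sentinel sends its own server id, sentinel.myid; otherwise it sends "*".

The master->failover_state field tested on line 43 was discussed with sentinelStartFailover above, where it is assigned the new value SENTINEL_FAILOVER_STATE_WAIT_START. That constant is 1 and stands for the Sentinel election stage, while the value it is compared against, SENTINEL_FAILOVER_STATE_NONE, is 0 and means no failover is in progress. A failover has further stages with their own constants, all of them greater than 0.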

Putting this together, the command a Sentinel sends at this point is:

SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid>
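To make the two forms of the last argument concrete, the request line can be built with plain snprintf. This is only a sketch of the text that goes on the wire: format_is_master_down is a hypothetical helper, it assumes the command has not been renamed, and the real code hands the format string to redisAsyncCommand instead of formatting it itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the request line into buf and returns its length, or -1 if it
 * does not fit. A NULL runid means "state query only" and is sent as "*". */
int format_is_master_down(char *buf, size_t len, const char *ip, int port,
                          unsigned long long epoch, const char *runid) {
    assert(buf != NULL && ip != NULL);
    int n = snprintf(buf, len, "SENTINEL is-master-down-by-addr %s %d %llu %s",
                     ip, port, epoch, runid ? runid : "*");
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Calling it with a NULL runid produces the state-query form ending in *, and calling it with the sender's id produces the vote request.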

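Note that this command is sent to other Sentinel servers. Part (26) covered how a Sentinel starts up: it uses the same startup path as the Redis database server, so it processes commands the same way. What differs is the set of commands it accepts. As (26) mentioned, Sentinel startup calls initSentinel: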

/* Perform the Sentinel mode initialization. */
void initSentinel(void) {
    unsigned int j;

    /* Remove usual Redis commands from the command table, then just add
     * the SENTINEL command. */
    dictEmpty(server.commands,NULL);
    for (j = 0; j < sizeof(sentinelcmds)/sizeof(sentinelcmds[0]); j++) {
        int retval;
        struct redisCommand *cmd = sentinelcmds+j;

        retval = dictAdd(server.commands, sdsnew(cmd->name), cmd);
        serverAssert(retval == DICT_OK);
    }

    /* Initialize various data structures. */
    sentinel.current_epoch = 0;
    sentinel.masters = dictCreate(&instancesDictType,NULL);
    sentinel.tilt = 0;
    sentinel.tilt_start_time = 0;
    sentinel.previous_time = mstime();
    sentinel.running_scripts = 0;
    sentinel.scripts_queue = listCreate();
    sentinel.announce_ip = NULL;
    sentinel.announce_port = 0;
    sentinel.simfailure_flags = SENTINEL_SIMFAILURE_NONE;
    sentinel.deny_scripts_reconfig = SENTINEL_DEFAULT_DENY_SCRIPTS_RECONFIG;
    memset(sentinel.myid,0,sizeof(sentinel.myid));
}

Lines 7 to 14 remove the previously registered commands and then loop over an array named sentinelcmds, registering each entry as a new command. The contents of sentinelcmds are:

struct redisCommand sentinelcmds[] = {
    {"ping",pingCommand,1,"",0,NULL,0,0,0,0,0},
    {"sentinel",sentinelCommand,-2,"",0,NULL,0,0,0,0,0},
    {"subscribe",subscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"unsubscribe",unsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"psubscribe",psubscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"punsubscribe",punsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"publish",sentinelPublishCommand,3,"",0,NULL,0,0,0,0,0},
    {"info",sentinelInfoCommand,-1,"",0,NULL,0,0,0,0,0},
    {"role",sentinelRoleCommand,1,"l",0,NULL,0,0,0,0,0},
    {"client",clientCommand,-2,"rs",0,NULL,0,0,0,0,0},
    {"shutdown",shutdownCommand,-1,"",0,NULL,0,0,0,0,0},
    {"auth",authCommand,2,"sltF",0,NULL,0,0,0,0,0}
};
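The idea of swapping in a reduced command table can be shown with a much smaller model. The sketch below is an assumption-laden simplification: command_t, lookup_command and arity_ok are hypothetical, only two fields of the real struct redisCommand are kept, and a linear scan replaces the dict that Redis actually uses. The arity convention (positive means an exact argument count, negative means at least that many) is the one Redis uses.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

typedef struct {
    const char *name;
    int arity;   /* >0: argc must equal it; <0: argc must be at least -arity */
} command_t;

/* A few entries copied from sentinelcmds; everything else is absent. */
static const command_t sentinel_table[] = {
    {"ping", 1},
    {"sentinel", -2},
    {"publish", 3},
};

/* Case-insensitive lookup; returns NULL for commands Sentinel does not have. */
const command_t *lookup_command(const char *name) {
    assert(name != NULL);
    size_t n = sizeof(sentinel_table) / sizeof(sentinel_table[0]);
    for (size_t j = 0; j < n; j++)
        if (!strcasecmp(sentinel_table[j].name, name)) return &sentinel_table[j];
    return NULL;
}

/* Check an argument count (command name included) against the arity. */
int arity_ok(const command_t *cmd, int argc) {
    assert(cmd != NULL);
    return cmd->arity > 0 ? argc == cmd->arity : argc >= -cmd->arity;
}
```

A lookup for GET returns NULL in this model, which is the effect of emptying server.commands before adding the Sentinel entries.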

Here we can see that the SENTINEL command is handled by sentinelCommand. The part of that method that handles the is-master-down-by-addr subcommand is:

void sentinelCommand(client *c) {
    if (!strcasecmp(c->argv[1]->ptr,"masters")) {
    
    ...
    
    } else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         *
         * ip and port are the ip and port of the master we want to be
         * checked by Sentinel. Note that the command will not check by
         * name but just by master, in theory different Sentinels may monitor
         * differnet masters with the same name.
         *
         * current-epoch is needed in order to understand if we are allowed
         * to vote for a failover leader or not. Each Sentinel can vote just
         * one time per epoch.
         *
         * runid is "*" if we are not seeking for a vote from the Sentinel
         * in order to elect the failover leader. Otherwise it is set to the
         * runid we want the Sentinel to vote if it did not already voted.
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0;
        char *leader = NULL;
        long port;
        int isdown = 0;

        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                              != C_OK)
            return;
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);

        /* It exists? Is actually a master? Is subjectively down? It's down.
         * Note: if we are in tilt mode we always reply with "0". */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1;

        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }

        /* Reply with a three-elements multi-bulk reply:
         * down state, leader, vote epoch. */
        addReplyMultiBulkLen(c,3);
        addReply(c, isdown ? shared.cone : shared.czero);
        addReplyBulkCString(c, leader ? leader : "*");
        addReplyLongLong(c, (long long)leader_epoch);
        if (leader) sdsfree(leader);
    } else if (!strcasecmp(c->argv[1]->ptr,"reset")) {
    
    ...
}

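As noted above, the last argument, runid, takes one of two values: the sender's runid during an election, or "*" at all other times. When "*" is sent, the command mainly serves to synchronize the subjective down state between Sentinels. This is the data synchronization mentioned in the earlier discussion of the objective down state. Because the sending method is called in a loop while the Sentinel runs, the state is in effect synchronized continuously.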

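In the code above, lines 24 to 37 parse the command arguments and use them to locate the matching server instance (ri). Line 41 checks this Sentinel's own subjective down judgment, which it does simply by testing the flags on ri. Line 47 handles the election case by calling sentinelVoteLeader to cast a vote. Finally, lines 55 to 59 send the reply back to the caller. The reply has three elements: line 56 returns the subjective down state (0 or 1), line 57 returns the result of the vote, and line 58 returns the epoch of that vote.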
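The two decisions the handler makes, whether to report the master as down and whether the sender is asking for a vote, reduce to two small predicates. The function names wants_vote and reported_down are hypothetical; the conditions themselves are taken from the listing.

```c
#include <assert.h>
#include <string.h>

/* 1 if the request carries a real runid, i.e. the sender wants a vote. */
int wants_vote(const char *runid) {
    assert(runid != NULL);
    return strcmp(runid, "*") != 0;
}

/* The "down state" element of the reply. A Sentinel in TILT mode always
 * answers 0, as does one that does not know the instance as a master. */
int reported_down(int tilt, int instance_found, int is_master, int sdown) {
    return !tilt && instance_found && is_master && sdown;
}
```

Note that the down state reported to the peer is the subjective one (SRI_S_DOWN); the asking Sentinel is the one that combines the answers into an objective decision.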

The sentinelVoteLeader method called to cast the vote is:

char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}

The voting here follows Raft's rule: within one epoch the vote goes to the first candidate whose request is received, and only one vote is cast per epoch.

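First is the if statement on line 2, which compares the epoch in the request with the epoch stored locally. A larger request epoch means a new election round has started, so the Sentinel needs to update its own epoch and vote. This first if only updates the epoch (line 3). The if statement on line 9 is where the vote is actually cast, and casting it is simple: the runid is recorded (line 12). The method then returns the result of the vote, master->leader.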
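The "first candidate in an epoch wins" behavior can be checked with a stripped-down model of the two if statements. This is a sketch, not the Redis function: voter_t and vote are hypothetical, config flushing and event logging are omitted, and a fixed 41-byte buffer is assumed to be enough for a runid.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t current_epoch;   /* the voter's own epoch */
    uint64_t leader_epoch;    /* epoch of the last vote cast */
    char leader[41];          /* runid voted for, "" if none yet */
} voter_t;

/* Returns the runid this voter supports, which may differ from req_runid
 * when a vote was already cast in req_epoch or in a later epoch. */
const char *vote(voter_t *v, uint64_t req_epoch, const char *req_runid) {
    assert(v != NULL && req_runid != NULL);
    /* Step 1: adopt a newer epoch. */
    if (req_epoch > v->current_epoch) v->current_epoch = req_epoch;
    /* Step 2: vote only if no vote has been cast for this epoch yet. */
    if (v->leader_epoch < req_epoch && v->current_epoch <= req_epoch) {
        snprintf(v->leader, sizeof(v->leader), "%s", req_runid);
        v->leader_epoch = v->current_epoch;
    }
    return v->leader;
}
```

If candidates A and B both ask in epoch 1 and A arrives first, B is told that the vote went to A; B can only win this voter in a later epoch.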

With the receiving side covered, we return to the Sentinel that sent the command and look at how it handles the reply.

As mentioned above, the sender registers sentinelReceiveIsMasterDownReply to handle the reply. Its contents are:

void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;

    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;

    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}

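As the earlier analysis showed, the reply has three elements. The first is the replying Sentinel's subjective down state for the master; it is handled at line 19, mostly by setting or clearing a flag. The second is the result of the vote: as the earlier code shows, it is "*" when no election is under way and the runid of the chosen server during an election. When a runid is returned, the code at line 24 runs, which is again mostly assignment (lines 33 and 34).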
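The bookkeeping done by the callback can be reduced to a function over an already-parsed reply. This is a minimal sketch under assumptions: reply_peer_t and apply_reply are hypothetical, FLAG_MASTER_DOWN stands in for SRI_MASTER_DOWN, and the type checks on the hiredis reply as well as the log message are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FLAG_MASTER_DOWN (1 << 0)   /* illustrative stand-in for SRI_MASTER_DOWN */

/* Hypothetical view of what the asking Sentinel remembers about one peer. */
typedef struct {
    int flags;
    char leader[41];                 /* runid the peer voted for, "" if none */
    unsigned long long leader_epoch;
} reply_peer_t;

/* Apply one parsed reply: down state, leader runid or "*", vote epoch. */
void apply_reply(reply_peer_t *p, int isdown, const char *leader,
                 unsigned long long epoch) {
    assert(p != NULL && leader != NULL);
    if (isdown == 1) p->flags |= FLAG_MASTER_DOWN;
    else p->flags &= ~FLAG_MASTER_DOWN;
    if (strcmp(leader, "*") != 0) {  /* the peer actually replied with a vote */
        snprintf(p->leader, sizeof(p->leader), "%s", leader);
        p->leader_epoch = epoch;
    }
}
```

A reply whose second element is "*" updates only the down flag, so an earlier recorded vote from that peer is left untouched.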
