Redis Cluster Analysis (34)

huserblog

Tags: redis, database, nosql

Article link: https://blog.csdn.net/qq_39210987/article/details/113916503

1. Failover

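In (33), we analyzed the SEND_SLAVEOF_NOONE and WAIT_PROMOTION states of the failover. When analyzing WAIT_PROMOTION, we noted that it only handles the timeout case and does nothing for the success case, that is, it never changes the value of failover_state.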

When analyzing SEND_SLAVEOF_NOONE, we noted that a comment in its code says that whether this operation succeeded is confirmed through the info command.

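The info command came up when we analyzed how Sentinel discovers slave servers. First, document (24) analyzed how a redis server handles an info command it receives. Then document (28) analyzed when a Sentinel sends the info command to a redis server and how it processes the reply.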

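So there are two pieces of info-related handling. Judging from the comment mentioned above, the relevant one should be the code from document (28): the Sentinel performing the failover keeps sending info to the selected slave and uses the reply to decide whether the operation succeeded.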
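To make "uses the reply to decide" concrete, here is a minimal sketch of pulling the reported role out of an INFO reply. It assumes the reply contains a line beginning with role:master or role:slave; the enum and the function name parse_info_role are hypothetical and are not Sentinel's own code.

```c
#include <string.h>

/* Hypothetical role codes for this sketch (not Sentinel's SRI_* flags). */
enum info_role { ROLE_UNKNOWN = 0, ROLE_MASTER = 1, ROLE_SLAVE = 2 };

/* Scan an INFO reply line by line and return the role it reports.
 * Assumption: each line ends with "\n" (Redis uses "\r\n"). */
enum info_role parse_info_role(const char *info) {
    const char *line = info;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "role:master", 11) == 0) return ROLE_MASTER;
        if (strncmp(line, "role:slave", 10) == 0) return ROLE_SLAVE;
        /* Move to the start of the next line, or stop at end of text. */
        const char *nl = strchr(line, '\n');
        line = (nl != NULL) ? nl + 1 : NULL;
    }
    return ROLE_UNKNOWN;
}
```

A caller would compare the returned role with the role it currently has on record for that instance; a recorded slave that now reports ROLE_MASTER is exactly the case the failover code cares about.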

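Document (28) mentioned that this info command is sent from sentinelHandleRedisInstance, a method the Sentinel calls frequently. sentinelHandleRedisInstance calls sentinelSendPeriodicCommands to send a series of commands to the redis server, info among them. When info is sent, sentinelInfoReplyCallback is registered to handle the reply. That method was analyzed in (28); it in turn calls sentinelRefreshInstanceInfo to process the string returned by info. In (28) we covered the part of this method related to discovering slaves; here we cover the part related to failover, shown below: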

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
   ...
      /* Handle slave -> master role switch. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
        /* If this is a promoted slave we can change state to the
         * failover state machine. */
        if ((ri->flags & SRI_PROMOTED) &&
            (ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
            (ri->master->failover_state ==
                SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
        {
            /* Now that we are sure the slave was reconfigured as a master
             * set the master configuration epoch to the epoch we won the
             * election to perform this failover. This will force the other
             * Sentinels to update their config (assuming there is not
             * a newer one already available). */
            ri->master->config_epoch = ri->master->failover_epoch;
            ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
            ri->master->failover_state_change_time = mstime();
            sentinelFlushConfig();
            sentinelEvent(LL_WARNING,"+promoted-slave",ri,"%@");
            if (sentinel.simfailure_flags &
                SENTINEL_SIMFAILURE_CRASH_AFTER_PROMOTION)
                sentinelSimFailureCrash();
            sentinelEvent(LL_WARNING,"+failover-state-reconf-slaves",
                ri->master,"%@");
            sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
                "start",ri->master->addr,ri->addr);
            sentinelForceHelloUpdateForMaster(ri->master);
        } else {
            /* A slave turned into a master. We want to force our view and
             * reconfigure as slave. Wait some time after the change before
             * going forward, to receive new configs if any. */
            mstime_t wait_time = SENTINEL_PUBLISH_PERIOD*4;

            if (!(ri->flags & SRI_PROMOTED) &&
                 sentinelMasterLooksSane(ri->master) &&
                 sentinelRedisInstanceNoDownFor(ri,wait_time) &&
                 mstime() - ri->role_reported_time > wait_time)
            {
                int retval = sentinelSendSlaveOf(ri,
                        ri->master->addr->ip,
                        ri->master->addr->port);
                if (retval == C_OK)
                    sentinelEvent(LL_NOTICE,"+convert-to-slave",ri,"%@");
            }
        }
    }
   ...
}

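This if statement handles the case where a former slave has become a master, that is, the result of the operation in the SEND_SLAVEOF_NOONE state. Now look inside it. Line 8 of the listing starts another if statement whose condition spans lines 8 through 11. The condition on line 8 (SRI_PROMOTED) checks whether the redis server that returned this info reply is the one chosen earlier in the SELECT_SLAVE state. The SRI_PROMOTED flag appeared in sentinelFailoverSelectSlave, mentioned in document (32) when analyzing SELECT_SLAVE. The condition on line 9 (SRI_FAILOVER_IN_PROGRESS) checks whether a failover is in progress for this master. Finally, lines 10 and 11 check whether the failover state is WAIT_PROMOTION, which ties back to the last failover state covered in document (33).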

If the condition holds, the preceding failover steps succeeded and the selected slave has become a master. The body is mostly assignments; the most important is line 19, which changes failover_state to SENTINEL_FAILOVER_STATE_RECONF_SLAVES.

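Last is the else branch at line 31, which handles what happens after a timeout. If no info reply confirming the promotion arrives within the WAIT_PROMOTION timeout, the Sentinel aborts the failover using the method analyzed in (33). If the info reply then arrives after the abort, the else branch handles it: once the checks at lines 37 through 40 pass, the Sentinel calls sentinelSendSlaveOf to turn the instance back into a slave of the current master and logs +convert-to-slave.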
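The two branches can be condensed into a small decision function. This is only a sketch under simplifying assumptions: the FLAG_* macros, the enums, the two structs and on_info_role are hypothetical stand-ins, and the single waited_long_enough field collapses the three checks in the real else branch (sentinelMasterLooksSane, sentinelRedisInstanceNoDownFor, and the role_reported_time comparison).

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the Sentinel flags and states. */
#define FLAG_SLAVE (1 << 0)
#define FLAG_PROMOTED (1 << 1)
#define FLAG_FAILOVER_IN_PROGRESS (1 << 2)

enum failover_state { STATE_NONE, STATE_WAIT_PROMOTION, STATE_RECONF_SLAVES };
enum role_action { ACTION_NONE, ACTION_ADVANCE_FAILOVER, ACTION_CONVERT_TO_SLAVE };

struct master_view { int flags; enum failover_state state; };
struct instance_view {
    int flags;
    bool reports_master;      /* role parsed from the INFO reply */
    bool waited_long_enough;  /* assumption: all sanity/wait checks passed */
};

/* Decide what to do when an instance recorded as a slave reports a role. */
enum role_action on_info_role(const struct instance_view *ri,
                              struct master_view *m) {
    /* Only a recorded slave that now reports "master" is interesting. */
    if (!(ri->flags & FLAG_SLAVE) || !ri->reports_master) return ACTION_NONE;

    if ((ri->flags & FLAG_PROMOTED) &&
        (m->flags & FLAG_FAILOVER_IN_PROGRESS) &&
        m->state == STATE_WAIT_PROMOTION) {
        /* Promotion confirmed: move the state machine forward. */
        m->state = STATE_RECONF_SLAVES;
        return ACTION_ADVANCE_FAILOVER;
    }
    /* Unexpected master: force it back to being a slave. */
    if (!(ri->flags & FLAG_PROMOTED) && ri->waited_long_enough)
        return ACTION_CONVERT_TO_SLAVE;
    return ACTION_NONE;
}
```

Keeping the decision separate from the side effects (events, config flush, SLAVEOF) is a choice made for this sketch only; the real function performs them inline.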

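Once the new master has been chosen, a master-slave cluster still has to be built around it. This happens in the next state, RECONF_SLAVES. The method executed in this state is as follows:
[Figure: code executed in the RECONF_SLAVES state (image not preserved)]

It calls sentinelFailoverReconfNextSlave, whose content is as follows: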

/* Send SLAVE OF <new master address> to all the remaining slaves that
 * still don't appear to have the configuration updated. */
void sentinelFailoverReconfNextSlave(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    int in_progress = 0;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG))
            in_progress++;
    }
    dictReleaseIterator(di);

    di = dictGetIterator(master->slaves);
    while(in_progress < master->parallel_syncs &&
          (de = dictNext(di)) != NULL)
    {
        sentinelRedisInstance *slave = dictGetVal(de);
        int retval;

        /* Skip the promoted slave, and already configured slaves. */
        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;

        /* If too much time elapsed without the slave moving forward to
         * the next state, consider it reconfigured even if it is not.
         * Sentinels will detect the slave as misconfigured and fix its
         * configuration later. */
        if ((slave->flags & SRI_RECONF_SENT) &&
            (mstime() - slave->slave_reconf_sent_time) >
            SENTINEL_SLAVE_RECONF_TIMEOUT)
        {
            sentinelEvent(LL_NOTICE,"-slave-reconf-sent-timeout",slave,"%@");
            slave->flags &= ~SRI_RECONF_SENT;
            slave->flags |= SRI_RECONF_DONE;
        }

        /* Nothing to do for instances that are disconnected or already
         * in RECONF_SENT state. */
        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)) continue;
        if (slave->link->disconnected) continue;

        /* Send SLAVEOF <new master>. */
        retval = sentinelSendSlaveOf(slave,
                master->promoted_slave->addr->ip,
                master->promoted_slave->addr->port);
        if (retval == C_OK) {
            slave->flags |= SRI_RECONF_SENT;
            slave->slave_reconf_sent_time = mstime();
            sentinelEvent(LL_NOTICE,"+slave-reconf-sent",slave,"%@");
            in_progress++;
        }
    }
    dictReleaseIterator(di);

    /* Check if all the slaves are reconfigured and handle timeout. */
    sentinelFailoverDetectEnd(master);
}

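As analyzed earlier, this state has to rebuild a master-slave cluster. Recall that setting up replication in redis is simple: sending the slaveof command is enough. So this method is simple too: it iterates over the slaves and sends each one slaveof, with at most parallel_syncs slaves being reconfigured at the same time.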

Now let's go through sentinelFailoverReconfNextSlave:

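First, lines 8 to 15 iterate over master->slaves (this field appeared when we analyzed how Sentinel discovers slaves; every slave the Sentinel discovers is stored there) and count the slaves whose reconfiguration is already in progress, i.e. those flagged SRI_RECONF_SENT or SRI_RECONF_INPROG. Then lines 17 to 56 iterate over the slaves again with a while loop, skip those that need no command, and call sentinelSendSlaveOf at line 46 for the rest. We analyzed that method in (33); it sends the slaveof command to the given server.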
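The two-pass structure (count what is in flight, then send only while under the limit) can be sketched on a plain array. The F_* macros, struct replica and reconf_next_replicas are hypothetical names for illustration. Setting F_RECONF_SENT stands in for a successful SLAVEOF, and the SENTINEL_SLAVE_RECONF_TIMEOUT handling is left out.

```c
#include <stddef.h>

/* Hypothetical flag bits mirroring the SRI_* flags used in the listing. */
#define F_PROMOTED (1 << 0)
#define F_RECONF_SENT (1 << 1)
#define F_RECONF_INPROG (1 << 2)
#define F_RECONF_DONE (1 << 3)

struct replica { int flags; int disconnected; };

/* Mark up to (parallel_syncs - in_progress) replicas as RECONF_SENT.
 * Returns how many replicas were newly marked in this call. */
int reconf_next_replicas(struct replica *replicas, size_t n,
                         int parallel_syncs) {
    int in_progress = 0, sent = 0;

    /* Pass 1: count replicas whose reconfiguration is already in flight. */
    for (size_t i = 0; i < n; i++)
        if (replicas[i].flags & (F_RECONF_SENT | F_RECONF_INPROG))
            in_progress++;

    /* Pass 2: send to eligible replicas while under the limit. */
    for (size_t i = 0; i < n && in_progress < parallel_syncs; i++) {
        struct replica *r = &replicas[i];
        if (r->flags & (F_PROMOTED | F_RECONF_DONE)) continue;
        if (r->flags & (F_RECONF_SENT | F_RECONF_INPROG)) continue;
        if (r->disconnected) continue;
        r->flags |= F_RECONF_SENT; /* stands in for a successful SLAVEOF */
        in_progress++;
        sent++;
    }
    return sent;
}
```

With parallel_syncs set to 1, repeated calls reconfigure one replica at a time: a second call sends nothing until the first replica has moved on to the done state.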

Finally, line 59 calls sentinelFailoverDetectEnd, which checks whether all the servers have been configured. Its content is as follows:

void sentinelFailoverDetectEnd(sentinelRedisInstance *master) {
    int not_reconfigured = 0, timeout = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t elapsed = mstime() - master->failover_state_change_time;

    /* We can't consider failover finished if the promoted slave is
     * not reachable. */
    if (master->promoted_slave == NULL ||
        master->promoted_slave->flags & SRI_S_DOWN) return;

    /* The failover terminates once all the reachable slaves are properly
     * configured. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;
        if (slave->flags & SRI_S_DOWN) continue;
        not_reconfigured++;
    }
    dictReleaseIterator(di);

    /* Force end of failover on timeout. */
    if (elapsed > master->failover_timeout) {
        not_reconfigured = 0;
        timeout = 1;
        sentinelEvent(LL_WARNING,"+failover-end-for-timeout",master,"%@");
    }

    if (not_reconfigured == 0) {
        sentinelEvent(LL_WARNING,"+failover-end",master,"%@");
        master->failover_state = SENTINEL_FAILOVER_STATE_UPDATE_CONFIG;
        master->failover_state_change_time = mstime();
    }

    /* If I'm the leader it is a good idea to send a best effort SLAVEOF
     * command to all the slaves still not reconfigured to replicate with
     * the new master. */
    if (timeout) {
        dictIterator *di;
        dictEntry *de;

        di = dictGetIterator(master->slaves);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *slave = dictGetVal(de);
            int retval;

            if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE|SRI_RECONF_SENT)) continue;
            if (slave->link->disconnected) continue;

            retval = sentinelSendSlaveOf(slave,
                    master->promoted_slave->addr->ip,
                    master->promoted_slave->addr->port);
            if (retval == C_OK) {
                sentinelEvent(LL_NOTICE,"+slave-reconf-sent-be",slave,"%@");
                slave->flags |= SRI_RECONF_SENT;
            }
        }
        dictReleaseIterator(di);
    }
}

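The main job of this method is as described above. The key part starts at line 31: once no reachable slave remains unconfigured, or the failover timeout has forced the count to zero, failover_state is set to SENTINEL_FAILOVER_STATE_UPDATE_CONFIG.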
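The end-of-failover test reduces to a count plus a timeout override, which the following sketch isolates. The R_* macros and failover_can_end are hypothetical names; the check that the promoted slave is reachable and the best-effort SLAVEOF sent after a timeout are both omitted.

```c
#include <stddef.h>

/* Hypothetical flag bits mirroring the SRI_* flags used in the listing. */
#define R_PROMOTED (1 << 0)
#define R_RECONF_DONE (1 << 1)
#define R_S_DOWN (1 << 2)

/* Returns 1 when the failover may move on to the UPDATE_CONFIG state.
 * *timed_out is set to 1 if the end was forced by the timeout. */
int failover_can_end(const int *replica_flags, size_t n,
                     long long elapsed_ms, long long failover_timeout_ms,
                     int *timed_out) {
    int not_reconfigured = 0;

    for (size_t i = 0; i < n; i++) {
        /* Promoted or already reconfigured replicas do not block the end. */
        if (replica_flags[i] & (R_PROMOTED | R_RECONF_DONE)) continue;
        /* Unreachable replicas do not block the end either. */
        if (replica_flags[i] & R_S_DOWN) continue;
        not_reconfigured++;
    }

    /* On timeout the failover ends regardless of pending replicas. */
    *timed_out = elapsed_ms > failover_timeout_ms;
    if (*timed_out) not_reconfigured = 0;

    return not_reconfigured == 0;
}
```

Note that a replica flagged as subjectively down does not hold the failover open, which matches the comment in the listing that the failover terminates once all the reachable slaves are configured.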
