ZooKeeper Internals (Part 4)

Latest recommended article published on 2021-10-23 20:35:18

李大洲


Category: ZooKeeper. Tags: ZooKeeper Internals, Leader election algorithm analysis

Article link: https://blog.csdn.net/lidazhou/article/details/99871590


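VI. Leader Election

1. Overview

1) Leader election at server startup:

Assume a cluster of three servers, server1, server2, and server3, whose myid values are 1, 2, and 3 respectively.

① Each server casts a vote. At minimum, a vote contains the myid and the ZXID of the server being proposed as Leader. Initially every server votes for itself, so server1's vote is (1, 0), server2's is (2, 0), and server3's is (3, 0). Each server then sends its vote to all other machines in the cluster.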

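② Receive votes from the other servers

Each server receives the votes sent by the other servers in the cluster.

③ Process votes

For every vote it receives, a server runs a head-to-head comparison (PK) between the external vote and its own. The rule: check the ZXID first, and the vote with the larger ZXID wins; if the ZXIDs are equal, compare myid, and the vote with the larger myid wins. If the external vote wins, the server updates its own vote to match and sends the updated vote out again.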

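④ Count votes

After each round of voting, every server tallies the votes and checks whether more than half of the machines have received the same vote. With three servers, two matching votes are enough.

⑤ Change server state

Once the Leader is determined, each server updates its state: a Follower changes to FOLLOWING, and the Leader changes to LEADING.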
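The five steps above can be sketched as a tiny simulation. This is only an illustrative model under simplifying assumptions (one idealized exchange round, every server sees every vote, a vote is just a (myid, ZXID) pair); the class and method names `StartupElectionSketch`, `SimpleVote`, `beats`, `exchange`, and `majorityLeader` are hypothetical and do not come from ZooKeeper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the startup election; not ZooKeeper code. */
class StartupElectionSketch {

    /** A vote in the simplified (myid, ZXID) form used in the text. */
    static final class SimpleVote {
        final long myid;
        final long zxid;

        SimpleVote(long myid, long zxid) {
            this.myid = myid;
            this.zxid = zxid;
        }

        boolean sameAs(SimpleVote other) {
            return myid == other.myid && zxid == other.zxid;
        }
    }

    /** Comparison rule: the larger ZXID wins; on a tie, the larger myid wins. */
    static boolean beats(SimpleVote external, SimpleVote own) {
        if (external.zxid != own.zxid) {
            return external.zxid > own.zxid;
        }
        return external.myid > own.myid;
    }

    /** One idealized exchange round: every server adopts the best vote it has seen. */
    static List<SimpleVote> exchange(List<SimpleVote> current) {
        List<SimpleVote> next = new ArrayList<>();
        for (SimpleVote own : current) {
            SimpleVote best = own;
            for (SimpleVote external : current) {
                if (beats(external, best)) {
                    best = external;
                }
            }
            next.add(best);
        }
        return next;
    }

    /** Returns the myid backed by more than half of the votes, or -1 if there is none. */
    static long majorityLeader(List<SimpleVote> votes) {
        for (SimpleVote candidate : votes) {
            int count = 0;
            for (SimpleVote v : votes) {
                if (v.sameAs(candidate)) {
                    count++;
                }
            }
            if (count > votes.size() / 2) {
                return candidate.myid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<SimpleVote> initial = Arrays.asList(
                new SimpleVote(1, 0), new SimpleVote(2, 0), new SimpleVote(3, 0));
        System.out.println("majority before exchange: " + majorityLeader(initial));
        System.out.println("majority after exchange: " + majorityLeader(exchange(initial)));
    }
}
```

With the three initial votes (1, 0), (2, 0), and (3, 0), no vote has a majority before the exchange; afterwards every server in this model has switched to (3, 0), so myid 3 is reported as the majority choice.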

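2) Leader election while the servers are running:

Once the machine hosting the Leader fails, the whole cluster temporarily stops serving requests and enters a new round of Leader election.

① Change state:

All non-Observer servers change their server state to LOOKING.

② Each server casts a vote

Each server generates its vote (myid, ZXID). In the first round it votes for itself and then sends that vote to every machine in the cluster.

③ Receive votes from the other servers

④ Process votes

⑤ Count votes

⑥ Change server state

Steps ③ through ⑥ proceed in the same way as in the startup election.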

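2. Implementation details of Leader election

1) Server states

The ServerState enum lists the four server states: LOOKING, FOLLOWING, LEADING, and OBSERVING.

2) Vote data structure

The Vote class defines the following fields: id, zxid, electionEpoch, peerEpoch, state, and version. Here id is the SID of the proposed Leader, zxid is that server's transaction ID, electionEpoch is the election round (the logical clock), peerEpoch is the epoch of the proposed Leader, and state is the sender's current server state.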
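To make the field list concrete, here is a minimal sketch of a vote-like value class. It is not the real Vote class: the name `VoteSketch`, the field types, the convenience constructor, and the choice to base equality on id, zxid, electionEpoch, and peerEpoch are all assumptions made for illustration.

```java
import java.util.Objects;

/** Simplified stand-in for the vote data structure; not the real ZooKeeper class. */
final class VoteSketch {

    /** The four server states named in the text. */
    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    final int version;          // format version of the vote (type assumed)
    final long id;              // SID of the proposed Leader
    final long zxid;            // transaction ID of the proposed Leader
    final long electionEpoch;   // election round (logical clock)
    final long peerEpoch;       // epoch of the proposed Leader
    final ServerState state;    // state of the server that sent the vote

    VoteSketch(int version, long id, long zxid,
               long electionEpoch, long peerEpoch, ServerState state) {
        this.version = version;
        this.id = id;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.peerEpoch = peerEpoch;
        this.state = state;
    }

    /** Convenience constructor (assumed): a vote cast while LOOKING, version 0. */
    VoteSketch(long id, long zxid, long electionEpoch, long peerEpoch) {
        this(0, id, zxid, electionEpoch, peerEpoch, ServerState.LOOKING);
    }

    /** Two votes are treated as equal when they propose the same Leader in the same round. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof VoteSketch)) {
            return false;
        }
        VoteSketch other = (VoteSketch) o;
        return id == other.id && zxid == other.zxid
                && electionEpoch == other.electionEpoch
                && peerEpoch == other.peerEpoch;
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, zxid, electionEpoch, peerEpoch);
    }

    @Override
    public String toString() {
        return "(" + id + ", 0x" + Long.toHexString(zxid) + ", round " + electionEpoch + ")";
    }
}
```

Leaving state and version out of the equality check means that votes sent by servers in different states can still be counted as agreeing on the same Leader.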

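3) QuorumCnxManager: network I/O during Leader election

① recvQueue: the queue of received messages

② queueSendMap: the outgoing message queues, grouped by the SID of the destination server

③ lastMessageSent: the most recently sent message, kept per SID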
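The relationship between these three collections can be sketched in memory as follows. This is a hypothetical model, not the QuorumCnxManager implementation: the class name `ElectionQueuesSketch`, the methods `toSend`, `drainOne`, and `received`, the queue capacities, and the drop-oldest policy when a send queue is full are all assumptions for illustration, and no real sockets are involved.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory model of the three collections listed above. */
class ElectionQueuesSketch {

    /** A received message tagged with the sender's server id. */
    static final class Message {
        final long sid;
        final ByteBuffer payload;

        Message(long sid, ByteBuffer payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    static final int SEND_CAPACITY = 1;    // assumed capacity per destination
    static final int RECV_CAPACITY = 100;  // assumed capacity of the receive queue

    final BlockingQueue<Message> recvQueue = new ArrayBlockingQueue<>(RECV_CAPACITY);
    final ConcurrentMap<Long, BlockingQueue<ByteBuffer>> queueSendMap = new ConcurrentHashMap<>();
    final ConcurrentMap<Long, ByteBuffer> lastMessageSent = new ConcurrentHashMap<>();

    /** Queues a message for one peer, creating that peer's queue on first use. */
    void toSend(long sid, ByteBuffer payload) {
        BlockingQueue<ByteBuffer> queue =
                queueSendMap.computeIfAbsent(sid, k -> new ArrayBlockingQueue<>(SEND_CAPACITY));
        while (!queue.offer(payload)) {
            queue.poll(); // queue full: drop the oldest pending message (assumed policy)
        }
    }

    /** Stands in for a sender thread: takes the next message for a peer and remembers it. */
    ByteBuffer drainOne(long sid) {
        BlockingQueue<ByteBuffer> queue = queueSendMap.get(sid);
        ByteBuffer next = (queue == null) ? null : queue.poll();
        if (next != null) {
            lastMessageSent.put(sid, next);
        }
        return next;
    }

    /** Stands in for a receiver thread: stores an incoming message for the election logic. */
    void received(long sid, ByteBuffer payload) {
        recvQueue.offer(new Message(sid, payload));
    }
}
```

Keeping one send queue per SID means a slow or unreachable peer only delays its own messages, and remembering the last message per SID makes it possible to resend the latest vote after a reconnect.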

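3. Core of the algorithm

The core logic lives in the lookForLeader method of the FastLeaderElection class. The steps below walk through this method.

1) The atomic variable logicalclock records the Leader election round; it is incremented by one first.

2) Initialize the vote: it contains leader, zxid, and epoch. At initialization, every server proposes itself as Leader.

3) Send the vote: done by the sendNotifications method.

4) Receive external votes: fetched with recvqueue.poll().

5) Check whether the received vote comes from a server in the LOOKING state. If so, compare the election round electionEpoch:

① The external vote's election round is greater than the internal one: set the local election round to the external value, clear the votes received so far, use totalOrderPredicate to compare the external vote with the server's initial vote and decide whether to change the internal vote, and finally send the internal vote out.

② The external vote's election round is smaller than the internal one: break immediately, so the vote is ignored.

③ The external vote's election round equals the internal one: proceed to the vote comparison (PK).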

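6) Vote comparison (PK):

The totalOrderPredicate method of FastLeaderElection only decides whether the internal vote needs to change. It considers three aspects in order: election round, ZXID, and SID. (In the source listing below, the calls pass n.peerEpoch and proposedEpoch as the epoch arguments.)

If the external vote's election round is greater than the internal one, the internal vote must change;

If the election rounds are equal, compare ZXIDs: if the external vote's ZXID is greater, the internal vote must change;

If the ZXIDs are also equal, compare SIDs: if the external vote's SID is greater, the internal vote must change.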
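The three-level comparison can be written as a small standalone function. This is a sketch of the rule as described here, not the ZooKeeper source: the names `VoteComparisonSketch` and `externalVoteWins` are hypothetical, and any additional checks the real method may perform are left out.

```java
/** Standalone sketch of the three-level vote comparison; not the ZooKeeper source. */
class VoteComparisonSketch {

    /**
     * Returns true when the external vote (new*) should replace the internal vote (cur*).
     * Order of comparison: epoch first, then ZXID, then server id.
     */
    static boolean externalVoteWins(long newId, long newZxid, long newEpoch,
                                    long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Same epoch and ZXID: the larger server id decides.
        System.out.println(externalVoteWins(3, 5, 1, 1, 5, 1));
        // A larger ZXID cannot make up for a smaller epoch.
        System.out.println(externalVoteWins(3, 9, 1, 1, 5, 2));
    }
}
```

Because the comparison is strict at every level, two identical votes never replace each other, which keeps servers from endlessly re-sending an unchanged vote.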

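7) Change the vote

If the external vote is found to be better than the internal one, the internal vote must change: the updateProposal method overwrites the internal vote with the external vote's information, and sendNotifications then sends the changed internal vote out.

8) Archive the vote

Whether or not the internal vote changed, the received vote is stored in recvset for archiving.

9) Count votes

Check whether more than half of the servers in the cluster agree with the current internal vote. If so, voting stops.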
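The majority check in step 9 can be sketched as a count over the archived votes. This is a simplified illustration, not the termPredicate implementation: the names `QuorumCountSketch`, `Ballot`, and `hasQuorum` are hypothetical, a plain "more than half" rule is assumed, and the map is assumed to also hold the server's own vote.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical majority check over archived votes; not the ZooKeeper source. */
class QuorumCountSketch {

    /** Minimal vote identity: proposed leader, its ZXID, and the election round. */
    static final class Ballot {
        final long leader;
        final long zxid;
        final long electionEpoch;

        Ballot(long leader, long zxid, long electionEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
        }

        boolean matches(Ballot other) {
            return leader == other.leader && zxid == other.zxid
                    && electionEpoch == other.electionEpoch;
        }
    }

    /**
     * Returns true when more than half of the ensemble's voting members
     * have sent a vote identical to the current proposal.
     * The map is keyed by the sender's SID, so each server counts at most once.
     */
    static boolean hasQuorum(Map<Long, Ballot> recvset, Ballot proposal, int ensembleSize) {
        int agreeing = 0;
        for (Ballot received : recvset.values()) {
            if (proposal.matches(received)) {
                agreeing++;
            }
        }
        return agreeing > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Ballot> recvset = new HashMap<>();
        Ballot proposal = new Ballot(3, 0, 1);
        recvset.put(1L, new Ballot(3, 0, 1));
        System.out.println(hasQuorum(recvset, proposal, 3)); // one of three agrees
        recvset.put(3L, new Ballot(3, 0, 1));
        System.out.println(hasQuorum(recvset, proposal, 3)); // two of three agree
    }
}
```

Keying the archive by SID matters: a server that changes its mind overwrites its earlier entry instead of being counted twice.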

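10) Update server state

First check whether this server is the elected Leader. If it is, set its state to LEADING; otherwise set it to FOLLOWING or OBSERVING as appropriate.

The source code is attached below: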

    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
            self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }
