ZooKeeper leader election mechanism (based on the official documentation and source code)

Latest recommended article published on 2025-03-08 09:40:26

zonahaha

Tags: zookeeper, leader, election

Article link: https://blog.csdn.net/zonahaha/article/details/81535718


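During an interview at Alibaba, the interviewer asked me how the remaining followers elect a new leader when the ZooKeeper leader goes down. I could not answer at the time, so here is a summary.

Nearly every write-up online gives the same account: three election implementations, LeaderElection, FastLeaderElection and AuthFastLeaderElection. After comparing the source code with those blog analyses I still found some discrepancies, so I decided to dig into it myself.

I mainly relied on two sections of the official documentation about leader election, starting with: Leader Election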

Leader Election

A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each client creates a child znode "/election/n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any one previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader.

That's not all, though. It is important to watch for failures of the leader, so that a new client arises as the new leader in the case the current leader fails. A trivial solution is to have all application processes watching upon the current smallest znode, and checking if they are the new leader when the smallest znode goes away (note that the smallest znode will go away if the leader fails because the node is ephemeral). But this causes a herd effect: upon failure of the current leader, all other processes receive a notification, and execute getChildren on "/election" to obtain the current list of children of "/election". If the number of clients is large, it causes a spike on the number of operations that ZooKeeper servers have to process. To avoid the herd effect, it is sufficient to watch for the next znode down on the sequence of znodes. If a client receives a notification that the znode it is watching is gone, then it becomes the new leader in the case that there is no smaller znode. Note that this avoids the herd effect by not having all clients watching the same znode.

Here's the pseudo code:

Let ELECTION be a path of choice of the application. To volunteer to be a leader:

1. Create znode z with path "ELECTION/n_" with both SEQUENCE and EPHEMERAL flags;
2. Let C be the children of "ELECTION", and i be the sequence number of z;
3. Watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Upon receiving a notification of znode deletion:

1. Let C be the new set of children of ELECTION;
2. If z is the smallest node in C, then execute leader procedure;
3. Otherwise, watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Note that the znode having no preceding znode on the list of children does not imply that the creator of this znode is aware that it is the current leader. Applications may consider creating a separate znode to acknowledge that the leader has executed the leader procedure.
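As a rough illustration of the decision each client makes in the pseudocode above, here is a minimal Java sketch. It deliberately uses no ZooKeeper client: the child names are passed in as plain strings, and the class and method names (ElectionRecipeSketch, sequenceOf, nodeToWatch) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Sketch of the decision step of the election recipe; no real ZooKeeper client is involved. */
class ElectionRecipeSketch {

    /** Extracts the sequence number after the last '_', e.g. "n_0000000003" -> 3. */
    static long sequenceOf(String child) {
        return Long.parseLong(child.substring(child.lastIndexOf('_') + 1));
    }

    /**
     * Returns the child to watch: the one with the largest sequence number j
     * such that j < i, where i is our own sequence number. An empty result
     * means our znode is the smallest, so we should run the leader procedure.
     */
    static Optional<String> nodeToWatch(List<String> children, String self) {
        long i = sequenceOf(self);
        return children.stream()
                .filter(c -> sequenceOf(c) < i)
                .max(Comparator.comparingLong(ElectionRecipeSketch::sequenceOf));
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>(
                List.of("n_0000000002", "n_0000000000", "n_0000000001"));
        String self = "n_0000000002";

        System.out.println(nodeToWatch(children, self)); // Optional[n_0000000001]

        // The watched znode disappears, but a smaller one still exists: re-watch.
        children.remove("n_0000000001");
        System.out.println(nodeToWatch(children, self)); // Optional[n_0000000000]

        // The leader's znode disappears: nothing smaller is left, so we lead.
        children.remove("n_0000000000");
        System.out.println(nodeToWatch(children, self)); // Optional.empty
    }
}
```

In a real client the list would come from getChildren on the election path and the returned name would be the znode to set a watch on; that wiring is left out here.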

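In short, this recipe is a simple way to do leader election using the SEQUENCE|EPHEMERAL flags when creating the znodes that represent the clients' "proposals". The idea is to pick a znode, say "/election", and have each client create a child znode "/election/n_" with both flags set. With the sequence flag, ZooKeeper automatically appends a sequence number larger than any previously appended to a child of "/election", and the process that created the znode with the smallest sequence number is the leader.

To detect leader failure, the trivial approach is to have every application process watch the current smallest znode and check whether it has become the new leader when that znode disappears. This causes a herd effect: as soon as the current leader fails, all other processes receive a notification and call getChildren on "/election" to fetch the current list of children. With a large number of clients, the ZooKeeper servers see a spike in the number of operations they must process. To avoid this, it is sufficient for each client to watch only the next znode down in the sequence. When a client is notified that the znode it is watching is gone, it becomes the new leader if no smaller znode remains. The herd effect is avoided because the clients no longer all watch the same znode.

The cluster management section is also relevant here:

How do the other nodes learn that a node has died? Each server creates an EPHEMERAL node under a common parent znode in ZooKeeper, then calls getChildren(String path, boolean watch) on that parent with watch set to true. Because the nodes are EPHEMERAL, when a server dies its node is deleted, the watch set through getChildren fires, and the other servers learn that a server has died. When the leader dies, its ephemeral node is deleted as well, so a different node now has the smallest sequence number; that node becomes the new leader, which gives dynamic leader selection.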
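The effect can be mimicked without a ZooKeeper server. The sketch below is only a simulation under my own assumptions: a TreeMap stands in for the EPHEMERAL|SEQUENCE children of the parent znode, and sessionExpired plays the role of ZooKeeper deleting a dead server's node. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a parent znode whose children are ephemeral sequential nodes. */
class EphemeralMembershipSketch {
    // sequence number -> server name, kept sorted so the smallest is first
    private final TreeMap<Long, String> children = new TreeMap<>();
    private long nextSequence = 0;

    /** A server joins: it gets the next sequence number, like a SEQUENCE znode. */
    long join(String server) {
        long sequence = nextSequence++;
        children.put(sequence, server);
        return sequence;
    }

    /** A server dies: its ephemeral node disappears together with its session. */
    void sessionExpired(long sequence) {
        children.remove(sequence);
    }

    /** The leader is whoever owns the smallest remaining sequence number. */
    String currentLeader() {
        Map.Entry<Long, String> first = children.firstEntry();
        return first == null ? null : first.getValue();
    }

    public static void main(String[] args) {
        EphemeralMembershipSketch group = new EphemeralMembershipSketch();
        long a = group.join("server-a");
        group.join("server-b");
        group.join("server-c");
        System.out.println(group.currentLeader()); // server-a

        group.sessionExpired(a);                   // the leader goes down
        System.out.println(group.currentLeader()); // server-b
    }
}
```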

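Pseudocode:

To volunteer to be a leader:

1. Create a znode z with path "ELECTION/n_" with both the SEQUENCE and EPHEMERAL flags;
2. Let C be the children of "ELECTION", and i be the sequence number of z;
3. Watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Upon receiving a notification of znode deletion:

1. Let C be the new set of children of ELECTION;
2. If z is the smallest node in C, execute the leader procedure;
3. Otherwise, watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;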

The official documentation also has a section on leader activation: Leader Activation

Leader activation

Leader activation includes leader election. We currently have two leader election algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the following holds:

The leader has seen the highest zxid of all the followers.

A quorum of servers have committed to following the leader.

Of these two requirements only the first, the highest zxid among the followers, needs to hold for correct operation. The second requirement, a quorum of followers, just needs to hold with high probability. We are going to recheck the second requirement, so if a failure happens during or after the leader election and quorum is lost, we will recover by abandoning leader activation and running another election.

After leader election a single server will be designated as a leader and start waiting for followers to connect. The rest of the servers will try to connect to the leader. The leader will sync up with followers by sending any proposals they are missing, or if a follower is missing too many proposals, it will send a full snapshot of the state to the follower.

There is a corner case in which a follower that has proposals, U, not seen by a leader arrives. Proposals are seen in order, so the proposals of U will have zxids higher than zxids seen by the leader. The follower must have arrived after the leader election, otherwise the follower would have been elected leader given that it has seen a higher zxid. Since committed proposals must be seen by a quorum of servers, and a quorum of servers that elected the leader did not see U, the proposals of U have not been committed, so they can be discarded. When the follower connects to the leader, the leader will tell the follower to discard U.

A new leader establishes a zxid to start using for new proposals by getting the epoch, e, of the highest zxid it has seen and setting the next zxid to use to be (e+1, 0). After the leader syncs with a follower, it will propose a NEW_LEADER proposal. Once the NEW_LEADER proposal has been committed, the leader will activate and start receiving and issuing proposals.

It all sounds complicated but here are the basic rules of operation during leader activation:

A follower will ACK the NEW_LEADER proposal after it has synced with the leader.

A follower will only ACK a NEW_LEADER proposal with a given zxid from a single server.

A new leader will COMMIT the NEW_LEADER proposal when a quorum of followers have ACKed it.

A follower will commit any state it received from the leader when the NEW_LEADER proposal is COMMITTED.

A new leader will not accept new proposals until the NEW_LEADER proposal has been COMMITTED.

If leader election terminates erroneously, we don't have a problem since the NEW_LEADER proposal will not be committed since the leader will not have quorum. When this happens, the leader and any remaining followers will timeout and go back to leader election.

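Based on this description, the situation should be as follows.

ZooKeeper currently has two leader election algorithms, LeaderElection and FastLeaderElection (AuthFastLeaderElection is a variant of FastLeaderElection that uses UDP and lets servers perform a simple form of authentication to avoid IP spoofing). ZooKeeper messaging does not care about the exact method used to elect a leader, as long as the following conditions hold:

The leader has seen the highest zxid of all the followers. (The zxid is the ZooKeeper Transaction Id, which exposes the order of all changes to ZooKeeper: every change to the ZooKeeper state receives a unique stamp in the form of a zxid, and if zxid1 is smaller than zxid2, then zxid1 happened before zxid2. ZooKeeper tracks time in other ways too: version numbers track changes to a node, and servers use ticks rather than real time to define the timing of events such as status uploads and session timeouts.)
A quorum of servers have committed to following this leader.

Of these two conditions, only the first must hold for correctness; the second only needs to hold with high probability. The second is rechecked later, so if a failure occurs during or after the leader election and quorum is lost, recovery consists of abandoning leader activation and running another election.

After leader election, a single server is designated as the leader and starts waiting for followers to connect, while the remaining servers try to connect to it. The leader syncs up with the followers by sending them any proposals they are missing; if a follower is missing too many proposals, the leader sends it a full snapshot of the state instead.

There is a corner case in which a follower arrives holding proposals, U, that the leader has not seen. Proposals are seen in order, so the proposals in U have zxids higher than any zxid seen by the leader. Such a follower must have arrived after the leader election, since otherwise it would have been elected leader on account of its higher zxid. Because committed proposals must be seen by a quorum of servers, and the quorum that elected the leader did not see U, the proposals in U were never committed and can be discarded. When the follower connects to the leader, the leader tells it to discard U.

A new leader establishes the zxid to use for new proposals by taking the epoch, e, of the highest zxid it has seen and setting the next zxid to (e+1, 0). After the leader has synced with a follower, it proposes a NEW_LEADER proposal. Once the NEW_LEADER proposal has been committed, the leader activates and starts receiving and issuing proposals.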
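To make the (e+1, 0) step concrete, here is a small Java sketch. It assumes the usual layout of a zxid as a 64-bit number with the epoch in the high 32 bits and a counter in the low 32 bits; that layout is not spelled out in the passage above, and the class and method names are my own, not ZooKeeper's.

```java
/** Sketch of zxid arithmetic, assuming epoch = high 32 bits and counter = low 32 bits. */
class ZxidSketch {

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** The first zxid a new leader would use: (e + 1, 0), where e is the highest epoch it has seen. */
    static long firstZxidOfNewLeader(long highestSeenZxid) {
        return makeZxid(epochOf(highestSeenZxid) + 1, 0);
    }

    public static void main(String[] args) {
        long highestSeen = makeZxid(4, 17);           // epoch 4, 17th proposal
        long next = firstZxidOfNewLeader(highestSeen);
        System.out.println(Long.toHexString(next));   // 500000000
        System.out.println(next > highestSeen);       // true: new proposals sort after old ones
    }
}
```

Because the epoch sits in the high bits, every zxid issued by the new leader compares greater than any zxid from an earlier epoch, which is what keeps the ordering consistent across leader changes.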

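It sounds complicated, but these are the basic rules of operation during leader activation:

A follower ACKs the NEW_LEADER proposal after it has synced with the leader.
A follower only ACKs a NEW_LEADER proposal with a given zxid from a single server.
A new leader COMMITs the NEW_LEADER proposal once a quorum of followers have ACKed it.
A follower commits any state it received from the leader once the NEW_LEADER proposal is committed.
A new leader does not accept new proposals until the NEW_LEADER proposal has been committed.

If leader election terminates erroneously, there is no problem: the leader will not have a quorum, so the NEW_LEADER proposal will not be committed. When this happens, the leader and any remaining followers time out and go back to leader election.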

// Default election algorithm; 3 selects FastLeaderElection in the switch below.
protected int electionAlg = 3;

protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;

    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
    case 1:
        le = new AuthFastLeaderElection(this);
        break;
    case 2:
        le = new AuthFastLeaderElection(this, true);
        break;
    case 3:
        // FastLeaderElection needs a connection manager whose listener is running.
        qcm = createCnxnManager();
        QuorumCnxManager.Listener listener = qcm.listener;
        if(listener != null){
            listener.start();
            FastLeaderElection fle = new FastLeaderElection(this, qcm);
            fle.start();
            le = fle;
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;
    default:
        assert false;
    }
    return le;
}

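Judging from these two snippets, the default election algorithm is indeed the third one, FastLeaderElection.

Every node in the cluster has a state: LOOKING, FOLLOWING, LEADING or OBSERVING. Every node starts in the LOOKING state. After an election, a node that was not elected leader is FOLLOWING, a node that does not take part in elections is OBSERVING, and the leader is LEADING.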

public enum ServerState {
    LOOKING, FOLLOWING, LEADING, OBSERVING;
}

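server.A=B:C:D: A is the server's ID number, B is its IP address, C is the port that the server uses to connect to the leader, and D is the port the servers use to talk to each other to elect a new leader, for example when the current leader goes down.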
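As a small illustration of the server.A=B:C:D format, the following Java sketch splits such a line into its four parts. The parser is my own toy code, not ZooKeeper's configuration loader, and the address and ports in the example (192.168.0.11, 2888, 3888) are purely illustrative values.

```java
/** Toy parser for a "server.A=B:C:D" line; not ZooKeeper's own config code. */
class ServerLineSketch {
    final long id;           // A: server ID
    final String host;       // B: IP address or host name
    final int quorumPort;    // C: port used to connect to the leader
    final int electionPort;  // D: port used for leader election

    ServerLineSketch(long id, String host, int quorumPort, int electionPort) {
        this.id = id;
        this.host = host;
        this.quorumPort = quorumPort;
        this.electionPort = electionPort;
    }

    static ServerLineSketch parse(String line) {
        String[] keyValue = line.trim().split("=", 2);
        if (keyValue.length != 2 || !keyValue[0].startsWith("server.")) {
            throw new IllegalArgumentException("expected server.A=B:C:D but got: " + line);
        }
        long id = Long.parseLong(keyValue[0].substring("server.".length()));
        String[] parts = keyValue[1].split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected B:C:D but got: " + keyValue[1]);
        }
        return new ServerLineSketch(id, parts[0],
                Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        ServerLineSketch s = parse("server.1=192.168.0.11:2888:3888");
        System.out.println(s.id + " " + s.host + " " + s.quorumPort + " " + s.electionPort);
        // prints "1 192.168.0.11 2888 3888"
    }
}
```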

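Before the election algorithm starts, every node begins listening on its configured port (server=127.0.0.1:20882). Here 20882 is the port used for the election.

FastLeaderElection has an inner class, Messenger, that starts two threads, WorkerReceiver and WorkerSender: one receives messages and the other sends them, so the sending and receiving logic run asynchronously.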

setCurrentVote(makeLEStrategy().lookForLeader());

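The analysis here combines the source code with this article, which seems to match the code fairly well.

1. If the server that sent the notification is in the LOOKING state:

If the notification's election epoch is greater than the local logical clock, a new round of election has started. The server updates its own logical clock, clears the votes received so far, and checks whether the sender's vote beats its own initial vote: if so it adopts the sender's vote, otherwise it falls back to its own values. In either case the preferred vote is the one with the larger data id (zxid) or, on a tie, the larger leader id. The result is then broadcast to the other servers.
If the notification's election epoch is smaller than the logical clock, the sender is still in an earlier election round and the notification is ignored.
In the last case the two clocks are equal: totalOrderPredicate is called to decide whether the local vote needs updating, and if it was updated, it is broadcast to the other servers.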

 switch (n.state) {
                    case LOOKING:
                        if (getInitLastLoggedZxid() == -1) {
                            LOG.debug("Ignoring notification as our zxid is -1");
                            break;
                        }
                        if (n.zxid == -1) {
                            LOG.debug("Ignoring notification from member with -1 zxid" + n.sid);
                            break;
                        }
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }
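The comparison done by totalOrderPredicate in the snippet above can be sketched roughly as follows. This is a simplified version written from my reading of the code: it compares the peer epoch first, then the zxid, then the server id, and it leaves out any additional checks the real method performs. The class and method names are hypothetical.

```java
/** Simplified sketch of the vote ordering: epoch first, then zxid, then server id. */
class VoteOrderSketch {

    /** Returns true if the new vote should replace the current one. */
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // a later epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // same epoch: more recent data wins
        }
        return newId > curId;             // full tie on data: larger server id wins
    }

    public static void main(String[] args) {
        System.out.println(wins(1, 5, 2, 3, 9, 1)); // true: higher epoch beats higher zxid
        System.out.println(wins(1, 9, 1, 3, 5, 1)); // true: same epoch, higher zxid
        System.out.println(wins(3, 5, 1, 1, 5, 1)); // true: same epoch and zxid, higher id
        System.out.println(wins(1, 5, 1, 1, 5, 1)); // false: identical votes
    }
}
```

Since every server applies the same total order, all servers that see the same set of votes converge on the same candidate.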

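The server then checks whether it has received the votes of all servers. If it has, and no new messages remain in the receive queue, it checks whether it is the leader itself: if so it switches to the LEADING state, otherwise to FOLLOWING, and the election ends. Otherwise it blocks for a while (finalizeWait) to see whether the leader proposal needs updating, and continues the election.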
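The "more than half" test behind termPredicate can be illustrated with a short Java sketch. It is a simplification under my own assumptions: votes are reduced to the proposed leader id only, and a plain majority of the ensemble counts as a quorum. The names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a simple-majority check over received votes. */
class QuorumSketch {

    /**
     * votes maps a server id to the leader id that server voted for.
     * Returns true if strictly more than half of the ensemble backs proposedLeader.
     */
    static boolean hasMajority(Map<Long, Long> votes, long proposedLeader, int ensembleSize) {
        long supporters = votes.values().stream()
                .filter(leader -> leader == proposedLeader)
                .count();
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 3L);
        recvset.put(2L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // false: 2 of 5 is not a majority

        recvset.put(3L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // true: 3 of 5
    }
}
```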

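2. If the server that sent the notification is in the FOLLOWING or LEADING state: consider all notifications from the same election round together.

If the notification's election epoch equals the logical clock, the vote is saved into recvset. If more than half of the votes support the leader named in the notification, and checkLeader confirms that leader, this server becomes LEADING when it is that leader and FOLLOWING otherwise, and the election ends.
If they are not equal, the vote is added to the outofelection set, and outofelection is used to decide whether the election can end. If it can, the server sets its own state in the same way and ends the election.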

                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify that
                         * a majority are following the same leader.
                         * Only peer epoch is used to check that the votes come
                         * from the same ensemble. This is because there is at
                         * least one corner case in which the ensemble can be
                         * created with inconsistent zxid and election epoch
                         * info. However, given that only one ensemble can be
                         * running at a single point in time and that each 
                         * epoch is used only once, using only the epoch to 
                         * compare the votes is sufficient.
                         * 
                         * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                         */
                        outofelection.put(n.sid, new Vote(n.leader, 
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                        if (termPredicate(outofelection, new Vote(n.leader,
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
                        break;
                    }

That author's analysis is really clear; mine looks crude next to it. Lesson learned.

Here is the flowchart he put together. Impressive work.