The Fast Paxos Algorithm and Its Application in ZooKeeper

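First, thanks to Lei Ming (雷明), whose article gave me an initial understanding of the Fast Paxos algorithm (contact: leiming0571@gmail.com). The article was shared on Baidu Wenku, and thanks also go to Baidu Wenku for providing the platform.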


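1. The Paxos algorithm

1.1 The distributed consistency problem

A distributed system has multiple nodes (servers) that provide some service to clients. This brings at least two benefits: better performance, and fault tolerance, since the service as a whole keeps running when some of the servers go down.

But this raises a new question: what happens if several server nodes modify the same variable?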

 

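1.2 The Paxos algorithm

Paxos is a consensus algorithm based on message passing. It was proposed by Lamport in 1990 and is now widely used in distributed computing. Google's Chubby is built on it, and the Zab protocol used by Apache ZooKeeper is closely related to it. Paxos is often described as the only correct distributed consensus algorithm so far, with the other algorithms being refinements or simplifications of it; this is a widely quoted opinion rather than a proven fact.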

 

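1.3 The Fast Paxos algorithm

The problem: a distributed system needs to elect one leader node, and only that node may make decisions. How do we choose one leader among n nodes?

Approach 1: designate one fixed node as leader. Problem: what happens when that leader goes down?

Approach 2: give every node a unique id, compare the ids of all nodes, and make the node with the largest id the leader. Problem: how does a node find out which of the ids that are still alive is the largest?

Questions that have to be answered:

1. When should an election be started?

2. What happens to nodes that join while an election is in progress?

3. What happens when a node that crashed restarts and wants to start an election?

Things that have to be compared:

1. How recent the election round is.

2. How recent each node's data is (the election result).

3. When several nodes are alive in the same election round, which of them should be chosen as leader?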


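2. ZooKeeper's election algorithm

2.1 Overview

In a ZooKeeper cluster exactly one node is the leader and the others are followers. (There are also observer nodes, which do not vote in elections; we ignore them here and below.) Every update must go through the leader, and the leader and the followers keep their data synchronized and exchange heartbeats.

A client using ZooKeeper may be connected to a server acting as a follower or to the server acting as the leader.

The three roles are:

Leader: handles write requests; there is only one.

Follower: handles client requests and votes in elections.

Observer: does not vote in leader elections; only handles client requests.

The number of servers in a ZooKeeper cluster is fixed. Each node has a unique id that identifies it, and each server also has an IP address and port used for elections. All of this is in the configuration file. A concrete example: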

server.1=192.168.0.11:2888:3888

server.2=192.168.0.12:2888:3888

server.3=192.168.0.13:2888:3888

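There are three servers here, with ids 1, 2, and 3. Port 2888 is the port a node uses to exchange information with the leader, and port 3888 is the election port. During voting, a node's id identifies the candidate taking part in the election.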
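To make the structure of these entries concrete, here is a minimal sketch of parsing one line of the form server.N=host:port:port. ServerEntry and its field names are hypothetical helpers made up for this illustration; this is not how ZooKeeper itself reads its configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one parsed "server.N=host:port:port" entry (not a ZooKeeper class). */
class ServerEntry {
    final long id;           // the N in server.N
    final String host;
    final int quorumPort;    // node <-> leader traffic (2888 in the example)
    final int electionPort;  // leader election traffic (3888 in the example)

    ServerEntry(long id, String host, int quorumPort, int electionPort) {
        this.id = id;
        this.host = host;
        this.quorumPort = quorumPort;
        this.electionPort = electionPort;
    }

    private static final Pattern LINE =
            Pattern.compile("server\\.(\\d+)=([^:]+):(\\d+):(\\d+)");

    /** Parses one configuration line, or throws if it does not have the expected shape. */
    static ServerEntry parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a server entry: " + line);
        }
        return new ServerEntry(Long.parseLong(m.group(1)), m.group(2),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)));
    }
}
```

For example, ServerEntry.parse("server.1=192.168.0.11:2888:3888") yields id 1, host 192.168.0.11, quorum port 2888 and election port 3888.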

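Question: how is the leader determined?

Answer: the ZooKeeper ensemble elects it itself, and every server node except the observers takes part in the election. The benefit is that when the current leader goes down, the system can elect another node as leader.

ZooKeeper's election algorithm guarantees that as long as more than half of the nodes are alive, exactly one node will be elected leader.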

 

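2.1.1 Node states

A ZooKeeper node is in one of the following three states (ignoring observers):

LOOKING: the initial state; an election is in progress and no leader has been chosen yet.

LEADING: a leader has been chosen and this node is the leader.

FOLLOWING: a leader has been chosen and this node is a follower.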

 

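2.1.2 When an election happens

An election starts whenever any node enters the LOOKING state. A node enters LOOKING for one of these reasons:

1. The node has just started and puts itself into the election state.

2. The node has detected that the leader is down.

How does the leader know that a follower is still alive, and how does a follower know that the leader is still alive? The leader periodically sends ping messages to the followers, and the followers periodically send ping messages to the leader. When a follower finds that it can no longer reach the leader, it changes its own state to LOOKING and starts a new election round. While the ensemble is in election mode, the ZooKeeper service is unavailable.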

 

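2.1.3 Conditions for a node to become leader

To become leader, a node must receive at least n/2+1 votes (integer division), that is, votes from more than half of the n nodes. An implementation may also take other rules into account, such as node weights.

Why require at least n/2+1 nodes to agree? Because this guarantees that the node has the support of a majority. Each node can support only one candidate, so a node that receives at least n/2+1 votes necessarily has more votes than any other node.

This rule also means that if half or more of the nodes are down, ZooKeeper cannot elect a leader. A cluster of n nodes therefore tolerates the failure of fewer than half of its nodes, that is, at most (n-1)/2 nodes (integer division).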
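The arithmetic behind the majority rule can be checked with a few lines of code. This is only a sketch of the unweighted case; QuorumMath is a name made up for this example.

```java
/** Majority arithmetic for an ensemble of n voting nodes (unweighted case only). */
class QuorumMath {
    /** Smallest number of votes that is strictly more than half of n. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    /** Largest number of failed nodes that still leaves a majority alive. */
    static int toleratedFailures(int n) {
        return (n - 1) / 2;
    }
}
```

With 3 nodes the majority is 2 and one failure is tolerated; with 4 nodes the majority is 3 and still only one failure is tolerated; with 5 nodes the majority is 3 and two failures are tolerated. This is why ensembles are usually sized with an odd number of voting nodes.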

 

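2.1.4 Problems to solve

The goal of the election algorithm is to make sure that one unique leader is elected. This has two parts:

1. Some node must be elected leader.

2. That leader must be unique.

To achieve this, the following problems have to be solved:

1. In an election, which node should a node vote for?

Rule: every node has a unique id, and a node always votes for the node with the largest id. Nodes with larger ids are therefore more likely to become leader; they are born to lead.

2. Some nodes miss an election because they were not running at the time. (Think of someone who was abroad and missed the general election: back home, they enter the LOOKING state and want to hold an election.) When such a node starts later and asks for an election, what should happen?

3. What if the leader goes down while the system is running?

In that case the other nodes notice that the leader is down and start a new election round, which eventually elects a new leader.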

        

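2.1.5 Candidate solutions

1. Designate one node as leader outright, for example always the node with the largest id. This is the simplest idea. Problem: what if that node goes down? It becomes a single point of failure.

2. In each election, make the live node with the largest id the leader. Problem: how do the other nodes find out which of the live nodes has the largest id?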

 


 

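2.2 Source code analysis

ZooKeeper's leader election source code is in the directory

src\java\main\org\apache\zookeeper\server\quorum

ZooKeeper implements three election algorithms. Here we cover only the default one, FastLeaderElection, whose source is in FastLeaderElection.java.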

        

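2.2.1 Flow of the election algorithm

When an election starts, each node creates a vote that nominates itself as leader and sends it to the other nodes. In this respect it plays the role of a Paxos proposer. The node then starts a receiver thread that accepts and processes the votes sent by other nodes, which corresponds to the Paxos acceptor role. In short, the nodes elect a leader by exchanging these vote messages.

When a node receives a vote from another node, it compares it with its own. If the received vote is better (for example, the proposed leader has a larger id, or the election round is newer), the node updates its own vote and sends it to the other nodes again. It then puts the received vote into its vote list, a key-value structure holding the votes of all nodes, where the key is a node's id and the value is that node's vote.

Next, the node counts the votes each candidate has in the vote list. If some node i has more than half of the votes, i is considered the leader. If this node is i, it sets its own state to LEADING to indicate that it is the leader; otherwise it sets its state to FOLLOWING to indicate that it is a follower.

After several rounds of message exchange, one node ends up with more than half of the votes and becomes leader. The essence of the approach is that the nodes reach agreement and elect a leader without any node needing the information (votes) of all nodes.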
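The convergence idea can be illustrated with a toy, single-threaded simulation. It is a heavily simplified sketch and not ZooKeeper code: it assumes that votes are compared by node id only (no epoch or zxid), that all nodes are alive, and that in each round every node hears only the vote of its right-hand neighbour in a ring. ToyElection and elect are names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy simulation of "exchange votes, adopt the better one, stop at a majority". */
class ToyElection {
    /** Returns the id that first gathers a strict majority of the votes. */
    static long elect(long[] ids) {
        int n = ids.length;
        long[] vote = ids.clone();              // every node first votes for itself
        while (true) {
            // Count how many nodes currently vote for each candidate.
            Map<Long, Integer> tally = new HashMap<>();
            for (long v : vote) {
                tally.merge(v, 1, Integer::sum);
            }
            for (Map.Entry<Long, Integer> e : tally.entrySet()) {
                if (e.getValue() > n / 2) {
                    return e.getKey();          // more than half of the nodes agree
                }
            }
            // One exchange round: node i hears the vote of node i+1 and keeps the larger id.
            long[] next = vote.clone();
            for (int i = 0; i < n; i++) {
                long heard = vote[(i + 1) % n];
                if (heard > next[i]) {
                    next[i] = heard;
                }
            }
            vote = next;
        }
    }
}
```

With ids {1, 2, 3}, the votes go from [1, 2, 3] to [2, 3, 3] after one round, at which point node 3 holds two of three votes and wins, even though node 0 has never seen node 3's vote directly.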

        

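2.2.2 Main data structures

Notification

A Notification is the vote information a node receives from another node. A node notifies the others whenever its vote changes, which happens in the following cases:

1. The node has just joined the election.

2. The node has received a message from another node that has a larger zxid, or the same zxid and a larger node id. In other words, the node has updated its vote to support that node.

static public class Notification {
    long leader;                    // id of the leader proposed by the sender
    long zxid;                      // zxid of the leader proposed by the sender
    long epoch;                     // election round of the sender
    QuorumPeer.ServerState state;   // state of the sender
    InetSocketAddress addr;         // address of the sender
}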

 

ToSend

ToSend is used to send a message to the other nodes when this node's vote changes.

static public class ToSend {
    int type;
    long leader;
    long zxid;
    long epoch;
    QuorumPeer.ServerState state;
    long tag;
    InetSocketAddress addr;
}

        

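2.2.3 Send queue and receive queue

The sender thread sends this node's vote to the other nodes, and the receiver thread receives votes from the other nodes.

LinkedBlockingQueue<ToSend> sendqueue;

LinkedBlockingQueue<Notification> recvqueue;

The first queue holds outgoing vote messages to be sent to the other nodes; the second holds incoming vote messages received from them.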
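Both queues are consumed with a timed poll, which returns null when nothing arrives in time. The following sketch shows that pattern on a plain queue of strings; QueueSketch and drain are illustrative names, and the String payload stands in for the real ToSend and Notification types.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Sketch of consuming a blocking queue with a timed poll, as the election threads do. */
class QueueSketch {
    /** Takes up to max messages, waiting at most timeoutMs for each; returns how many were taken. */
    static int drain(LinkedBlockingQueue<String> queue, int max, long timeoutMs) {
        int taken = 0;
        try {
            while (taken < max) {
                String msg = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break;              // timed out: nothing arrived in time
                }
                taken++;                // a real worker would process msg here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt and stop
        }
        return taken;
    }
}
```

The producer side only needs queue.offer(message), which never blocks on an unbounded LinkedBlockingQueue.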

 

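2.2.4 Vote boxes

HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

This map stores the votes this node has received from other nodes.

HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

This map stores the votes this node has received from nodes that are already in the FOLLOWING or LEADING state. It is used to verify the leader.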

 

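2.2.5 Important functions

totalOrderPredicate

The function totalOrderPredicate compares two votes. The code is:

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                      long curId, long curZxid, long curEpoch) {
    // A node with weight 0 can never win.
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
             ((newZxid > curZxid) ||
              ((newZxid == curZxid) && (newId > curId)))));
}

The comparison rules are:

1. First compare the epoch, the logical clock. The vote with the larger epoch wins.

2. If the epochs are equal, compare the zxid, the transaction id. The vote with the larger zxid wins.

3. If the zxids are also equal, compare the node ids. The vote with the larger id wins.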
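The three comparison rules above amount to a lexicographic order on (epoch, zxid, id). A minimal sketch, leaving out the weight check and using the made-up names VoteOrder and isBetter:

```java
/** Lexicographic comparison of two votes by (epoch, zxid, id); weights are ignored in this sketch. */
class VoteOrder {
    /** Returns true when the new vote should replace the current one. */
    static boolean isBetter(long newId, long newZxid, long newEpoch,
                            long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // rule 1: the newer epoch wins outright
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // rule 2: same epoch, more recent data wins
        }
        return newId > curId;             // rule 3: full tie, the larger node id wins
    }
}
```

Note that a larger id only matters when both the epoch and the zxid are equal: a vote with id 9, zxid 5 and epoch 1 does not beat a vote with id 1, zxid 5 and epoch 2.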

        

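updateProposal

The function updateProposal updates this node's own vote. The code is:

synchronized void updateProposal(long leader, long zxid, long epoch) {
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}

It updates the id of the leader this node proposes, the zxid held by that proposed leader, and the epoch held by that proposed leader. Typically, after updating its vote, a node calls sendNotifications to broadcast the vote to the other nodes.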

        

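sendNotifications

The function sendNotifications sends this node's vote to the other nodes. It does so by putting the vote into the send queue, sendqueue. It is called whenever this node's vote changes (when a better vote arrives from another node, or at the start of an election):

private void sendNotifications() {
    for (QuorumServer server : self.getVotingView().values()) {
        long sid = server.id;
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock,
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch);
        sendqueue.offer(notmsg);
    }
}

The flow is simple: build one ToSend message for each voting node, with these fields:

mType: the message type, here notification, meaning a notification sent to another node

proposedLeader: the id of the leader this node proposes

proposedZxid: the zxid of the leader this node proposes

logicalclock: this node's logical clock

ServerState: this node's state, here LOOKING

sid: the id of the node the message is addressed to

proposedEpoch: the epoch of the leader this node proposes

Then the offer method of sendqueue puts the message into this node's queue of messages waiting to be sent.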

        

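getInitId

The function getInitId returns the id this node uses in its initial vote. For a PARTICIPANT, that is, a node that takes part in voting, it is the id from the configuration file. Otherwise it is Long.MIN_VALUE, the smallest long value, indicating that the node does not take part in voting. The code is:

private long getInitId() {
    if (self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getId();
    else return Long.MIN_VALUE;
}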

        

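termPredicate

The function termPredicate decides whether the election can end, that is, whether the given vote is supported by more than half of the nodes. Its parameters are this node's vote box (a key-value structure whose key is a node id and whose value is that node's vote) and this node's current vote (the one being checked for majority support). The code is:

protected boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
    HashSet<Long> set = new HashSet<Long>();
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            set.add(entry.getKey());
        }
    }
    return self.getQuorumVerifier().containsQuorum(set);
}

The steps are:

1. Iterate over the received votes in the HashMap. For each vote equal to this node's vote, add the sender's id to a temporary set.

2. Check whether the set satisfies the majority condition. The containsQuorum implementation used here checks whether the set's size exceeds half of the total number of nodes. The code is:

public interface QuorumVerifier {
    long getWeight(long id);
    boolean containsQuorum(HashSet<Long> set);
}

public boolean containsQuorum(HashSet<Long> set) {
    return set.size() > half;
}

As the code shows, it simply compares the size of the set with half, which is half of the total number of nodes.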
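The counting step can be tried out in isolation. The sketch below simplifies a vote to just the id of the proposed leader and passes the ensemble size explicitly; QuorumCount and hasQuorum are hypothetical names, and the real termPredicate compares whole Vote objects and delegates to a QuorumVerifier.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified majority check: does candidate have votes from more than half of the ensemble? */
class QuorumCount {
    /** votes maps a sender id to the id of the leader that sender voted for. */
    static boolean hasQuorum(Map<Long, Long> votes, long candidate, int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, Long> entry : votes.entrySet()) {
            if (entry.getValue() == candidate) {
                supporters.add(entry.getKey());   // remember who supports the candidate
            }
        }
        return supporters.size() > ensembleSize / 2;
    }
}
```

In a five-node ensemble, two matching votes are not enough (2 is not greater than 5/2 = 2), while three are.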

 

ooePredicate

The function ooePredicate decides whether a leader has been elected. Two conditions must hold:

1. termPredicate is satisfied, that is, the vote has the support of more than half of the nodes.

2. checkLeader is satisfied.

The code is:

protected boolean ooePredicate(HashMap<Long, Vote> recv,
                               HashMap<Long, Vote> ooe,
                               Notification n) {
    return (termPredicate(recv, new Vote(n.version,
                                         n.leader,
                                         n.zxid,
                                         n.electionEpoch,
                                         n.peerEpoch,
                                         n.state))
            && checkLeader(ooe, n.leader, n.electionEpoch));
}

 

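checkLeader

The function checkLeader verifies that an elected leader can really act as leader. Once a leader has been chosen and a majority of nodes support it, we still need to check that the leader itself has voted and has told the other nodes that it is in the LEADING state. The purpose is to keep nodes from repeatedly electing a node that has crashed and is no longer in the LEADING state.

The code is:

protected boolean checkLeader(
        HashMap<Long, Vote> votes,
        long leader,
        long electionEpoch) {

    boolean predicate = true;

    if (leader != self.getId()) {
        if (votes.get(leader) == null) predicate = false;
        else if (votes.get(leader).getState() != ServerState.LEADING) predicate = false;
    } else if (logicalclock != electionEpoch) {
        predicate = false;
    }

    return predicate;
}

The steps are:

1. If this node is not the elected leader:

    1.1 If this node has not received a vote from the elected leader, verification fails.

    1.2 If this node has received a vote from the elected leader, but that node's state is not LEADING, verification fails.

2. If this node is the elected leader, but its logical clock is not equal to electionEpoch, verification fails.

In other words, the function first checks whether this node is the elected leader. If it is not, this node must have received a vote from the elected leader stating that it is in the LEADING state. If it is, the function compares this node's logical clock with electionEpoch; if they differ, this node's leadership belongs to an earlier election round and is out of date.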
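The same decision can be written as a small standalone function. In this sketch a vote is reduced to the state its sender reported, and the node's own id and logical clock are passed in as parameters; LeaderCheck, State and check are names invented for the illustration.

```java
import java.util.Map;

/** Standalone version of the leader sanity check, with votes reduced to the sender's state. */
class LeaderCheck {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /**
     * statesBySender holds the last state reported by each node that has left the election.
     * Returns true when the proposed leader passes the check described above.
     */
    static boolean check(Map<Long, State> statesBySender, long leader, long electionEpoch,
                         long selfId, long logicalClock) {
        if (leader != selfId) {
            // Another node is proposed: it must itself have reported that it is LEADING.
            // A missing entry yields null, which fails the comparison.
            return statesBySender.get(leader) == State.LEADING;
        }
        // This node is proposed: the claim must refer to the current election round.
        return logicalClock == electionEpoch;
    }
}
```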

 

connectAll

connectAll connects to all other nodes. The code is:

public void connectAll() {
    long sid;
    for (Enumeration<Long> en = queueSendMap.keys();
         en.hasMoreElements();) {
        sid = en.nextElement();
        connectOne(sid);
    }
}

        

connectOne

connectOne connects to a single node. The code is:

synchronized void connectOne(long sid) {
    if (senderWorkerMap.get(sid) == null) {
        InetSocketAddress electionAddr;
        if (self.quorumPeers.containsKey(sid)) {
            electionAddr = self.quorumPeers.get(sid).electionAddr;
        } else {
            LOG.warn("Invalid server id: " + sid);
            return;
        }
        try {
            if (LOG.isDebugEnabled()) {
                LOG.debug("Opening channel to server " + sid);
            }
            Socket sock = new Socket();
            setSockOpts(sock);
            sock.connect(self.getView().get(sid).electionAddr, cnxTO);
            if (LOG.isDebugEnabled()) {
                LOG.debug("Connected to server " + sid);
            }
            initiateConnection(sock, sid);
        } catch (UnresolvedAddressException e) {
            // Sun doesn't include the address that causes this
            // exception to be thrown, also UAE cannot be wrapped cleanly
            // so we log the exception in order to capture this critical
            // detail.
            LOG.warn("Cannot open channel to " + sid
                    + " at election address " + electionAddr, e);
            throw e;
        } catch (IOException e) {
            LOG.warn("Cannot open channel to " + sid
                    + " at election address " + electionAddr,
                    e);
        }
    } else {
        LOG.debug("There is a connection already for server " + sid);
    }
}

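haveDelivered

haveDelivered returns true as soon as it finds an empty send queue, that is, when everything queued for at least one peer has been sent. The code is:

boolean haveDelivered() {
    for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
        LOG.debug("Queue size: " + queue.size());
        if (queue.size() == 0) {
            return true;
        }
    }

    return false;
}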

 

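2.2.6 Processing flow of the election thread

The function lookForLeader elects a new leader (it runs on the election thread). Whenever a QuorumPeer, that is, a node, enters the LOOKING state (when it has just started, or when it finds that the leader is down), it calls this function to start the election. The flow is as follows.

First, create two vote boxes, recvset and outofelection. The former collects the votes of the current round from all other nodes; the latter collects votes from nodes in the FOLLOWING or LEADING state.

Next, increment the node's logical clock (logicalclock++) and call updateProposal to create the node's initial vote. Initially every node votes for itself as leader.

Then call sendNotifications to put the node's vote into the send queue so that the sender thread broadcasts it to the other nodes.

After that, loop for as long as the node is still LOOKING and the stop flag has not been set:

1. Take one vote from the receive queue recvqueue, waiting up to the current timeout. If nothing arrives, no vote has been received from other nodes in that time. In that case check whether the pending messages have been delivered: if so, put this node's vote into the send queue again; otherwise, connect to all nodes. Then adjust the receive timeout using exponential backoff: each timeout is twice the previous one, capped at an upper limit.

2. If a vote was received, look at the state of the sending node. There are three cases:

2.1 LOOKING

The sender is in the LOOKING state and has not yet settled on a leader. Compare the logical clocks, which again gives three cases:

2.1.1 The local clock is smaller than the sender's. This node is in an earlier election round, so it does the following:

Update the local clock: logicalclock = n.electionEpoch;

Clear the local vote box, because its contents are votes from the earlier round and are out of date: recvset.clear();

Compare this node's initial vote with the sender's vote and adopt the larger one.

Put this node's vote into the send queue to broadcast it to the other nodes.

2.1.2 The local clock is larger than the sender's. The sender is in an earlier election round, so the vote is ignored and the loop moves on to the next vote.

2.1.3 The two clocks are equal. Both nodes are in the same election round, so compare the two votes; if the sender's vote is larger, adopt it and put this node's updated vote into the send queue to broadcast it to the other nodes.

Unless the vote was ignored, put the received vote into this node's vote box.

Then check whether the election can end, that is, whether termPredicate holds. If it does, the node waits up to finalizeWait for further votes; if no better vote arrives, it sets its state to LEADING or to a learner state and returns the final vote.

Otherwise continue with the next iteration of the loop and keep receiving votes.

2.2 OBSERVING

The sender is not a voting node, so the vote is ignored.

2.3 FOLLOWING/LEADING

The sender believes that a leader has already been elected.

If the two logical clocks are equal, put the vote into this node's vote box and use ooePredicate to check whether the election can end.

In either case, also put the vote into outofelection and check ooePredicate against outofelection alone. If it holds, a majority is already following that leader, so this node adopts the sender's election round and joins the established ensemble.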
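The three-way comparison of logical clocks in case 2.1 can be summarized as a small classification function. This is only an illustration of the decision; EpochCases and the Action names are made up here and do not appear in ZooKeeper.

```java
/** Classifies a notification from a LOOKING peer by comparing election rounds. */
class EpochCases {
    enum Action {
        JOIN_NEWER_ROUND,   // case 2.1.1: adopt the sender's round, clear recvset, re-vote
        IGNORE_STALE,       // case 2.1.2: the sender is in an older round
        COMPARE_VOTES       // case 2.1.3: same round, compare the two votes
    }

    static Action classify(long notificationEpoch, long logicalClock) {
        if (notificationEpoch > logicalClock) {
            return Action.JOIN_NEWER_ROUND;
        }
        if (notificationEpoch < logicalClock) {
            return Action.IGNORE_STALE;
        }
        return Action.COMPARE_VOTES;
    }
}
```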

 

The code is:

         public Vote lookForLeader() throws InterruptedException {

       try {

           self.jmxLeaderElectionBean = new LeaderElectionBean();

           MBeanRegistry.getInstance().register(

                    self.jmxLeaderElectionBean,self.jmxLocalPeerBean);

       } catch (Exception e) {

           LOG.warn("Failed to register with JMX", e);

           self.jmxLeaderElectionBean = null;

       }

       if (self.start_fle == 0) {

          self.start_fle = System.currentTimeMillis();

       }

       try {

           HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

 

           HashMap<Long, Vote> outofelection = new HashMap<Long,Vote>();

 

           int notTimeout = finalizeWait;

 

           synchronized(this){

                logicalclock++;

                updateProposal(getInitId(),getInitLastLoggedZxid(), getPeerEpoch());

           }

 

           LOG.info("New election. My id = " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));

           sendNotifications();

 

           /*

            * Loop in which we exchange notifications until we find a leader

            */

 

           while ((self.getPeerState() == ServerState.LOOKING) &&

                    (!stop)){

                /*

                 * Remove next notification from queue, times out after 2 times

                 * the termination time

                 */

                Notification n =recvqueue.poll(notTimeout,

                        TimeUnit.MILLISECONDS);

 

                /*

                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.

                 */

                if(n == null){

                   if(manager.haveDelivered()){

                        sendNotifications();

                    } else {

                        manager.connectAll();

                    }

 

                   /*

                     * Exponential backoff

                     */

                    int tmpTimeOut = notTimeout * 2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval ?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);

                }

                else if (self.getVotingView().containsKey(n.sid)) {

                    /*

                     * Only proceed if the vote comes from a replica in the

                     * voting view.

                     */

                    switch (n.state) {

                    case LOOKING:

                        // If notification > current, replace and send messages out

                        if (n.electionEpoch> logicalclock) {

                            logicalclock =n.electionEpoch;

                            recvset.clear();

                           if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,

                                    getInitId(), getInitLastLoggedZxid(),getPeerEpoch())) {

                               updateProposal(n.leader, n.zxid, n.peerEpoch);

                            } else {

                               updateProposal(getInitId(),

                                        getInitLastLoggedZxid(),

                                       getPeerEpoch());

                            }

                           sendNotifications();

                        } else if(n.electionEpoch < logicalclock) {

                           if(LOG.isDebugEnabled()){

                               LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"

                                        +Long.toHexString(n.electionEpoch)

                                       +", logicalclock=0x" + Long.toHexString(logicalclock));

                            }

                            break;

                        } else if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,

                                proposedLeader, proposedZxid,proposedEpoch)) {

                           updateProposal(n.leader, n.zxid, n.peerEpoch);

                           sendNotifications();

                        }

 

                        if(LOG.isDebugEnabled()){

                           LOG.debug("Adding vote: from=" + n.sid +

                                    ",proposed leader=" + n.leader +

                                    ",proposed zxid=0x" + Long.toHexString(n.zxid) +

                                    ", proposed electionepoch=0x" + Long.toHexString(n.electionEpoch));

                        }

 

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

 

                        if (termPredicate(recvset,

                                new Vote(proposedLeader, proposedZxid,

                                       logicalclock, proposedEpoch))) {

 

                            // Verify if there is any change in the proposed leader

                            while((n = recvqueue.poll(finalizeWait,

                                   TimeUnit.MILLISECONDS)) != null){

                               if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,

                                       proposedLeader, proposedZxid, proposedEpoch)){

                                   recvqueue.put(n);

                                    break;

                                }

                            }

 

                            /*

                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue

                             */

                            if (n == null) {

                               self.setPeerState((proposedLeader == self.getId()) ?

                                       ServerState.LEADING: learningState());

 

                                Vote endVote =new Vote(proposedLeader,

                                                       proposedZxid,

                                                        logicalclock,

                                                       proposedEpoch);

                               leaveInstance(endVote);

                                return endVote;

                           }

                        }

                        break;

                    case OBSERVING:

                       LOG.debug("Notification from observer: " + n.sid);

                        break;

                    case FOLLOWING:

                    case LEADING:

                        /*

                         * Consider all notifications from the same epoch

                         * together.

                         */

                        if(n.electionEpoch ==logicalclock){

                            recvset.put(n.sid,new Vote(n.leader,

                                                         n.zxid,

                                                         n.electionEpoch,

                                                         n.peerEpoch));

                          

                           if(ooePredicate(recvset, outofelection, n)) {

                               self.setPeerState((n.leader == self.getId()) ?

                                       ServerState.LEADING: learningState());

 

                                Vote endVote =new Vote(n.leader,

                                        n.zxid,

                                       n.electionEpoch,

                                       n.peerEpoch);

                               leaveInstance(endVote);

                                return endVote;

                            }

                        }

 

                        /*

                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.

                         */

                       outofelection.put(n.sid, new Vote(n.version,

                                                           n.leader,

                                                           n.zxid,

                                                           n.electionEpoch,

                                                           n.peerEpoch,

                                                           n.state));

          

                       if(ooePredicate(outofelection, outofelection, n)) {

                            synchronized(this){

                                logicalclock =n.electionEpoch;

                                self.setPeerState((n.leader== self.getId()) ?

                                       ServerState.LEADING: learningState());

                            }

                            Vote endVote = new Vote(n.leader,

                                                    n.zxid,

                                                   n.electionEpoch,

                                                   n.peerEpoch);

                           leaveInstance(endVote);

                            return endVote;

                       }

                        break;

                    default:

                       LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)",

                                n.state,n.sid);

                        break;

                    }

                } else {

                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);

                }

           }

           return null;

       } finally {

           try {

                if(self.jmxLeaderElectionBean!= null){

                   MBeanRegistry.getInstance().unregister(

                           self.jmxLeaderElectionBean);

                }

           } catch (Exception e) {

                LOG.warn("Failed to unregister with JMX", e);

           }

           self.jmxLeaderElectionBean = null;

       }

}
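The timeout handling inside the loop of lookForLeader is a capped exponential backoff. A minimal sketch of just that step follows; Backoff and next are illustrative names, and the values 200 and 60000 used in the example afterwards are assumptions standing in for finalizeWait and maxNotificationInterval.

```java
/** Capped exponential backoff: double the wait each time, but never exceed max. */
class Backoff {
    static int next(int currentMs, int maxMs) {
        long doubled = 2L * currentMs;          // long arithmetic avoids int overflow
        return (int) Math.min(doubled, maxMs);
    }
}
```

Starting from 200 ms with a cap of 60000 ms, successive timeouts would be 400, 800, 1600 and so on, until the value is clamped to 60000.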

         

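2.2.7 Important variables

logicalclock

volatile long logicalclock;

The election round number. It is incremented at the start of lookForLeader. In addition, when a vote arrives from another node whose electionEpoch is larger than this value, this value is set to that electionEpoch. The variable appears to live only in memory, so it is meaningful only while the node is running and presumably starts from 0 again after a restart.

proposedLeader

long proposedLeader;

The id of the leader this node proposes. It starts as the node's own id and is updated later. The value is not read from a file, so after a restart the node uses its own id again. The source of getId is:

public long getId() {
    return myid;
}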

 

proposedZxid

long proposedZxid;

The zxid this node proposes. It is initialized to -1 in the starter function and updated in updateProposal:

updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());

private long getInitLastLoggedZxid() {
    if (self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getLastLoggedZxid();
    else return Long.MIN_VALUE;
}

public long getLastLoggedZxid() {
    if (!zkDb.isInitialized()) {
        loadDataBase();
    }
    return zkDb.getDataTreeLastProcessedZxid();
}

 

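proposedEpoch

long proposedEpoch;

The epoch this node proposes. It is updated whenever updateProposal updates the vote. During startup, the first call to updateProposal sets proposedEpoch to the result of getPeerEpoch, which in turn calls getCurrentEpoch. The code of getCurrentEpoch is:

public long getCurrentEpoch() throws IOException {
    if (currentEpoch == -1) {
        currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
    }
    return currentEpoch;
}

So the value is read from a file on disk. After a restart, a node uses the value it had when it was last alive.

Why is an epoch needed when there is already a zxid? The zxid indicates how recent the data is, while the epoch indicates the election round.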
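The two values are related: the receiver code falls back to ZxidUtils.getEpochFromZxid(n.zxid) when a message carries no peer epoch. As far as I understand the convention, a zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a counter within that epoch. The sketch below illustrates that layout under this assumption; ZxidParts and its method names are made up for the example.

```java
/** Splits and builds 64-bit zxids, assuming the layout: high 32 bits epoch, low 32 bits counter. */
class ZxidParts {
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    static long make(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
}
```

For instance, epoch 3 with counter 7 gives the zxid 0x300000007, and any zxid from epoch 4 compares larger than every zxid from epoch 3.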

        

startLeaderElection

A member function of QuorumPeer:

synchronized public void startLeaderElection() {
    try {
        currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }
    for (QuorumServer p : getView().values()) {
        if (p.id == myid) {
            myQuorumAddr = p.addr;
            break;
        }
    }
    if (myQuorumAddr == null) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    if (electionType == 0) {
        try {
            udpSocket = new DatagramSocket(myQuorumAddr.getPort());
            responder = new ResponderThread();
            responder.start();
        } catch (SocketException e) {
            throw new RuntimeException(e);
        }
    }
    this.electionAlg = createElectionAlgorithm(electionType);
}

        

         protected Election createElectionAlgorithm(int electionAlgorithm){

       Election le=null;

               

       //TODO: use a factory rather than a switch

       switch (electionAlgorithm) {

       case 0:

           le = new LeaderElection(this);

           break;

       case 1:

           le = new AuthFastLeaderElection(this);

           break;

       case 2:

           le = new AuthFastLeaderElection(this, true);

           break;

       case 3:

           qcm = new QuorumCnxManager(this);

           QuorumCnxManager.Listener listener = qcm.listener;

           if(listener != null){

                listener.start();

                le = new FastLeaderElection(this, qcm);

           } else {

                LOG.error("Null listener when initializing cnx manager");

           }

           break;

       default:

           assert false;

       }

       return le;

    }

        

        

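2.2.8 Sender thread and receiver thread

WorkerSender

WorkerSender is the message-sending thread and implements Runnable. Its job is to keep taking votes from sendqueue and sending them to the other nodes.

Members of the class:

volatile boolean stop;

This variable controls when the thread exits.

QuorumCnxManager manager;

The connection manager used for the election.

The run method takes messages from the send queue and sends them:

         public void run() {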

                while (!stop) {

                    try {

                        ToSend m =sendqueue.poll(3000, TimeUnit.MILLISECONDS);

                        if(m == null) continue;

 

                        process(m);

                    } catch(InterruptedException e) {

                        break;

                    }

                }

                LOG.info("WorkerSender is down");

}

 

void process(ToSend m) {

                ByteBuffer requestBuffer =buildMsg(m.state.ordinal(),

                                                       m.leader,

                                                       m.zxid,

                                                       m.electionEpoch,

                                                       m.peerEpoch);

                manager.toSend(m.sid,requestBuffer);

            }

 

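WorkerReceiver

WorkerReceiver is the message-receiving thread. It keeps receiving election messages sent by the other nodes and puts them into recvqueue.

public void run() {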

 

                Message response;

                while (!stop) {

                    // Sleeps on receive

                    try{

                        response =manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);

                        if(response == null)continue;

 

                        /*
                         * If it is from an observer, respond right away.
                         * Note that the following predicate assumes that
                         * if a server is not a follower, then it must be
                         * an observer. If we ever have any other type of
                         * learner in the future, we'll have to change the
                         * way we check for observers.
                         */

                        if(!self.getVotingView().containsKey(response.sid)){

                            Vote current =self.getCurrentVote();

                            ToSend notmsg = new ToSend(ToSend.mType.notification,

                                   current.getId(),

                                    current.getZxid(),

                                   logicalclock,

                                   self.getPeerState(),

                                   response.sid,

                                   current.getPeerEpoch());

 

                           sendqueue.offer(notmsg);

                        } else {

                            // Receive new message

                            if(LOG.isDebugEnabled()) {

                               LOG.debug("Receive new notification message. My id = "

                                        +self.getId());

                            }

 

                            /*

                             * We check for 28 bytes for backward compatibility

                             */

                            if(response.buffer.capacity() < 28) {

                               LOG.error("Got a short response: "

                                        +response.buffer.capacity());

                                continue;

                            }

                            boolean backCompatibility = (response.buffer.capacity() == 28);

                           response.buffer.clear();

 

                            // Instantiate Notification and set its attributes
                            Notification n = new Notification();

                           

                            // State of peerthat sent this message

                           QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;

                            switch(response.buffer.getInt()) {

                            case 0:

                                ackstate =QuorumPeer.ServerState.LOOKING;

                                break;

                            case 1:

                                ackstate =QuorumPeer.ServerState.FOLLOWING;

                                break;

                            case 2:

                                ackstate =QuorumPeer.ServerState.LEADING;

                                break;

                            case 3:

                                ackstate =QuorumPeer.ServerState.OBSERVING;

                                break;

                            default:

                                continue;

                            }

                           

                            n.leader = response.buffer.getLong();
                            n.zxid = response.buffer.getLong();
                            n.electionEpoch = response.buffer.getLong();
                            n.state = ackstate;
                            n.sid = response.sid;
                            if(!backCompatibility){
                                n.peerEpoch = response.buffer.getLong();
                            } else {
                                if(LOG.isInfoEnabled()){
                                    LOG.info("Backward compatibility mode, server id=" + n.sid);
                                }
                                n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
                            }

                            /*
                             * Version added in 3.4.6
                             */

                            n.version = (response.buffer.remaining() >= 4) ?
                                         response.buffer.getInt() : 0x0;

 

                            /*
                             * Print notification info
                             */
                            if(LOG.isInfoEnabled()){
                                printNotification(n);
                            }

                            /*
                             * If this server is looking, then send proposed leader
                             */

                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                recvqueue.offer(n);

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock)){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock,
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }

                            } else {
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }

                                    ToSend notmsg;
                                    if(n.version > 0x0) {
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                current.getId(),
                                                current.getZxid(),
                                                current.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                current.getPeerEpoch());

                                    } else {
                                        Vote bcVote = self.getBCVote();
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                bcVote.getId(),
                                                bcVote.getZxid(),
                                                bcVote.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                bcVote.getPeerEpoch());
                                    }
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }

                    } catch (InterruptedException e) {
                        System.out.println("Interrupted Exception while waiting for new message" +
                                e.toString());
                    }
                }
                LOG.info("WorkerReceiver is down");
            }
        }
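The receive path above reads the notification fields in a fixed order: a 4-byte state, then leader, zxid, and electionEpoch as 8-byte longs (28 bytes in total, the legacy layout), followed by an 8-byte peerEpoch and, where present, a 4-byte version. The sketch below mirrors that decoding with a plain ByteBuffer. NotificationCodec, Decoded, encode, and encodeLegacy are hypothetical names introduced only for illustration and are not part of ZooKeeper; the legacy fallback assumes the epoch is the high 32 bits of the zxid, which is what ZxidUtils.getEpochFromZxid is understood to compute.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not ZooKeeper code: mirrors the field order read by WorkerReceiver. */
final class NotificationCodec {

    /** Decoded fields; the names follow the Notification fields used in the listing. */
    static final class Decoded {
        int state;
        long leader;
        long zxid;
        long electionEpoch;
        long peerEpoch;
        int version;
    }

    /** Full layout: state, leader, zxid, electionEpoch, peerEpoch, version (40 bytes). */
    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch, int version) {
        ByteBuffer buf = ByteBuffer.allocate(40); // 4 + 8 + 8 + 8 + 8 + 4
        buf.putInt(state).putLong(leader).putLong(zxid)
           .putLong(electionEpoch).putLong(peerEpoch).putInt(version);
        buf.flip();
        return buf;
    }

    /** Legacy layout: state, leader, zxid, electionEpoch only (28 bytes). */
    static ByteBuffer encodeLegacy(int state, long leader, long zxid, long electionEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(28); // 4 + 8 + 8 + 8
        buf.putInt(state).putLong(leader).putLong(zxid).putLong(electionEpoch);
        buf.flip();
        return buf;
    }

    /** Decodes either layout, using the buffer capacity to detect the legacy form. */
    static Decoded decode(ByteBuffer buf) {
        if (buf.capacity() < 28) {
            throw new IllegalArgumentException("short notification: " + buf.capacity());
        }
        boolean legacy = (buf.capacity() == 28);
        buf.clear(); // position = 0, limit = capacity, as in the listing

        Decoded d = new Decoded();
        d.state = buf.getInt();
        d.leader = buf.getLong();
        d.zxid = buf.getLong();
        d.electionEpoch = buf.getLong();
        // Assumption: in legacy mode the epoch is taken from the high 32 bits of the zxid.
        d.peerEpoch = legacy ? (d.zxid >> 32) : buf.getLong();
        // The version field is optional; absent means 0.
        d.version = (buf.remaining() >= 4) ? buf.getInt() : 0;
        return d;
    }
}
```

Detecting the legacy format by capacity alone works only because every field has a fixed width, which is why the listing can compare against the constant 28.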


 

2.3、Worked example

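         Suppose a ZooKeeper cluster has 3 nodes with IDs 1, 2, and 3. When the cluster first starts, every node's zxid is 1.

         1、Nodes 1, 2, and 3 start up, enter the LOOKING state, and begin leader election. Each node's proposedLeader (the leader it recommends) is itself; logicalclock is 1 on every node; proposedZxid is 1; proposedEpoch is 1; and each node's vote list holds only its vote for itself (1->1, 2->2, 3->3). Node 1 is the first to send its vote to nodes 2 and 3.

         2、Nodes 2 and 3 receive node 1's vote and first check node 1's state, which is LOOKING. Each then compares the electionEpoch sent by node 1 with its own logicalclock and finds them equal (both 1). Next, each compares the received leader, zxid, and peerEpoch with its local proposedLeader, proposedZxid, and proposedEpoch. The zxid and epoch values are all equal here, so the leader id decides: node 2 finds that the leader id recommended by node 1 is smaller than its own (1<2), and node 3 finds the same (1<3), so neither updates its vote. Nodes 2 and 3 then add node 1's vote to their own vote lists, so node 2's list of received votes is: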

         1->1

         2->2

         and node 3's is:

         1->1

         3->3

         Nodes 2 and 3 then check whether the election can end and find that it cannot. [Figure: state after step 2]
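The comparison in step 2 can be sketched as a total order over votes. The sketch assumes the priority used by totalOrderPredicate in ZooKeeper 3.4.x: the higher peerEpoch wins, then the higher zxid, then the higher server id. VoteOrder and beats are hypothetical names for illustration, and the weighting used by hierarchical quorums is left out.

```java
/** Hypothetical sketch of the vote ordering; not the ZooKeeper implementation itself. */
final class VoteOrder {

    /**
     * Returns true when the received vote (newId, newZxid, newEpoch) should
     * replace the current proposal (curId, curZxid, curEpoch).
     */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // a newer epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // then the more up-to-date data
        }
        return newId > curId;             // finally the larger server id
    }

    public static void main(String[] args) {
        // Step 2 of the example: node 2 receives 1->1 with equal zxid and epoch.
        System.out.println(beats(1, 1, 1, 2, 1, 1)); // false: node 2 keeps its vote
        // Node 1 later receives 2->2.
        System.out.println(beats(2, 1, 1, 1, 1, 1)); // true: node 1 switches to 2
    }
}
```

Because zxid outranks id, a node with a small id can still win the election if it holds more recent data than its peers.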

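         3、Node 2 sends its vote to nodes 1 and 3. Node 3's vote never goes out, because its sending thread has failed. In the vote that node 2 sends, the chosen leader is node 2 itself.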

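         Nodes 1 and 3 receive node 2's vote. Node 1 compares its own logicalclock with the electionEpoch sent by node 2; the two are equal. It then compares the received leader, zxid, and peerEpoch with its local proposedLeader, proposedZxid, and proposedEpoch, and finds that the leader id recommended by node 2 (2) is larger than its own proposedLeader (1), so it updates its vote and sets proposedLeader to 2. Node 1 then puts node 2's vote (2->2) into its box of received votes and checks whether the election can end (by calling termPredicate). Because node 2 has been chosen by more than half of the nodes (1 and 2), the election can end; since node 1 is not itself the leader, it changes its state to FOLLOWING.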

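         Node 3 compares its own logicalclock with the electionEpoch sent by node 2; the two are equal. It then compares the received leader, zxid, and peerEpoch with its local proposedLeader, proposedZxid, and proposedEpoch, and finds that the leader id recommended by node 2 (2) is smaller than its own proposedLeader (3), so it does not update its vote. Node 3 then puts node 2's vote (2->2) into its box of received votes and checks whether the election can end (by calling termPredicate). Its own candidate, node 3, has not received more than half of the votes, so the election continues.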
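The end-of-election check used by nodes 1 and 3 in this step can be sketched as a simple majority count over the box of received votes. This is a simplification: the real termPredicate compares whole votes (id, zxid, and epochs) and delegates the quorum decision to a quorum verifier, whereas the hypothetical QuorumCheck.hasQuorum below looks only at leader ids and assumes a plain majority.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical majority check; a simplified stand-in for termPredicate. */
final class QuorumCheck {

    /**
     * @param votes          box of received votes: voter id -> leader id voted for
     * @param proposedLeader the leader this node currently proposes
     * @param clusterSize    number of voting nodes in the cluster
     * @return true when more than half of the cluster supports proposedLeader
     */
    static boolean hasQuorum(Map<Long, Long> votes, long proposedLeader, int clusterSize) {
        int supporters = 0;
        for (long leader : votes.values()) {
            if (leader == proposedLeader) {
                supporters++;
            }
        }
        return supporters > clusterSize / 2;
    }

    public static void main(String[] args) {
        // Node 1 after switching to leader 2: box is 1->2, 2->2.
        Map<Long, Long> node1 = new HashMap<>();
        node1.put(1L, 2L);
        node1.put(2L, 2L);
        System.out.println(hasQuorum(node1, 2L, 3)); // true

        // Node 3 still proposing itself: box is 1->1, 2->2, 3->3.
        Map<Long, Long> node3 = new HashMap<>();
        node3.put(1L, 1L);
        node3.put(2L, 2L);
        node3.put(3L, 3L);
        System.out.println(hasQuorum(node3, 3L, 3)); // false
    }
}
```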

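         4、Having received node 2's vote and updated its own, node 1 sends its vote to nodes 2 and 3. By now the leader chosen by node 1 has changed to 2, and node 1's state has already become FOLLOWING.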

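         5、When node 2 receives node 1's vote, it checks node 1's state and finds it is FOLLOWING, which means node 1 already considers the leader elected, and that leader is 2. Node 2 first updates its box of received votes, changing node 1's vote to 2, then checks whether the election has ended and finds that it has. Node 2 then updates its own state: since it is the leader recommended by more than half of the nodes, it changes its state to LEADING.

         Likewise, when node 3 receives node 1's vote, it checks node 1's state and finds it is FOLLOWING, which means node 1 already considers the leader elected, and that leader is 2. Node 3 first updates its box of received votes, changing node 1's vote to 2, then checks whether the election has ended and finds that it has. Node 3 then updates its own state: since it is not the leader recommended by more than half of the nodes, it changes its state to FOLLOWING.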

        

         At this point the election is over: node 2 is the elected leader, and nodes 1 and 3 are followers.
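The whole walkthrough can be replayed with a small simulation. It follows the narrative of this example rather than the real FastLeaderElection code: zxid and epochs are equal everywhere, so only ids are compared; a vote from a node that has already left LOOKING is treated as evidence for that node's leader; and node 3 never sends, as in the example. ElectionWalkthrough, Node, receive, and run are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical replay of the three-node example; not ZooKeeper code. */
final class ElectionWalkthrough {
    enum State { LOOKING, FOLLOWING, LEADING }

    static final int CLUSTER_SIZE = 3;

    static final class Node {
        final long id;
        long proposedLeader;
        State state = State.LOOKING;
        final Map<Long, Long> box = new HashMap<>(); // voter id -> leader id

        Node(long id) {
            this.id = id;
            this.proposedLeader = id; // every node first votes for itself
            box.put(id, id);
        }

        boolean hasQuorum(long leader) {
            int supporters = 0;
            for (long l : box.values()) {
                if (l == leader) supporters++;
            }
            return supporters > CLUSTER_SIZE / 2;
        }

        /** Processes the current vote of the sender. */
        void receive(Node sender) {
            if (state != State.LOOKING) return; // election already over for this node
            long candidate = sender.proposedLeader;
            boolean senderLooking = (sender.state == State.LOOKING);
            if (senderLooking && candidate > proposedLeader) {
                proposedLeader = candidate;     // adopt the better vote
                box.put(id, candidate);
            }
            box.put(sender.id, candidate);
            // A decided sender vouches for its leader; otherwise check our own proposal.
            long toCheck = senderLooking ? proposedLeader : candidate;
            if (hasQuorum(toCheck)) {
                proposedLeader = toCheck;
                state = (toCheck == id) ? State.LEADING : State.FOLLOWING;
            }
        }
    }

    static Node[] run() {
        Node n1 = new Node(1), n2 = new Node(2), n3 = new Node(3);
        n2.receive(n1); n3.receive(n1); // steps 1 and 2: node 1 sends 1->1
        n1.receive(n2); n3.receive(n2); // step 3: node 2 sends 2->2
        n2.receive(n1); n3.receive(n1); // steps 4 and 5: node 1 sends 1->2
        return new Node[] { n1, n2, n3 };
    }

    public static void main(String[] args) {
        for (Node n : run()) {
            System.out.println("node " + n.id + ": " + n.state + ", leader=" + n.proposedLeader);
        }
    }
}
```

Running main prints FOLLOWING for nodes 1 and 3 and LEADING for node 2, each with leader=2, which matches the outcome described above.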

