ZooKeeper Source Code Analysis: Leader Election (the FastLeaderElection Strategy)

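Recap:
In the previous article we found the entry point and identified the algorithm zk uses for elections: the FastLeaderElection strategy. This article looks in detail at how FastLeaderElection runs an election.
Let's get going.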

FastLeaderElection
Below is the core of the zk election algorithm, starting at the lookForLeader() method around line 870. I have added some comments of my own.

public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {

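            // Votes received in the current election round
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            // Votes from peers that have already left the election (FOLLOWING or LEADING)
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();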

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet(); // AtomicLong: advance the logical clock, i.e. the election epoch
                // Update our proposal: myid, zxid, epoch
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            // Send our vote to every voting peer
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            // While we are LOOKING, keep exchanging notifications with the other peers until a leader is elected
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                // Take the next vote notification from the receive queue
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){  // have the queued messages been delivered?
                        sendNotifications();        // yes: send our notification again
                    } else {
                        manager.connectAll();  // no: the other servers may not be up yet, so try to connect to them
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                // Check the sid of the sender and the sid of the proposed leader:
                // both must belong to the current (or next) voting view
                else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    switch (n.state) { // state of the peer that sent the notification
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        // Is the received election epoch larger than ours? If so, the sender is in a newer round
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);  // catch up to the newer round
                            recvset.clear();  // discard the votes collected in the old round

                            /*
                             * We return true if one of the following three cases hold:
                             * 1- New epoch is higher
                             * 2- New epoch is the same as current epoch, but new zxid is higher
                             * 3- New epoch is the same as current epoch, new zxid is the same
                             *  as current zxid, but server id is higher.
                             */
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch); // the received vote wins: adopt it as our proposal
                            } else {  // the received vote does not win: propose our own initial vote
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();  // broadcast the updated vote
                        } else if (n.electionEpoch < logicalclock.get()) {  // the sender is in an older round, so its vote is ignored
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                            // Same election epoch: compare the votes (peer epoch, then zxid, then myid)
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

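                        // Record the sender's vote locally; it is used for the final decision
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        // Check whether the election can end; by default more than half of the voters must agree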
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n); // a better vote arrived: put it back and keep electing
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:  // observers do not vote, so their notifications are only logged
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify that
                         * a majority are following the same leader.
                         * Only peer epoch is used to check that the votes come
                         * from the same ensemble. This is because there is at
                         * least one corner case in which the ensemble can be
                         * created with inconsistent zxid and election epoch
                         * info. However, given that only one ensemble can be
                         * running at a single point in time and that each 
                         * epoch is used only once, using only the epoch to 
                         * compare the votes is sufficient.
                         * 
                         * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                         */
                        outofelection.put(n.sid, new Vote(n.leader, 
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                        if (termPredicate(outofelection, new Vote(n.leader,
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecoginized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }

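The code above defines two HashMaps, and I commented both of them.
// Votes received in the current election round
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
// Votes from peers that have already left the election (FOLLOWING or LEADING)
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
The first map holds the votes this node receives from peers that are still LOOKING. The second holds the votes reported by peers that are already FOLLOWING or LEADING, which lets a node join an ensemble that has already elected a leader.
Keep these two variables in mind as we read on.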

Around line 892

synchronized(this){
   logicalclock.incrementAndGet(); // AtomicLong: advance the logical clock, i.e. the election epoch
   // Update our proposal: myid, zxid, epoch
   updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}

The updateProposal method updates our vote proposal. It takes three arguments, so let's see what each of them is.

1. getInitId()

private long getInitId(){
       if(self.getQuorumVerifier().getVotingMembers().containsKey(self.getId()))       
           return self.getId();
      else return Long.MIN_VALUE;
   }

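self.getId() returns our myid. Look familiar? It is the value we put in the myid file when setting up the zk cluster. A server that is not a voting member returns Long.MIN_VALUE instead.

2. getInitLastLoggedZxid()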

private long getInitLastLoggedZxid(){
       if(self.getLearnerType() == LearnerType.PARTICIPANT)
           return self.getLastLoggedZxid();
       else return Long.MIN_VALUE;
   }

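This returns the highest zxid this node has seen. What is a zxid? It is a transaction id; here, the id of the last transaction this node has logged.

3. getPeerEpoch()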

 private long getPeerEpoch(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
    	try {
    		return self.getCurrentEpoch();
    	} catch(IOException e) {
    		RuntimeException re = new RuntimeException(e.getMessage());
   		re.setStackTrace(e.getStackTrace());
   		throw re;
  	}
    else return Long.MIN_VALUE;
  }

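A note on the epoch: getPeerEpoch() returns the node's current epoch (self.getCurrentEpoch()), which identifies the leader term the node last accepted. It is not the same counter as the logicalclock incremented above; logicalclock numbers election rounds and travels in notifications as electionEpoch, while this value travels as peerEpoch.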
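To make the three arguments concrete, here is a minimal sketch of how the initial vote could be assembled. The class InitialVoteSketch and its initialVote method are hypothetical names used only for illustration, and the sketch simplifies getInitId() by branching on the learner type instead of asking the quorum verifier.

```java
final class InitialVoteSketch {
    enum LearnerType { PARTICIPANT, OBSERVER }

    /**
     * Returns {leader, zxid, peerEpoch} for the first proposal.
     * Non-voting servers propose Long.MIN_VALUE for every field,
     * so their proposal loses every comparison against a real vote.
     */
    static long[] initialVote(LearnerType type, long myid, long lastLoggedZxid, long currentEpoch) {
        if (type == LearnerType.PARTICIPANT) {
            return new long[] { myid, lastLoggedZxid, currentEpoch };
        }
        return new long[] { Long.MIN_VALUE, Long.MIN_VALUE, Long.MIN_VALUE };
    }

    public static void main(String[] args) {
        long[] vote = initialVote(LearnerType.PARTICIPANT, 3L, 0x12L, 1L);
        System.out.println("leader=" + vote[0]
                + ", zxid=0x" + Long.toHexString(vote[1])
                + ", peerEpoch=" + vote[2]);
    }
}
```

The point of the MIN_VALUE fallback is that an observer can take part in the message exchange without ever being chosen as leader.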

Now that we know what the three arguments are, let's see what updateProposal itself does.

synchronized void updateProposal(long leader, long zxid, long epoch){
      if(LOG.isDebugEnabled()){
           LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                   + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                   + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
      }
      proposedLeader = leader;
       proposedZxid = zxid;
       proposedEpoch = epoch;
   }

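This method is simple: it assigns the leader id, zxid and epoch it is given to the three fields proposedLeader, proposedZxid and proposedEpoch. At the start of an election these are our own myid, zxid and epoch.
We still have not seen the actual election algorithm. Be patient and keep reading: what happens after updateProposal has run?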

Around line 901
There is a call to sendNotifications(). As the name suggests, it sends notifications. Let's step into it.

private void sendNotifications() {
     for (long sid : self.getCurrentAndNextConfigVoters()) {
         QuorumVerifier qv = self.getQuorumVerifier();
         ToSend notmsg = new ToSend(ToSend.mType.notification,
                 proposedLeader,
                 proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
               proposedEpoch, qv.toString().getBytes());
       if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                 Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                  " (n.round), " + sid + " (recipient), " + self.getId() +
                   " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
         }
         sendqueue.offer(notmsg);
     }
   }

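Here proposedLeader, proposedZxid and proposedEpoch are wrapped in a ToSend object, one per voter, and each ToSend is put on sendqueue, which is a LinkedBlockingQueue.
So sendNotifications() packs our proposal into messages and places them on a queue. When zk takes the messages off that queue is a question we set aside for now; let's continue with the logic that follows.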
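The hand-off through a queue can be sketched as follows. This is only an illustration: SendQueueSketch and Message are hypothetical stand-ins (Message plays the role of ToSend with fewer fields), and the real messages are drained by a separate sender thread rather than by main.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

final class SendQueueSketch {
    /** Hypothetical stand-in for ToSend: our proposal plus the recipient's sid. */
    static final class Message {
        final long leader, zxid, electionEpoch, peerEpoch, recipientSid;

        Message(long leader, long zxid, long electionEpoch, long peerEpoch, long recipientSid) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
            this.recipientSid = recipientSid;
        }
    }

    /** Queues one copy of the proposal per voter; nothing is sent over the network here. */
    static void sendNotifications(LinkedBlockingQueue<Message> sendqueue, List<Long> voters,
                                  long leader, long zxid, long electionEpoch, long peerEpoch) {
        for (long sid : voters) {
            sendqueue.offer(new Message(leader, zxid, electionEpoch, peerEpoch, sid));
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Message> sendqueue = new LinkedBlockingQueue<>();
        sendNotifications(sendqueue, Arrays.asList(1L, 2L, 3L), 1L, 0x10L, 1L, 1L);
        Message m;
        while ((m = sendqueue.poll()) != null) { // stands in for the sender thread
            System.out.println("to sid " + m.recipientSid + ": vote for " + m.leader);
        }
    }
}
```

Decoupling the election logic from the network through a queue means lookForLeader() never blocks on a slow or unreachable peer.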
Next, look at what comes after the call to sendNotifications().

Around line 907 there is a loop; let's go inside it.
Around line 914

// Take the next vote notification from the receive queue
 Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

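recvqueue is our receive queue. Here a vote message is taken off the receive queue and assigned to a Notification. If nothing arrives within notTimeout milliseconds, poll returns null and the timeout is doubled, up to maxNotificationInterval.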
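The poll-then-back-off pattern in the loop can be sketched on its own. The constant values below (200 ms and 60000 ms) are assumptions for illustration and should be checked against FastLeaderElection; BackoffSketch and nextTimeout are hypothetical names.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class BackoffSketch {
    // Assumed values for illustration; check FastLeaderElection for the real constants.
    static final int FINALIZE_WAIT_MS = 200;
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60_000;

    /** Doubles the timeout, capped at the maximum notification interval. */
    static int nextTimeout(int current) {
        return Math.min(current * 2, MAX_NOTIFICATION_INTERVAL_MS);
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> recvqueue = new LinkedBlockingQueue<>();
        int timeout = 1; // tiny starting value so the demo finishes quickly
        for (int i = 0; i < 3; i++) {
            // poll() waits up to the timeout and returns null if nothing arrived
            String n = recvqueue.poll(timeout, TimeUnit.MILLISECONDS);
            if (n == null) {
                timeout = nextTimeout(timeout);
                System.out.println("no notification, next timeout = " + timeout + " ms");
            }
        }
    }
}
```

Starting from FINALIZE_WAIT_MS, repeated timeouts would give 400, 800, 1600 ms and so on until the cap is reached, which keeps an isolated node from flooding the network with retries.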

Around line 938

else if (validVoter(n.sid) && validVoter(n.leader)) {

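This checks that both n.sid and n.leader belong to the voters of the current cluster. n.sid is the server id (myid) of the sender, and n.leader is the server the sender is voting for.
Then comes another check, switch (n.state), which branches on the state of the sending node. We will look at the LOOKING case.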

case LOOKING:
// If notification > current, replace and send messages out
  // Is the received election epoch larger than ours? If so, the sender is in a newer round
  if (n.electionEpoch > logicalclock.get()) {
      logicalclock.set(n.electionEpoch);  // catch up to the newer round
      recvset.clear();  // discard the votes collected in the old round

      /*
       * We return true if one of the following three cases hold:
       * 1- New epoch is higher
       * 2- New epoch is the same as current epoch, but new zxid is higher
       * 3- New epoch is the same as current epoch, new zxid is the same
       *  as current zxid, but server id is higher.
       */
      if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
              getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
          updateProposal(n.leader, n.zxid, n.peerEpoch); // the received vote wins: adopt it as our proposal
      } else {  // the received vote does not win: propose our own initial vote
          updateProposal(getInitId(),
                  getInitLastLoggedZxid(),
                  getPeerEpoch());
      }
      sendNotifications();  // broadcast the updated vote
  } else if (n.electionEpoch < logicalclock.get()) {  // the sender is in an older round, so its vote is ignored
      if(LOG.isDebugEnabled()){
          LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                  + Long.toHexString(n.electionEpoch)
                  + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
      }
      break;
      // Same election epoch: compare the votes (peer epoch, then zxid, then myid)
  } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
          proposedLeader, proposedZxid, proposedEpoch)) {
      updateProposal(n.leader, n.zxid, n.peerEpoch);
      sendNotifications();
  }

  if(LOG.isDebugEnabled()){
      LOG.debug("Adding vote: from=" + n.sid +
              ", proposed leader=" + n.leader +
              ", proposed zxid=0x" + Long.toHexString(n.zxid) +
              ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
  }

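  // Record the sender's vote locally; it is used for the final decision
  recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

  // Check whether the election can end; by default more than half of the voters must agree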
  if (termPredicate(recvset,
          new Vote(proposedLeader, proposedZxid,
                  logicalclock.get(), proposedEpoch))) {

      // Verify if there is any change in the proposed leader
      while((n = recvqueue.poll(finalizeWait,
              TimeUnit.MILLISECONDS)) != null){
          if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                  proposedLeader, proposedZxid, proposedEpoch)){
              recvqueue.put(n); // a better vote arrived: put it back and keep electing
              break;
          }
      }

      /*
       * This predicate is true once we don't read any new
       * relevant message from the reception queue
       */
      if (n == null) {
          self.setPeerState((proposedLeader == self.getId()) ?
                  ServerState.LEADING: learningState());

          Vote endVote = new Vote(proposedLeader,
                  proposedZxid, proposedEpoch);
          leaveInstance(endVote);
          return endVote;
      }
  }
  break;

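The comments here are fairly detailed, so I will only summarize what happens:
1. Compare the received election epoch with our own. If it is larger, the sender is in a newer election round, so we adopt that epoch and clear the votes collected so far. If it is smaller, the received vote is stale and is ignored.
2. Then apply the election rule:

  • A received vote with a higher epoch wins
  • If the received epoch equals the current epoch, the vote with the higher zxid wins
  • If the received epoch equals the current epoch and the zxid also equals the current zxid, the vote with the higher myid wins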
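The first step, comparing election rounds, is a three-way branch. A minimal sketch, where RoundCheckSketch, Action and classify are hypothetical names that only label the three branches of the LOOKING case:

```java
final class RoundCheckSketch {
    enum Action { JOIN_NEWER_ROUND, IGNORE_STALE_VOTE, COMPARE_IN_SAME_ROUND }

    /** Decides how to treat a notification based on its election epoch. */
    static Action classify(long receivedElectionEpoch, long logicalclock) {
        if (receivedElectionEpoch > logicalclock) {
            // adopt the newer epoch, clear recvset, then re-evaluate our proposal
            return Action.JOIN_NEWER_ROUND;
        }
        if (receivedElectionEpoch < logicalclock) {
            // the sender is behind; its vote is not recorded
            return Action.IGNORE_STALE_VOTE;
        }
        // same round: compare the vote against our current proposal
        return Action.COMPARE_IN_SAME_ROUND;
    }

    public static void main(String[] args) {
        System.out.println(classify(5, 4));
        System.out.println(classify(3, 4));
        System.out.println(classify(4, 4));
    }
}
```

Only votes from the same round are ever counted together, which is why recvset is cleared when a node jumps to a newer round.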

Now let's look at the code.
The totalOrderPredicate() method is the entry point of the algorithm.

/*
     * We return true if one of the following three cases hold:
        * 1- New epoch is higher
        * 2- New epoch is the same as current epoch, but new zxid is higher
        * 3- New epoch is the same as current epoch, new zxid is the same
        *  as current zxid, but server id is higher.
       */
       if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
              getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
          updateProposal(n.leader, n.zxid, n.peerEpoch); // the received vote wins: adopt it as our proposal
       } else {  // the received vote does not win: propose our own initial vote
           updateProposal(getInitId(),
                   getInitLastLoggedZxid(),
                   getPeerEpoch());
       }

Let's step into the method.

protected boolean totalOrderPredicate(long newId, long newZxid,
		 long newEpoch, long curId, long curZxid, long curEpoch) {
       LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
               Long.toHexString(newZxid) + ", proposed zxid: 0x" +
                Long.toHexString(curZxid));
       if(self.getQuorumVerifier().getWeight(newId) == 0){
          return false;
       }
      /*
      * We return true if one of the following three cases hold:
      * 1- New epoch is higher
      * 2- New epoch is the same as current epoch, but new zxid is higher
      * 3- New epoch is the same as current epoch, new zxid is the same
      *  as current zxid, but server id is higher.
      */

     return ((newEpoch > curEpoch) ||
              ((newEpoch == curEpoch) &&
               ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
   }

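Have a look at the return statement: it is easy to read.
But that is only the comparison. Once it has run, if the received vote wins, we update our proposal from the received vote; otherwise we keep our own values as the proposal.
After the update, sendNotifications() is called to send out our vote.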
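A small worked example makes the ordering concrete. This is a sketch that copies only the return expression; the weight check on newId is left out, and VoteOrderSketch and wins are hypothetical names.

```java
final class VoteOrderSketch {
    /** True if the new vote beats the current one: epoch first, then zxid, then server id. */
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // Case 1: higher epoch wins even with a lower zxid and a lower id
        System.out.println(wins(1, 0x10, 2, 3, 0x99, 1));
        // Case 2: same epoch, higher zxid wins even with a lower id
        System.out.println(wins(1, 0x20, 1, 3, 0x10, 1));
        // Case 3: same epoch and zxid, higher id wins
        System.out.println(wins(3, 0x10, 1, 2, 0x10, 1));
        // An identical vote does not win, so the proposal is left unchanged
        System.out.println(wins(2, 0x10, 1, 2, 0x10, 1));
    }
}
```

Because the comparison is a strict total order over (epoch, zxid, id), every node that sees the same set of votes ends up proposing the same server.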
Is that the end? Not yet, so keep reading.

Around line 994 there is this line:

recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

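Remember recvset? It is one of the two HashMaps defined at the start: recvset stores the votes received in this round, and outofelection stores the votes from peers that have already left the election.
Then there is another check: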

if (termPredicate(recvset,
	        new Vote(proposedLeader, proposedZxid,
	                 logicalclock.get(), proposedEpoch))) {
	
	     // Verify if there is any change in the proposed leader
	     while((n = recvqueue.poll(finalizeWait,
	             TimeUnit.MILLISECONDS)) != null){
	         if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
	                 proposedLeader, proposedZxid, proposedEpoch)){
	             recvqueue.put(n); // a better vote arrived: put it back and keep electing
	             break;
	         }
	     }
	
	     /*
	      * This predicate is true once we don't read any new
	      * relevant message from the reception queue
	      */
	     if (n == null) {
	         self.setPeerState((proposedLeader == self.getId()) ?
	                 ServerState.LEADING: learningState());
	
	         Vote endVote = new Vote(proposedLeader,
	                 proposedZxid, proposedEpoch);
	         leaveInstance(endVote);
	         return endVote;
	     }
	 }

What does termPredicate do? First look at its arguments.
recvset, as we just saw, stores the votes received so far.
new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch) simply builds a Vote object from our current proposal.
Now for the termPredicate method itself:

private boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
      SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
       voteSet.addQuorumVerifier(self.getQuorumVerifier());
      if (self.getLastSeenQuorumVerifier() != null
             && self.getLastSeenQuorumVerifier().getVersion() > self
                      .getQuorumVerifier().getVersion()) {
           voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
       }

       /*
        * First make the views consistent. Sometimes peers will have different zxids for a server depending on timing.
        */
       // Walk through the received votes; for each one equal to our proposal, add the sender's sid to the ack set
       for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
           if (vote.equals(entry.getValue())) {
               voteSet.addAck(entry.getKey());
           }
       }
       // Check whether the acks form a majority
       return voteSet.hasAllQuorums();
}

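This method does two important things:
1. voteSet.addAck(entry.getKey())
It walks through the votes received so far and, for each one equal to our current proposal, adds the sender's sid to the ack set held by voteSet.
2. voteSet.hasAllQuorums() checks whether those acks form a majority.
voteSet is an instance of the SyncedLearnerTracker class.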
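Those two steps can be sketched without the tracker classes. The sketch assumes a single configuration with simple majority voting; TermPredicateSketch and SimpleVote are hypothetical names, and SimpleVote only approximates how the real Vote class compares votes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class TermPredicateSketch {
    /** Hypothetical simplified vote with value equality. */
    static final class SimpleVote {
        final long leader, zxid, electionEpoch, peerEpoch;

        SimpleVote(long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SimpleVote)) {
                return false;
            }
            SimpleVote v = (SimpleVote) o;
            return leader == v.leader && zxid == v.zxid
                    && electionEpoch == v.electionEpoch && peerEpoch == v.peerEpoch;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, zxid, electionEpoch, peerEpoch);
        }
    }

    /** Step 1: collect the sids whose vote equals the proposal. Step 2: majority check. */
    static boolean termPredicate(Map<Long, SimpleVote> votes, SimpleVote proposal, int votingMembers) {
        Set<Long> ackset = new HashSet<>();
        for (Map.Entry<Long, SimpleVote> entry : votes.entrySet()) {
            if (proposal.equals(entry.getValue())) {
                ackset.add(entry.getKey());
            }
        }
        return ackset.size() > votingMembers / 2;
    }

    public static void main(String[] args) {
        Map<Long, SimpleVote> recvset = new HashMap<>();
        recvset.put(1L, new SimpleVote(2, 0x10, 1, 1));
        recvset.put(2L, new SimpleVote(2, 0x10, 1, 1));
        // Server 3 has not been heard from yet; two of three matching votes already suffice
        System.out.println(termPredicate(recvset, new SimpleVote(2, 0x10, 1, 1), 3));
    }
}
```

Keying the ack set by sid means a server that sends the same vote several times is still counted once.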

First, voteSet.addAck(entry.getKey()):

public boolean addAck(Long sid) {
      boolean change = false;
      for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
           if (qvAckset.getQuorumVerifier().getVotingMembers().containsKey(sid)) {
              qvAckset.getAckset().add(sid);
               change = true;
           }
       }
       return change;
   }

Leave the if condition aside for now and look at what is inside it: qvAckset.getAckset().add(sid);
This calls getAckset() on a static nested class, which returns a HashSet, and then calls add on that HashSet.

public static class QuorumVerifierAcksetPair {
     private final QuorumVerifier qv;
    private final HashSet<Long> ackset;

     public QuorumVerifierAcksetPair(QuorumVerifier qv, HashSet<Long> ackset) {                
        this.qv = qv;
        this.ackset = ackset;
    }

     public QuorumVerifier getQuorumVerifier() {
        return this.qv;
    }

     public HashSet<Long> getAckset() {
         return this.ackset;
     }
  }

Now we know what voteSet.addAck(entry.getKey()) does: it puts entry.getKey() into a HashSet called ackset. Remember what entry.getKey() is? It is the sid of the server whose vote we received.

With voteSet.addAck(entry.getKey()) understood, move on to voteSet.hasAllQuorums():

public boolean hasAllQuorums() {
       for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
           if (!qvAckset.getQuorumVerifier().containsQuorum(qvAckset.getAckset()))
               return false;
       }
       return true;
   }

Look closely at the condition. containsQuorum uses delegation: the check is handed to the QuorumMaj class, and the argument is the ackset, the HashSet we just saw.

Next, let's look at the QuorumMaj class, which defines five fields:

private Map<Long, QuorumServer> allMembers =
 				new HashMap<Long, QuorumServer>();
private HashMap<Long, QuorumServer> votingMembers = 
				new HashMap<Long, QuorumServer>();
private HashMap<Long, QuorumServer> observingMembers = 
				new HashMap<Long, QuorumServer>();
private long version = 0;
private int half;

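The five fields mean:
1. allMembers: all the machines in the cluster
2. votingMembers: the machines that may vote, i.e. the Leader and the Followers
3. observingMembers: the observers in the cluster
4. version: the version of this verifier
5. half: half the number of voting members (votingMembers.size() / 2, using integer division)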

Now let's see what the containsQuorum method does.

public boolean containsQuorum(Set<Long> ackSet) {
       return (ackSet.size() > half);
   }

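This checks whether the size of ackSet is greater than half.
So where is half assigned? In the constructor, of course.
QuorumMaj is only a little over 100 lines long, so let's look for the place where half is set.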

public QuorumMaj(Map<Long, QuorumServer> allMembers) {
       this.allMembers = allMembers;
       for (QuorumServer qs : allMembers.values()) {
           if (qs.type == LearnerType.PARTICIPANT) {
               votingMembers.put(Long.valueOf(qs.id), qs);
           } else {
              observingMembers.put(Long.valueOf(qs.id), qs);
           }
       }
      half = votingMembers.size() / 2;
   }

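How does this work? QuorumMaj has two constructors:
1. When the argument is the allMembers map (the constructor shown above), each server is placed into votingMembers or observingMembers according to its LearnerType, and half is votingMembers.size() / 2
2. When the argument is the Properties object produced by parsing the configuration file (not shown here), the server ids and roles are parsed and stored in the corresponding maps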

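So
the ackSet.size() > half condition we saw in containsQuorum returns true when more than half of the voting members have acknowledged the vote.

This also confirms that a zk cluster with n voting servers needs at least (n/2 + 1) of them, with n/2 rounded down, to be running normally.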
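The arithmetic is easy to check for small clusters. A minimal sketch, assuming simple majority quorums as in QuorumMaj; QuorumSizeSketch, minQuorum and tolerableFailures are hypothetical helper names.

```java
final class QuorumSizeSketch {
    /** Mirrors the check in containsQuorum: strictly more than half of the voters. */
    static boolean containsQuorum(int acks, int votingMembers) {
        int half = votingMembers / 2; // integer division, as in the QuorumMaj constructor
        return acks > half;
    }

    /** Smallest number of acks that passes the check. */
    static int minQuorum(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    /** Number of voters that can be down while a quorum is still reachable. */
    static int tolerableFailures(int votingMembers) {
        return votingMembers - minQuorum(votingMembers);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " voters: quorum = " + minQuorum(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

With 3 voters the quorum is 2 and one failure is tolerated; with 4 voters the quorum is 3 and still only one failure is tolerated. That is why an even-sized ensemble adds no fault tolerance over the next smaller odd size.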

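Summary:
The zk election algorithm:
1. Process votes
A received vote with a higher epoch wins
If the received epoch equals the current epoch, the vote with the higher zxid wins
If the received epoch equals the current epoch and the zxid equals the current zxid, the vote with the higher myid wins
2. Pick the winner
The candidate backed by more than half of the voters wins the election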
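Putting both steps together, here is a toy simulation of one election. It is a sketch under strong assumptions that do not hold in a real cluster: all servers start at the same moment in the same round, every message is delivered, and delivery happens in lock-step rounds. ElectionSimSketch and everything inside it are hypothetical names.

```java
final class ElectionSimSketch {
    /** Hypothetical simplified vote: proposed leader, its zxid and its peer epoch. */
    static final class Vote {
        final long leader, zxid, peerEpoch;

        Vote(long leader, long zxid, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }

        /** Step 1 of the summary: epoch first, then zxid, then server id. */
        boolean beats(Vote o) {
            return peerEpoch > o.peerEpoch
                    || (peerEpoch == o.peerEpoch
                        && (zxid > o.zxid || (zxid == o.zxid && leader > o.leader)));
        }

        boolean sameAs(Vote o) {
            return leader == o.leader && zxid == o.zxid && peerEpoch == o.peerEpoch;
        }
    }

    /** Runs broadcast rounds until every server sees a majority for its proposal. */
    static long elect(long[] ids, long[] zxids, long[] epochs) {
        int n = ids.length;
        Vote[] proposal = new Vote[n];
        for (int i = 0; i < n; i++) {
            proposal[i] = new Vote(ids[i], zxids[i], epochs[i]); // everyone first votes for itself
        }
        Vote[][] recv = new Vote[n][n]; // recv[r][s] = last vote server r received from server s
        while (true) {
            Vote[] sent = proposal.clone(); // snapshot: all servers broadcast at the same time
            for (int r = 0; r < n; r++) {
                for (int s = 0; s < n; s++) {
                    recv[r][s] = sent[s];
                    if (sent[s].beats(proposal[r])) {
                        proposal[r] = sent[s]; // adopt the better vote
                    }
                }
            }
            boolean allDecided = true;
            for (int r = 0; r < n; r++) {
                int acks = 0;
                for (int s = 0; s < n; s++) {
                    if (recv[r][s].sameAs(proposal[r])) {
                        acks++;
                    }
                }
                if (acks <= n / 2) { // step 2 of the summary: more than half must agree
                    allDecided = false;
                }
            }
            if (allDecided) {
                return proposal[0].leader; // every server holds the same proposal at this point
            }
        }
    }

    public static void main(String[] args) {
        long leader = elect(new long[] {1, 2, 3}, new long[] {0x10, 0x12, 0x12}, new long[] {1, 1, 1});
        System.out.println("elected leader: " + leader);
    }
}
```

In the example, servers 2 and 3 tie on epoch and zxid, so the higher myid decides and server 3 is elected. In a real cluster the servers rarely start at the same time, so a quorum can form, and a leader be chosen, before the server with the highest myid has even joined.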

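Here we only looked at how an election runs when a zk cluster starts up, not at the election that follows a leader crash, but the two are not very different.
That is all for zk leader election for now. There is a lot we did not look at,
such as how the votes are actually sent over the network. None of that gets in the way of reading this part of the source.
You do not always need to go very deep when reading source code; going too deep too early can cloud your judgment. Once you are used to the rhythm of reading source, there is still time to dig deeper.

Corrections are welcome. Thank you.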
