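Recap:
In the previous article we found the entry point and identified the election strategy that ZooKeeper (zk) uses, namely FastLeaderElection. This article looks in detail at how the FastLeaderElection algorithm actually runs an election.
All aboard, beep beep.
FastLeaderElection
Below is the core of zk's election algorithm, the lookForLeader() method, which starts around line 870. I have added some comments of my own.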
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
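// Votes received in the current election round
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
// Votes reported by peers that have already left the election (FOLLOWING or LEADING)
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();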
int notTimeout = finalizeWait;
synchronized(this){
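logicalclock.incrementAndGet(); // AtomicLong: advance the logical clock, i.e. the election epoch (electionEpoch)
// Reset our proposal to ourselves: (myid, last logged zxid, peer epoch)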
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
// Send our proposal to every voter
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
// While we are LOOKING, keep exchanging notifications with the other nodes until a leader is elected
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
// Take the next vote notification off the receive queue
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
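if(manager.haveDelivered()){ // Have the outgoing queues been drained?
sendNotifications(); // Delivered but no answer yet: send our notification again
} else {
manager.connectAll(); // Not delivered; other servers may not be up yet, so try to connect to them
}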
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
// Proceed only if both the sender (n.sid) and the leader it proposes (n.leader)
// are voters of the current ensemble
else if (validVoter(n.sid) && validVoter(n.leader)) {
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) { // Branch on the state reported by the sender
case LOOKING:
// If notification > current, replace and send messages out
// Is the received election epoch newer than ours? If so, a newer election round has started
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch); // Catch up to the newer election epoch
recvset.clear(); // Discard the votes collected in the old round
// Re-evaluate our vote against the incoming one
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch); // The received vote wins: adopt it as our proposal
} else { // The received vote does not win: fall back to our own initial vote
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications(); // Broadcast the (possibly updated) proposal
} else if (n.electionEpoch < logicalclock.get()) { // The sender is in an older round: its vote is ignored
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
// Same election epoch: compare epoch, zxid and myid against our current proposal
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
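// Record the sender's vote locally; it is used for the final tally
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// Check whether the election can end; by default this requires more than half of the voters to agree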
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n); // A better vote arrived: put it back so the main loop processes it
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid, proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING: // Observers do not vote, so their notifications are only logged
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if(n.electionEpoch == logicalclock.get()){
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
if(termPredicate(recvset, new Vote(n.leader,
n.zxid, n.electionEpoch, n.peerEpoch, n.state))
&& checkLeader(outofelection, n.leader, n.electionEpoch)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify that
* a majority are following the same leader.
* Only peer epoch is used to check that the votes come
* from the same ensemble. This is because there is at
* least one corner case in which the ensemble can be
* created with inconsistent zxid and election epoch
* info. However, given that only one ensemble can be
* running at a single point in time and that each
* epoch is used only once, using only the epoch to
* compare the votes is sufficient.
*
* @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
*/
outofelection.put(n.sid, new Vote(n.leader,
IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
if (termPredicate(outofelection, new Vote(n.leader,
IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
&& checkLeader(outofelection, n.leader, IGNOREVALUE)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecoginized: " + n.state
+ " (n.state), " + n.sid + " (n.sid)");
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}",
manager.getConnectionThreadCount());
}
}
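The code above defines two HashMaps, both of which I annotated:
// Votes received in the current election round
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
// Votes reported by peers that have already left the election (FOLLOWING or LEADING)
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
recvset holds the votes this node has received during the current round, while outofelection holds the votes reported by peers that are already FOLLOWING or LEADING, that is, peers outside the election.
Keep these two variables in mind and read on.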
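Around line 892:
synchronized(this){
    logicalclock.incrementAndGet(); // AtomicLong: advance the logical clock, i.e. the election epoch
    // Reset the proposal to (myid, zxid, epoch)
    updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
updateProposal updates our vote proposal. It takes three arguments, so let's see what each of them is.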
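1. getInitId()
private long getInitId(){
    if(self.getQuorumVerifier().getVotingMembers().containsKey(self.getId()))
        return self.getId();
    else return Long.MIN_VALUE;
}
self.getId() returns our myid. Sound familiar? It is the value from the myid file we configured when setting up the zk cluster. If this node is not a voting member, Long.MIN_VALUE is returned instead.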
2. getInitLastLoggedZxid()
private long getInitLastLoggedZxid(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getLastLoggedZxid();
    else return Long.MIN_VALUE;
}
This returns the highest zxid this node has seen. What is a zxid? It is a transaction id, here the id of the last transaction this node logged.
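3. getPeerEpoch()
private long getPeerEpoch(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        try {
            return self.getCurrentEpoch();
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
    else return Long.MIN_VALUE;
}
A note on epochs, because two different values are involved. logicalclock is the election epoch, which a node increments each time it starts a new election round. getPeerEpoch() returns self.getCurrentEpoch(), the current epoch recorded by this peer, and that is the value carried in the proposal as the peer epoch.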
Now that we know what the three arguments are, let's see what updateProposal itself does.
synchronized void updateProposal(long leader, long zxid, long epoch){
    if(LOG.isDebugEnabled()){
        LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
    }
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}
The method is simple: it assigns the given myid, zxid and epoch to the three fields proposedLeader, proposedZxid and proposedEpoch.
Still no sign of the actual election algorithm? Don't worry, keep reading. What happens after updateProposal has run?
Around line 901:
sendNotifications(); As the name suggests, this method sends notifications. Let's step into it:
private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch, qv.toString().getBytes());
        if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
                    Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get()) +
                    " (n.round), " + sid + " (recipient), " + self.getId() +
                    " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        sendqueue.offer(notmsg);
    }
}
For each voter, proposedLeader, proposedZxid and proposedEpoch are wrapped in a ToSend object, which is then placed on sendqueue, a LinkedBlockingQueue.
So sendNotifications() simply packages proposedLeader, proposedZxid and proposedEpoch into a message and puts that message on a queue. When exactly zk takes messages off this queue is something we will set aside for now; let's continue with the logic that follows.
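To make the "package the proposal and enqueue it" idea concrete, here is a minimal sketch. SendQueueSketch, Message and enqueueForAll are hypothetical names used only for illustration; they stand in for ZooKeeper's ToSend and sendqueue and are not part of its API. The sketch only covers the producer side, since the consumer side is out of scope here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-ins for ZooKeeper's ToSend message and sendqueue.
class SendQueueSketch {

    // One outgoing vote, addressed to a single recipient.
    static final class Message {
        final long recipientSid;
        final long leader;
        final long zxid;
        final long epoch;

        Message(long recipientSid, long leader, long zxid, long epoch) {
            this.recipientSid = recipientSid;
            this.leader = leader;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    // Enqueue one copy of the current proposal per voter; nothing is sent on the network here.
    static LinkedBlockingQueue<Message> enqueueForAll(List<Long> voters,
                                                      long leader, long zxid, long epoch) {
        LinkedBlockingQueue<Message> sendQueue = new LinkedBlockingQueue<>();
        for (long sid : voters) {
            sendQueue.offer(new Message(sid, leader, zxid, epoch));
        }
        return sendQueue;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Message> queue =
                enqueueForAll(Arrays.asList(1L, 2L, 3L), 3L, 0x100L, 1L);
        System.out.println("queued messages: " + queue.size());
        Message m;
        while ((m = queue.poll()) != null) {
            System.out.println("to sid " + m.recipientSid + ": vote for " + m.leader);
        }
    }
}
```

Decoupling the election logic from the network through a queue means the election thread never blocks on a slow or unreachable peer.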
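Continuing after the call to sendNotifications():
Around line 907 there is a loop. Let's look inside it.
Around line 914:
// Take the next vote notification off the receive queue
Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
recvqueue is our receive queue. A vote is taken off it here and assigned to the Notification n. If nothing arrives within notTimeout milliseconds, poll returns null.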
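When poll returns null, lookForLeader() doubles notTimeout up to maxNotificationInterval (the "Exponential backoff" block in the listing). The sketch below isolates that rule. PollBackoffSketch and nextTimeout are hypothetical names, and the 1 ms start and 8 ms cap are illustrative values chosen so the demo finishes quickly, not ZooKeeper's settings.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of polling a receive queue with a capped, doubling timeout.
class PollBackoffSketch {

    // Double the timeout after an empty poll, but never exceed the cap.
    static int nextTimeout(int current, int max) {
        int doubled = current * 2;
        return Math.min(doubled, max);
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> recvQueue = new LinkedBlockingQueue<>();
        int timeout = 1;     // illustrative starting timeout in milliseconds
        final int max = 8;   // illustrative cap in milliseconds
        for (int i = 0; i < 5; i++) {
            // The queue stays empty in this demo, so every poll times out and returns null.
            String notification = recvQueue.poll(timeout, TimeUnit.MILLISECONDS);
            if (notification == null) {
                timeout = nextTimeout(timeout, max);
                System.out.println("no notification, next timeout = " + timeout + " ms");
            }
        }
    }
}
```

With these values the timeout goes 2, 4, 8 and then stays at 8, which is the same shape as the doubling of notTimeout in the listing.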
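Around line 938:
else if (validVoter(n.sid) && validVoter(n.leader)) {
This checks that both n.sid and n.leader belong to the voters of the current ensemble. n.sid is the server id (myid) of the sender, not its address, and n.leader is the server id the sender is voting for.
Next comes another check, switch (n.state), which branches on the sender's state. We will look at the LOOKING case.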
case LOOKING:
// If notification > current, replace and send messages out
// Is the received election epoch newer than ours? If so, a newer election round has started
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch); // Catch up to the newer election epoch
recvset.clear(); // Discard the votes collected in the old round
// Re-evaluate our vote against the incoming one
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch); // The received vote wins: adopt it as our proposal
} else { // The received vote does not win: fall back to our own initial vote
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications(); // Broadcast the (possibly updated) proposal
} else if (n.electionEpoch < logicalclock.get()) { // The sender is in an older round: its vote is ignored
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
// Same election epoch: compare epoch, zxid and myid against our current proposal
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
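// Record the sender's vote locally; it is used for the final tally
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// Check whether the election can end; by default this requires more than half of the voters to agree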
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n); // A better vote arrived: put it back so the main loop processes it
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid, proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
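The comments here are fairly detailed, so I will just summarize what happens:
1. Compare the received election epoch with our current one. If it is larger, a newer election round has begun: we adopt that epoch and clear the votes received so far. If it is smaller, the received vote is stale and is ignored.
2. Then apply the election rule:
- A received epoch greater than the current epoch wins.
- If the received epoch equals the current epoch, the greater zxid wins.
- If the epoch and the zxid are both equal, the greater myid wins.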
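The first step above, deciding what to do with a notification based on its election epoch, can be sketched as a three-way classification. EpochCheckSketch, Action and classify are hypothetical names for illustration; in the real method the three branches are inlined in the LOOKING case rather than returned as a value.

```java
// Hypothetical sketch of the election-epoch check in the LOOKING case.
class EpochCheckSketch {

    enum Action {
        NEW_ROUND, // sender is in a newer round: adopt its epoch and clear collected votes
        IGNORE,    // sender is in an older round: its vote is stale
        COMPARE    // same round: compare the vote with our current proposal
    }

    static Action classify(long receivedElectionEpoch, long localLogicalClock) {
        if (receivedElectionEpoch > localLogicalClock) {
            return Action.NEW_ROUND;
        }
        if (receivedElectionEpoch < localLogicalClock) {
            return Action.IGNORE;
        }
        return Action.COMPARE;
    }

    public static void main(String[] args) {
        System.out.println(classify(5, 4)); // NEW_ROUND
        System.out.println(classify(3, 4)); // IGNORE
        System.out.println(classify(4, 4)); // COMPARE
    }
}
```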
Now let's look at the code.
totalOrderPredicate() is the entry point of the comparison:
/*
 * We return true if one of the following three cases hold:
 * 1- New epoch is higher
 * 2- New epoch is the same as current epoch, but new zxid is higher
 * 3- New epoch is the same as current epoch, new zxid is the same
 *  as current zxid, but server id is higher.
 */
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
        getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
    updateProposal(n.leader, n.zxid, n.peerEpoch); // The received vote wins: adopt it as our proposal
} else { // The received vote does not win: fall back to our own initial vote
    updateProposal(getInitId(),
            getInitLastLoggedZxid(),
            getPeerEpoch());
}
Let's step into the method:
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }
    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
See the return statement? It is easy to follow: the fields are compared in the order epoch, zxid, server id. Note the check before it as well: a candidate whose weight in the quorum verifier is 0 never wins.
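As a small worked example of that ordering, here is a standalone sketch. VoteOrderSketch and beats are hypothetical names; the sketch reproduces only the three-way comparison and leaves out the weight check, which depends on the quorum verifier.

```java
// Hypothetical sketch of the (epoch, zxid, server id) ordering used to compare two votes.
class VoteOrderSketch {

    // Returns true when the new vote should replace the current one.
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // 1. higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // 2. same epoch: higher zxid wins
        }
        return newId > curId;             // 3. same epoch and zxid: higher server id wins
    }

    public static void main(String[] args) {
        System.out.println(beats(1, 0x100, 2, 3, 0x200, 1)); // true: higher epoch beats higher zxid and id
        System.out.println(beats(1, 0x200, 1, 3, 0x100, 1)); // true: same epoch, higher zxid
        System.out.println(beats(3, 0x100, 1, 1, 0x100, 1)); // true: tie on epoch and zxid, higher id
        System.out.println(beats(1, 0x100, 1, 1, 0x100, 1)); // false: identical votes
    }
}
```

The first call shows that a later field only matters when all earlier fields are equal: a higher epoch wins even against a larger zxid and a larger server id.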
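We are not done yet. So far we have only seen the comparison. Once it returns, if the received vote wins, we update our proposal from it; otherwise we reset the proposal to our own initial vote.
After the update, sendNotifications() is called to send out our vote.
Finished? Not yet, keep reading.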
Around line 994 there is this line:
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
Remember recvset? Two HashMaps were defined at the start: recvset stores the votes received in this round, and outofelection stores the votes from peers that have already left the election.
Then another check follows:
if (termPredicate(recvset,
        new Vote(proposedLeader, proposedZxid,
                logicalclock.get(), proposedEpoch))) {
    // Verify if there is any change in the proposed leader
    while((n = recvqueue.poll(finalizeWait,
            TimeUnit.MILLISECONDS)) != null){
        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                proposedLeader, proposedZxid, proposedEpoch)){
            recvqueue.put(n); // A better vote arrived: put it back so the main loop processes it
            break;
        }
    }
    /*
     * This predicate is true once we don't read any new
     * relevant message from the reception queue
     */
    if (n == null) {
        self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING: learningState());
        Vote endVote = new Vote(proposedLeader,
                proposedZxid, proposedEpoch);
        leaveInstance(endVote);
        return endVote;
    }
}
What does termPredicate do? First, let's look at its arguments.
recvset, as we just saw, stores the votes received.
new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch) simply builds a Vote object from our current proposal.
Now the termPredicate method itself:
private boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
    SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
    voteSet.addQuorumVerifier(self.getQuorumVerifier());
    if (self.getLastSeenQuorumVerifier() != null
            && self.getLastSeenQuorumVerifier().getVersion() > self
                    .getQuorumVerifier().getVersion()) {
        voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
    }
    /*
     * First make the views consistent. Sometimes peers will have different
     * zxids for a server depending on timing.
     */
    // For every received vote that equals the given vote, add the sender's sid to the ack set
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }
    // Check whether the acks form a majority
    return voteSet.hasAllQuorums();
}
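This method does two important things:
1. voteSet.addAck(entry.getKey())
It walks through the votes received so far and, for each one equal to the current proposal, adds the sender's sid to the ack set of voteSet.
2. voteSet.hasAllQuorums() checks whether those acks form a majority.
voteSet is an instance of the SyncedLearnerTracker class.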
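The two steps can be sketched in a few lines. TallySketch, its three-field Vote class, ackSet and hasMajority are hypothetical simplifications for illustration: the sketch ignores the handling of multiple quorum verifiers and checks a plain majority over a fixed number of voters.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of "collect matching votes, then check for a majority".
class TallySketch {

    // Simplified vote: proposed leader, its zxid and its epoch.
    static final class Vote {
        final long leader;
        final long zxid;
        final long epoch;

        Vote(long leader, long zxid, long epoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.epoch = epoch;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Vote)) {
                return false;
            }
            Vote other = (Vote) o;
            return leader == other.leader && zxid == other.zxid && epoch == other.epoch;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, zxid, epoch);
        }
    }

    // Step 1: collect the sids of all senders whose vote equals the proposal.
    static Set<Long> ackSet(Map<Long, Vote> received, Vote proposal) {
        Set<Long> acks = new HashSet<>();
        for (Map.Entry<Long, Vote> entry : received.entrySet()) {
            if (proposal.equals(entry.getValue())) {
                acks.add(entry.getKey());
            }
        }
        return acks;
    }

    // Step 2: true when strictly more than half of the voters agree.
    static boolean hasMajority(Set<Long> acks, int votingMembers) {
        return acks.size() > votingMembers / 2;
    }

    public static void main(String[] args) {
        Map<Long, Vote> received = new HashMap<>();
        received.put(1L, new Vote(3, 0x100, 1));
        received.put(2L, new Vote(2, 0x100, 1));
        received.put(3L, new Vote(3, 0x100, 1));

        Set<Long> acks = ackSet(received, new Vote(3, 0x100, 1));
        System.out.println("acks: " + acks.size());                // 2 of the 3 votes match
        System.out.println("majority: " + hasMajority(acks, 3));   // 2 > 3 / 2, so true
    }
}
```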
First, voteSet.addAck(entry.getKey()):
public boolean addAck(Long sid) {
    boolean change = false;
    for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
        if (qvAckset.getQuorumVerifier().getVotingMembers().containsKey(sid)) {
            qvAckset.getAckset().add(sid);
            change = true;
        }
    }
    return change;
}
Never mind what the if condition is for right now; look at its body, qvAckset.getAckset().add(sid);
This goes through the getAckset() method of a static nested class, which returns a HashSet, and then calls add on that HashSet.
public static class QuorumVerifierAcksetPair {
    private final QuorumVerifier qv;
    private final HashSet<Long> ackset;

    public QuorumVerifierAcksetPair(QuorumVerifier qv, HashSet<Long> ackset) {
        this.qv = qv;
        this.ackset = ackset;
    }

    public QuorumVerifier getQuorumVerifier() {
        return this.qv;
    }

    public HashSet<Long> getAckset() {
        return this.ackset;
    }
}
Now we know what voteSet.addAck(entry.getKey()) does: it puts entry.getKey() into a HashSet called ackset. And remember what entry.getKey() is? It is the sid of the server whose vote we received.
With voteSet.addAck(entry.getKey()) understood, let's move on to voteSet.hasAllQuorums():
public boolean hasAllQuorums() {
    for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
        if (!qvAckset.getQuorumVerifier().containsQuorum(qvAckset.getAckset()))
            return false;
    }
    return true;
}
Pay attention to this check. The containsQuorum call uses delegation: the decision is handed to the QuorumMaj class, and the argument passed in is ackset, the HashSet we just saw.
Next, the QuorumMaj class. It defines five fields:
private Map<Long, QuorumServer> allMembers = new HashMap<Long, QuorumServer>();
private HashMap<Long, QuorumServer> votingMembers = new HashMap<Long, QuorumServer>();
private HashMap<Long, QuorumServer> observingMembers = new HashMap<Long, QuorumServer>();
private long version = 0;
private int half;
These five fields mean the following:
1. allMembers: all servers in the ensemble
2. votingMembers: the servers allowed to vote, i.e. the Leader and the Followers
3. observingMembers: the Observers in the ensemble
4. version: the version of this verifier
5. half: half the number of voting members (votingMembers.size() / 2)
Now let's see what the containsQuorum method does.
public boolean containsQuorum(Set<Long> ackSet) {
    return (ackSet.size() > half);
}
It checks whether the size of ackSet is greater than half.
So where is half assigned? In the constructor, of course.
QuorumMaj is only a little over 100 lines long, so it is easy to find where half is set:
public QuorumMaj(Map<Long, QuorumServer> allMembers) {
    this.allMembers = allMembers;
    for (QuorumServer qs : allMembers.values()) {
        if (qs.type == LearnerType.PARTICIPANT) {
            votingMembers.put(Long.valueOf(qs.id), qs);
        } else {
            observingMembers.put(Long.valueOf(qs.id), qs);
        }
    }
    half = votingMembers.size() / 2;
}
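How is this decided?
1. When the argument is the allMembers map, each server is placed into votingMembers or observingMembers according to its LearnerType, and half is votingMembers.size() / 2.
2. When the argument is the Properties object produced by parsing the configuration file, the server id and role are parsed from it and stored in the corresponding map.
So:
The ackSet.size() > half check we saw in containsQuorum returns true exactly when more than half of the voting members have acked.
This also confirms that a zk ensemble with n voting servers needs at least (n/2 + 1) of them running to work normally.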
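To see what the (n/2 + 1) rule means for concrete ensemble sizes, the arithmetic can be tabulated with a short sketch. QuorumSizeSketch and its methods are hypothetical helpers; only the integer division votingMembers / 2 and the strict "greater than half" test are taken from the code above.

```java
// Hypothetical sketch of the majority arithmetic behind ackSet.size() > half.
class QuorumSizeSketch {

    // Same integer division as half = votingMembers.size() / 2.
    static int half(int votingMembers) {
        return votingMembers / 2;
    }

    // Smallest ack count that satisfies acks > half.
    static int minAcks(int votingMembers) {
        return half(votingMembers) + 1;
    }

    // How many voters can be down while a majority is still reachable.
    static int tolerableFailures(int votingMembers) {
        return votingMembers - minAcks(votingMembers);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " voters: need " + minAcks(n)
                    + " acks, tolerate " + tolerableFailures(n) + " failures");
        }
    }
}
```

Under this rule 3 and 4 voters both tolerate one failure, and 5 and 6 both tolerate two, so adding a single voter to an odd-sized ensemble does not increase the number of failures it can survive.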
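Summary:
The zk election algorithm:
1. Processing votes
A received epoch greater than the current epoch wins.
If the received epoch equals the current epoch, the greater zxid wins.
If the epoch and the zxid are both equal, the greater myid wins.
2. Choosing the winner
The candidate backed by more than half of the voters wins the election.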
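We only looked at how the election runs when a zk cluster starts up, not at the election that follows a leader crash, but the two do not differ much.
That is all for zk elections for now. There is a lot we did not look at,
such as how the vote messages are actually sent out. None of that stops us from reading the source.
When reading source code you do not always need to go very deep. Going too deep too early can cloud your judgment, and it is never too late to dig deeper once you are comfortable with the rhythm of reading source.
Please point out anything I got wrong. Thanks!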