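A Detailed Look at ZooKeeper's Fast Leader Election Process
Before walking through the process, here are the roles involved in an election, along with the key classes and variables (source version referenced: 3.4.9):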
Roles (server states):
1. LOOKING: campaigning, i.e. still looking for a leader
2. OBSERVING: observer
3. FOLLOWING: follower
4. LEADING: leader
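Vote information:
1. logicalclock (electionEpoch): the local election round; it is incremented each time this server starts a new round of voting
2. epoch (peerEpoch): the election epoch; it is incremented each time an election finishes with a leader decided (it forms the high 32 bits of the full zxid)
3. zxid: the data ID, incremented on every data change (the counter forms the low 32 bits of the full zxid, which is 64 bits in total)
4. sid: the serverId of the server this vote belongs to
5. leader: the proposed leader (the serverId, i.e. sid, of the proposed server)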
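To make the 64-bit layout concrete, here is a minimal sketch of how an epoch and a counter can be packed into one zxid and read back out. The class ZxidLayout and its method names are made up for this illustration and are not taken from the ZooKeeper source; ZooKeeper ships comparable helpers in org.apache.zookeeper.server.util.ZxidUtils.

```java
// Illustrative sketch only: ZxidLayout is a hypothetical helper, not ZooKeeper code.
final class ZxidLayout {
    private ZxidLayout() {}

    /** Packs an epoch (high 32 bits) and a counter (low 32 bits) into one zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Reads the epoch back from the high 32 bits. */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Reads the counter back from the low 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 7);
        System.out.println("zxid    = 0x" + Long.toHexString(zxid)); // 0x300000007
        System.out.println("epoch   = " + epochOf(zxid));            // 3
        System.out.println("counter = " + counterOf(zxid));          // 7
    }
}
```

Because the epoch sits in the high bits, comparing two zxids as plain long values already orders them by epoch first and by counter second.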
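Vote comparison rules:
1. The vote with the larger epoch wins; if the epochs are equal, go to rule 2
2. The vote with the larger zxid wins; if the zxids are equal, go to rule 3
3. The vote whose proposed leader has the larger sid wins
The source code for these comparison rules is as follows: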
    /**
     * Check if a pair (server id, zxid) succeeds our
     * current vote.
     *
     * @param id    Server identifier
     * @param zxid  Last zxid observed by the issuer of this vote
     */
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if (self.getQuorumVerifier().getWeight(newId) == 0) {
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
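The three rules can be tried out in isolation. The sketch below repeats the same comparison as a standalone class so it can be run without a ZooKeeper ensemble. VoteOrder and the sample values are invented for this illustration, and the weight check that the real method performs through the QuorumVerifier is deliberately left out.

```java
// Illustrative sketch only: VoteOrder is a hypothetical class. The weight check
// done by the real totalOrderPredicate is intentionally omitted.
final class VoteOrder {
    private VoteOrder() {}

    /** Returns true if the new (id, zxid, epoch) triple beats the current one. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // rule 1: the higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // rule 2: the higher zxid wins
        }
        return newId > curId;           // rule 3: the higher server id wins
    }

    public static void main(String[] args) {
        // Higher epoch wins even with a lower zxid and a lower id.
        System.out.println(newVoteWins(1, 5, 2, 3, 9, 1)); // true
        // Same epoch: the higher zxid wins.
        System.out.println(newVoteWins(1, 9, 1, 3, 5, 1)); // true
        // Same epoch and zxid: the higher id wins.
        System.out.println(newVoteWins(3, 5, 1, 1, 5, 1)); // true
        // Identical votes: the new vote does not replace the current one.
        System.out.println(newVoteWins(1, 5, 1, 1, 5, 1)); // false
    }
}
```

Since an identical vote never wins, a server that receives its own proposal back leaves that proposal unchanged.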
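The following first describes the overall election flow. For now, do not worry about how the vote data is exchanged between servers and simply treat it as available; how votes are exchanged during an election is covered later.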
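1. First, increment logicalclock, propose yourself as leader, and broadcast that vote
2. Enter the voting loop for this round
3. Take one vote from the recvqueue queue. If nothing arrives, check whether to resend your own vote or to reconnect; otherwise go to step 4
4. Check the election state carried in the vote:
   LOOKING state:
      1. If the sender's logicalclock is greater than the local logicalclock, update the local logicalclock and clear the local vote tally recvset, then compare yourself as a candidate against the leader in the received vote, take the larger as the new vote, and broadcast it; otherwise go to sub-step 2
      2. If the sender's logicalclock is smaller than the local logicalclock, ignore the vote and go back to the next iteration of the loop; otherwise go to sub-step 3
      3. If the two logicalclock values are equal, compare the leader currently proposed locally with the leader in the received vote; if the received one is larger, take it as the new vote and broadcast it
      4. After sub-step 1 or 3, save the sender's vote in recvset and check whether the currently proposed leader holds a majority there (more than half of the servers). If so, wait a further finalizeWait (continuing to poll recvqueue) to see whether anyone proposes a better leader. If such a vote arrives, put it back into recvqueue and start the next iteration of the loop; otherwise settle the role and end the election
   OBSERVING state: observers have no voting rights, so the vote is ignored and the loop moves on to its next iteration
   FOLLOWING/LEADING state:
      1. If the sender's logicalclock equals the local logicalclock, save the sender's vote in recvset, then check whether that vote holds a majority in recvset and whether the leader it names is confirmed to be leading. If so, settle the role and end the election; otherwise go to sub-step 2
      2. Put the sender's vote into outofelection, the local tally of votes from servers that are not taking part in the election, then check whether that vote holds a majority in outofelection and whether the leader it names is confirmed to be leading. If so, set logicalclock to the sender's electionEpoch, settle the role, and end the election; otherwise go to the next iteration of the loop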
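The majority test used in the steps above can be sketched on its own. This is a simplified illustration rather than the real termPredicate: the class MajorityCheck is hypothetical, it compares only the proposed leader id instead of whole Vote objects, and it assumes a plain more-than-half majority instead of going through a QuorumVerifier.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: MajorityCheck is a hypothetical class that assumes
// a simple "more than half of the ensemble" majority rule.
final class MajorityCheck {
    private MajorityCheck() {}

    /**
     * @param recvset        maps the sid of each voter to the leader it proposes
     * @param proposedLeader the leader this server currently proposes
     * @param ensembleSize   number of voting servers in the ensemble
     * @return true if more than half of the ensemble proposes proposedLeader
     */
    static boolean hasMajority(Map<Long, Long> recvset, long proposedLeader, int ensembleSize) {
        int supporters = 0;
        for (long votedFor : recvset.values()) {
            if (votedFor == proposedLeader) {
                supporters++;
            }
        }
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 3L); // server 1 proposes server 3
        recvset.put(2L, 3L); // server 2 proposes server 3
        System.out.println(hasMajority(recvset, 3L, 3)); // true: 2 of 3
        System.out.println(hasMajority(recvset, 3L, 5)); // false: 2 of 5
    }
}
```

With three servers, two matching votes are already enough to settle the election; with five servers, the same two votes are not.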
The source code of the election flow is as follows:
    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
            self.start_fle = System.currentTimeMillis();
        }
        try {
            // Votes tallied by this server
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            // Votes from servers in FOLLOWING or LEADING state, i.e. not LOOKING
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            // Propose ourselves as leader
            synchronized (this) {
                logicalclock++;
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id = " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)) {
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if (n == null) {
                    if (manager.haveDelivered()) {
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout * 2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval ?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                } else if (self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock) {
                            logicalclock = n.electionEpoch;
                            recvset.clear();
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock) {
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // Cache the sender's vote for the final tally
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock, proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while ((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING : learningState());

                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid,
                                        logicalclock,
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if (n.electionEpoch == logicalclock) {
                            recvset.put(n.sid, new Vote(n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch));

                            if (ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING : learningState());

                                Vote endVote = new Vote(n.leader,
                                        n.zxid,
                                        n.electionEpoch,
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                n.leader,
                                n.zxid,
                                n.electionEpoch,
                                n.peerEpoch,
                                n.state));

                        if (ooePredicate(outofelection, outofelection, n)) {
                            synchronized (this) {
                                logicalclock = n.electionEpoch;
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING : learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if (self.jmxLeaderElectionBean != null) {
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
        }
    }
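The "Exponential backoff" branch in the loop above doubles the poll timeout every time no notification arrives, up to a fixed ceiling. The sketch below isolates that calculation. NotificationBackoff is a hypothetical class, and the starting value of 200 ms and the ceiling of 60000 ms are assumptions standing in for finalizeWait and maxNotificationInterval; check them against the FastLeaderElection source you are reading.

```java
// Illustrative sketch only: NotificationBackoff is a hypothetical class, and both
// constants are assumed values standing in for finalizeWait and maxNotificationInterval.
final class NotificationBackoff {
    static final int FINALIZE_WAIT_MS = 200;
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000;

    private NotificationBackoff() {}

    /** Doubles the current timeout, capped at the maximum notification interval. */
    static int next(int currentTimeoutMs) {
        int doubled = currentTimeoutMs * 2;
        return Math.min(doubled, MAX_NOTIFICATION_INTERVAL_MS);
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT_MS;
        // Prints 200, 400, 800, ... and stays at 60000 once the cap is reached.
        for (int i = 0; i < 10; i++) {
            System.out.println("poll timeout: " + timeout + " ms");
            timeout = next(timeout);
        }
    }
}
```

The cap keeps a server that is cut off from its peers from waiting ever longer between attempts to resend its vote or reconnect.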
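The election flowchart is as follows:
That covers the fast election flow. So how is the data exchanged during an election? The rest of this article explains.
In ZooKeeper's startup script zkServer.cmd you can see the following lines: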
    set ZOOMAIN=org.apache.zookeeper.server.quorum.QuorumPeerMain
    echo on
    call %JAVA% "-Dzookeeper.log.dir=%ZOO_LOG_DIR%" "-Dzookeeper.root.logger=%ZOO_LOG4J_PROP%" -cp "%CLASSPATH%" %ZOOMAIN% "%ZOOCFG%" %*
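From this we learn that the startup class is org.apache.zookeeper.server.quorum.QuorumPeerMain. Tracing the code shows that the election flow is the lookForLeader() method of the FastLeaderElection class, and that the actual network interaction happens in the QuorumCnxManager class. The class relationships are shown in the following two diagrams: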
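Details:
The QuorumCnxManager class is where the network interaction actually happens; it is responsible for collecting and sending votes over the network. As the class diagram shows, it has an inner class named Listener, which makes sure there is exactly one connection between each pair of servers and starts two threads that send and receive votes: SendWorker and RecvWorker.
The FastLeaderElection class also has two inner classes responsible for sending and receiving votes: WorkerSender and WorkerReceiver.
Sending path: when lookForLeader() sends a vote, it puts the vote into the sendqueue queue of FastLeaderElection. WorkerSender (FastLeaderElection) moves messages from sendqueue into queueSendMap in QuorumCnxManager, and SendWorker (QuorumCnxManager) sends the votes in queueSendMap out over the network.
Receiving path: RecvWorker (QuorumCnxManager) receives votes from the network and puts them into the recvQueue queue of QuorumCnxManager; WorkerReceiver (FastLeaderElection) takes messages from recvQueue and puts them into the recvqueue queue of FastLeaderElection.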
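The two paths can be modelled with plain in-memory queues. The sketch below is only a toy model of the hand-offs described above: QueuePipeline, its Message class and the step methods are hypothetical names, there are no sockets or threads, and the "wire" queue simply loops a sent message back as if it had arrived from a peer. Each step method stands for one loop iteration of the worker thread it is named after.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: all names here are hypothetical; no sockets or threads.
final class QueuePipeline {
    /** A vote plus the sid of the peer it is addressed to. */
    static final class Message {
        final long sid;
        final String vote;
        Message(long sid, String vote) { this.sid = sid; this.vote = vote; }
    }

    // Plays the role of FastLeaderElection.sendqueue
    final BlockingQueue<Message> sendqueue = new LinkedBlockingQueue<>();
    // Plays the role of QuorumCnxManager.queueSendMap: one outgoing queue per peer sid
    final Map<Long, BlockingQueue<String>> queueSendMap = new ConcurrentHashMap<>();
    // Stand-in for the network (assumption: loops messages straight back)
    final BlockingQueue<Message> wire = new LinkedBlockingQueue<>();
    // Plays the role of QuorumCnxManager.recvQueue (capital Q)
    final BlockingQueue<Message> recvQueue = new LinkedBlockingQueue<>();
    // Plays the role of FastLeaderElection.recvqueue (lower-case q)
    final BlockingQueue<Message> recvqueue = new LinkedBlockingQueue<>();

    /** WorkerSender role: sendqueue -> queueSendMap. */
    void workerSenderStep() {
        Message m = sendqueue.poll();
        if (m != null) {
            queueSendMap.computeIfAbsent(m.sid, k -> new LinkedBlockingQueue<>()).add(m.vote);
        }
    }

    /** SendWorker role: queueSendMap -> network, for a single peer. */
    void sendWorkerStep(long sid) {
        BlockingQueue<String> q = queueSendMap.get(sid);
        String vote = (q == null) ? null : q.poll();
        if (vote != null) {
            wire.add(new Message(sid, vote));
        }
    }

    /** RecvWorker role: network -> recvQueue. */
    void recvWorkerStep() {
        Message m = wire.poll();
        if (m != null) {
            recvQueue.add(m);
        }
    }

    /** WorkerReceiver role: recvQueue -> recvqueue, which lookForLeader() polls. */
    void workerReceiverStep() {
        Message m = recvQueue.poll();
        if (m != null) {
            recvqueue.add(m);
        }
    }

    public static void main(String[] args) {
        QueuePipeline p = new QueuePipeline();
        p.sendqueue.add(new Message(2L, "leader=3")); // what sendNotifications() would enqueue
        p.workerSenderStep();
        p.sendWorkerStep(2L);
        p.recvWorkerStep();
        p.workerReceiverStep();
        Message got = p.recvqueue.poll();
        System.out.println("sid " + got.sid + ": " + got.vote); // sid 2: leader=3
    }
}
```

Separating the election logic from the connection handling with queues means lookForLeader() never blocks on a socket; it only polls recvqueue with a timeout.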
I made a copy of the 3.4.9 source code and added a few comments: https://github.com/learnertogether/zookeeper-3.4.9.git