首先在ZK中leader选举有这么几个细节:
1.ZXID最大会被设置为leader,因为ZXID越大,数据越新;
2.如果集群中有几个服务器具有相同的ZXID,那么SID较大的那台服务器成为leader;
3.epoch随着新leader的产生会递增;
4.服务器状态:
服务器具有四种状态,分别是LOOKING、FOLLOWING、LEADING、OBSERVING。
LOOKING:寻找leader状态。当服务器处于该状态时,它会认为当前集群中没有leader,因此需要进入leader选举状态。
FOLLOWING:跟随者状态。表明当前服务器角色是follower。
LEADING:领导者状态。表明当前服务器角色是leader。
OBSERVING:观察者状态。表明当前服务器角色是observer。
zxid,也就是事务id,为了保证事务的顺序一致性,zookeeper采用了递增的事务id号(zxid)来标识事务。所有的提议(proposa)都在被提出的时候加上了zxid。实现中zxid是一个64位的数字,它高32位是epoch (ZAB协议通过epoch编号来
区分leader周期变化的策略)用来标识leader关系是否改变,每次一个leader被选出来,它都会有一个新的epoch=(原来的epoch+1),标识当前属于那个leader的统治时期。低32位用于递增计数。
源码分析
首先入口为org.apache.zookeeper.server.quorum.QuorumPeerMain:
有个main方法,调用了initializeAndRun方法:
会判断是单机还是集群模式,单机就没必要涉及到选举了:
选举调用了runFromConfig()方法,从名称也能知道是干嘛的,基于配置文件搞点事情:
最后会阻塞式启动:
会从磁盘文件中恢复一些数据,随后会调用startLeaderElection方法进行选举操作:
如果节点状态是LOOKING的话,会投票给自己;所有的节点都会构造一个Vote的投票对象;会根据electionType获取选举的算法:
从3.4.0版本开始zookeeper只支持基于Tcp的FastLeaderElection选举协议:
会构造一个FastLeaderElection对象:
调用了一个starter方法:
再回到构建选举算法的方法:
再看看FastLeaderElection的start()方法:
主要做了两个事情:
选举初始化之后会调用super.start()方法:
前面是进行一些JMX的操作:
接下来,首先会判断当前节点的状态:
最后会进入投票的核心逻辑:
从之前的队列中获取消息:
整体代码如下:
/**
* Starts a new round of leader election. Whenever our QuorumPeer
* changes its state to LOOKING, this method is invoked, and it
* sends notifications to all other peers.
*/
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
//收到的投票
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
//存储选举结果
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
logicalclock.incrementAndGet(); //增加逻辑时钟
//吃耍自己的zxid和epoch
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
sendNotifications(); //发送投票,包括发送给自己
/*
* Loop in which we exchange notifications until we find a leader
*/
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){//主循环,直到选举出leader
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
// 从IO线程里拿到投票消息,自己的投票也在这里处理
//LinkedBlockedQueue()
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
//如果空闲的情况下,消息发完了,继续发送,一直到选出leader为止
if(manager.haveDelivered()){
sendNotifications();
} else {
//消息还没投递出去,可能是其他server还没启动,尝试再连接
manager.connectAll();
}
/*
* Exponential backoff
*/
//延长超时时间
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
//收到了投票消息,判断收到的消息是不是属于这个集群内
else if (self.getCurrentAndNextConfigVoters().contains(n.sid)) {
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view.
*/
switch (n.state) {//判断收到消息的节点的状态
case LOOKING:
if (getInitLastLoggedZxid() == -1) {
LOG.debug("Ignoring notification as our zxid is -1");
break;
}
if (n.zxid == -1) {
LOG.debug("Ignoring notification from member with -1 zxid" + n.sid);
break;
}
// If notification > current, replace and send messages out
//判断接收到的节点epoch大于logicalclock,则表示当前是新一轮的选举
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch); //更新本地的logicalclock
recvset.clear(); //清空接收队列
//检查收到的这个消息是否可以胜出,一次比较epoch,zxid、myid
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
//胜出以后,把投票改为对方的票据
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {//否则,票据不变
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications();//继续广播消息,让其他节点知道我现在的票据
//如果收到的消息epoch小于当前节点的epoch,则忽略这条消息
} else if (n.electionEpoch < logicalclock.get()) {
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
//如果是epoch相同的话,就继续比较zxid、myid,如果胜出,则更新自己的票据,并且发出广播
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
//添加到本机投票集合,用来做选举终结判断
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//判断选举是否结束,默认算法是超过半数server同意
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
//一直等新的notification到达,直到超时
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
//确定leader
if (n == null) {
//修改状态,LEADING or FOLLOWING
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
//返回最终投票结果
Vote endVote = new Vote(proposedLeader,
proposedZxid, proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
//如果收到的选票状态不是LOOKING,比如这台机器刚加入一个已经正在运行的zk集群时
//OBSERVING机器不参数选举
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
//这2种需要参与选举
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if(n.electionEpoch == logicalclock.get()){ //判断epoch是否相同
//加入到本机的投票集合
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//投票是否结束,如果结束,再确认LEADER是否有效
//如果结束,修改自己的状态并返回投票结果
if(termPredicate(recvset, new Vote(n.leader,
n.zxid, n.electionEpoch, n.peerEpoch, n.state))
&& checkLeader(outofelection, n.leader, n.electionEpoch)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify that
* a majority are following the same leader.
* Only peer epoch is used to check that the votes come
* from the same ensemble. This is because there is at
* least one corner case in which the ensemble can be
* created with inconsistent zxid and election epoch
* info. However, given that only one ensemble can be
* running at a single point in time and that each
* epoch is used only once, using only the epoch to
* compare the votes is sufficient.
*
* @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
*/
outofelection.put(n.sid, new Vote(n.leader,
IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
if (termPredicate(outofelection, new Vote(n.leader,
IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
&& checkLeader(outofelection, n.leader, IGNOREVALUE)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecoginized: " + n.state
+ " (n.state), " + n.sid + " (n.sid)");
break;
}
} else {
LOG.warn("Ignoring notification from non-cluster member " + n.sid);
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
}
}
实际的比较逻辑如下:
具体的ZooKeeper leader选举机制可参看:
https://www.jianshu.com/p/57fecbe70540
https://www.cnblogs.com/ASPNET2008/p/6421571.html
https://blog.csdn.net/gaoshan12345678910/article/details/67638657