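- TCP-based leader election: FastLeaderElection
Since version 3.4.0, ZooKeeper officially recommends FastLeaderElection as the only leader election implementation, so this section covers only the core methods and inner classes of this class.
- Notification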
static public class Notification {
    /*
     * Format version, introduced in 3.4.6
     */
    public final static int CURRENTVERSION = 0x2;
    int version;
    /*
     * Proposed leader
     */
    long leader;
    /*
     * zxid of the proposed leader
     */
    long zxid;
    /*
     * Epoch
     */
    long electionEpoch;
    /*
     * current state of sender
     */
    QuorumPeer.ServerState state;
    /*
     * Address of sender
     */
    long sid;
    QuorumVerifier qv;
    /*
     * epoch of the proposed leader
     */
    long peerEpoch;
}
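This class wraps the vote information received from another server: the format version, the serverId of the proposed leader, the zxid (largest transaction id) of the proposed leader, the election epoch, the peer epoch of the proposed leader, the state of the sender, and the serverId (sid) of the server that sent the vote.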
- ToSend
static public class ToSend {
    static enum mType {crequest, challenge, notification, ack}
    ToSend(mType type,
            long leader,
            long zxid,
            long electionEpoch,
            ServerState state,
            long sid,
            long peerEpoch,
            byte[] configData) {
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.state = state;
        this.sid = sid;
        this.peerEpoch = peerEpoch;
        this.configData = configData;
    }
    /*
     * Proposed leader in the case of notification
     */
    long leader;
    /*
     * id contains the tag for acks, and zxid for notifications
     */
    long zxid;
    /*
     * Epoch
     */
    long electionEpoch;
    /*
     * Current state;
     */
    QuorumPeer.ServerState state;
    /*
     * Address of recipient
     */
    long sid;
    /*
     * Used to send a QuorumVerifier (configuration info)
     */
    byte[] configData = dummyData;
    /*
     * Leader epoch
     */
    long peerEpoch;
}
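This class wraps the vote information sent to another server: the serverId of the proposed leader, the zxid (largest transaction id) of the proposed leader, the election epoch, the peer epoch of the proposed leader, the current state of the sender, the serialized QuorumVerifier (configData), and the serverId (sid) of the server that will receive the vote. Unlike Notification, it has no version field; the version is written when the packet is built.
As this shows, ToSend and Notification are essentially counterparts: one is used for sending, the other for receiving.
Message handler: Messenger
It has two inner classes: a message-receiving thread class and a message-sending thread class.
Message-sending thread class: WorkerSender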
public void run() {
    while (!stop) {
        try {
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if(m == null) continue;
            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
    LOG.info("WorkerSender is down");
}
/**
 * Called by run() once there is a new message to send.
 *
 * @param m message to send
 */
void process(ToSend m) {
    ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
            m.leader,
            m.zxid,
            m.electionEpoch,
            m.peerEpoch,
            m.configData);
    manager.toSend(m.sid, requestBuffer);
}
static ByteBuffer buildMsg(int state,
        long leader,
        long zxid,
        long electionEpoch,
        long epoch,
        byte[] configData) {
    byte requestBytes[] = new byte[44 + configData.length];
    ByteBuffer requestBuffer = ByteBuffer.wrap(requestBytes);
    /*
     * Building notification packet to send
     */
    requestBuffer.clear();
    requestBuffer.putInt(state);
    requestBuffer.putLong(leader);
    requestBuffer.putLong(zxid);
    requestBuffer.putLong(electionEpoch);
    requestBuffer.putLong(epoch);
    requestBuffer.putInt(Notification.CURRENTVERSION);
    requestBuffer.putInt(configData.length);
    requestBuffer.put(configData);
    return requestBuffer;
}
- toSend(m.sid, requestBuffer) in QuorumCnxManager:
public void toSend(Long sid, ByteBuffer b) {
    /*
     * If sending message to myself, then simply enqueue it (loopback).
     */
    if (this.mySid == sid) {
        b.position(0);
        addToRecvQueue(new Message(b.duplicate(), sid));
    /*
     * Otherwise send to the corresponding thread to send.
     */
    } else {
        /*
         * Start a new connection if doesn't have one already.
         */
        ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
                SEND_CAPACITY);
        ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
        if (oldq != null) {
            addToSendQueue(oldq, b);
        } else {
            addToSendQueue(bq, b);
        }
        connectOne(sid);
    }
}
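Analysis:
1. Loop, taking ToSend messages from the send queue sendqueue.
2. The buildMsg method converts the ToSend object into a ByteBuffer ready for sending. It first computes the length of the byte array: there are 4 long fields and 3 int fields, plus the byte array configData produced from the QuorumVerifier, so the length is 4 * 8 + 3 * 4 + configData.length = 44 + configData.length;
It then writes, in order, the server state, the serverId of the proposed leader, the zxid of the proposed leader, the election epoch, the peer epoch of the proposed leader, the Notification version number, the length of configData, and configData itself into the ByteBuffer.
3. Hand the ByteBuffer over to QuorumCnxManager for sending:
(1) If the message is addressed to this server itself, it is simply placed on the local receive queue (loopback).
(2) If the message is addressed to another ZooKeeper server, it is put on the send queue for that sid (the queue is created if it does not exist yet); connectOne(sid) then checks whether a SendWorker already exists for that sid and opens a new connection if it does not.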
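The packet layout described in step 2 can be sketched as follows. This is a minimal illustration, not the ZooKeeper code: the class name PacketLayoutSketch and the method encode are hypothetical, and unlike buildMsg the sketch flips the buffer before returning it so that it can be read back directly.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in that mirrors the field order of buildMsg shown above.
final class PacketLayoutSketch {
    // Mirrors Notification.CURRENTVERSION from the listing above.
    static final int CURRENT_VERSION = 0x2;

    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch, byte[] configData) {
        // 4 longs (32 bytes) + 3 ints (12 bytes) = 44 bytes of fixed header.
        int fixed = 4 * Long.BYTES + 3 * Integer.BYTES;
        ByteBuffer buf = ByteBuffer.allocate(fixed + configData.length);
        buf.putInt(state);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.putLong(peerEpoch);
        buf.putInt(CURRENT_VERSION);
        buf.putInt(configData.length);
        buf.put(configData);
        buf.flip(); // sketch-only: make the buffer readable from position 0
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = encode(0, 3L, 9L, 7L, 1L, new byte[9]);
        System.out.println("packet size = " + buf.remaining());
    }
}
```

With a 9-byte configData the packet is 44 + 9 = 53 bytes long, which matches the length formula in step 2.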
Possible code optimization (a JIRA issue has been filed):
ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
        SEND_CAPACITY);
ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
if (oldq != null) {
    addToSendQueue(oldq, b);
} else {
    addToSendQueue(bq, b);
}
This can be simplified to:
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.computeIfAbsent(
        sid, serverId -> new ArrayBlockingQueue<>(SEND_CAPACITY));
addToSendQueue(bq, b);
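To see that the two variants select the same queue, here is a small self-contained sketch. It assumes queueSendMap is a ConcurrentHashMap (so that computeIfAbsent is atomic); the class name SendQueueSketch, the String element type, and the capacity value are illustrative assumptions only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the queue lookup inside toSend.
final class SendQueueSketch {
    static final int SEND_CAPACITY = 1; // assumed value for this sketch

    final ConcurrentHashMap<Long, ArrayBlockingQueue<String>> queueSendMap =
            new ConcurrentHashMap<>();

    // Original pattern: allocates a queue on every call, even if one is already mapped.
    ArrayBlockingQueue<String> viaPutIfAbsent(long sid) {
        ArrayBlockingQueue<String> bq = new ArrayBlockingQueue<>(SEND_CAPACITY);
        ArrayBlockingQueue<String> oldq = queueSendMap.putIfAbsent(sid, bq);
        return oldq != null ? oldq : bq;
    }

    // Proposed pattern: allocates only when no queue is mapped for this sid.
    ArrayBlockingQueue<String> viaComputeIfAbsent(long sid) {
        return queueSendMap.computeIfAbsent(sid, id -> new ArrayBlockingQueue<>(SEND_CAPACITY));
    }
}
```

Both methods return the queue that ends up in the map; the second one just avoids building a throwaway queue when the sid is already known.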
Message-receiving thread class: WorkerReceiver
The method to focus on is run:
public void run() {
    Message response;
    while (!stop) {
        // Sleeps on receive
        try {
            response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
            if(response == null) continue;
            // The current protocol and two previous generations all send at least 28 bytes
            if (response.buffer.capacity() < 28) {
                LOG.error("Got a short response: " + response.buffer.capacity());
                continue;
            }
            // this is the backwardCompatibility mode in place before ZK-107
            // It is for a version of the protocol in which we didn't send peer epoch
            // With peer epoch and version the message became 40 bytes
            boolean backCompatibility28 = (response.buffer.capacity() == 28);
            // this is the backwardCompatibility mode for no version information
            boolean backCompatibility40 = (response.buffer.capacity() == 40);
            response.buffer.clear();
            // Instantiate Notification and set its attributes
            Notification n = new Notification();
            int rstate = response.buffer.getInt();
            long rleader = response.buffer.getLong();
            long rzxid = response.buffer.getLong();
            long relectionEpoch = response.buffer.getLong();
            long rpeerepoch;
            int version = 0x0;
            if (!backCompatibility28) {
                rpeerepoch = response.buffer.getLong();
                if (!backCompatibility40) {
                    /*
                     * Version added in 3.4.6
                     */
                    version = response.buffer.getInt();
                } else {
                    LOG.info("Backward compatibility mode (36 bits), server id: {}", response.sid);
                }
            } else {
                LOG.info("Backward compatibility mode (28 bits), server id: {}", response.sid);
                rpeerepoch = ZxidUtils.getEpochFromZxid(rzxid);
            }
            QuorumVerifier rqv = null;
            // check if we have a version that includes config. If so extract config info from message.
            if (version > 0x1) {
                int configLength = response.buffer.getInt();
                byte b[] = new byte[configLength];
                response.buffer.get(b);
                synchronized(self) {
                    try {
                        rqv = self.configFromString(new String(b));
                        QuorumVerifier curQV = self.getQuorumVerifier();
                        if (rqv.getVersion() > curQV.getVersion()) {
                            LOG.info("{} Received version: {} my version: {}", self.getId(),
                                    Long.toHexString(rqv.getVersion()),
                                    Long.toHexString(self.getQuorumVerifier().getVersion()));
                            if (self.getPeerState() == ServerState.LOOKING) {
                                LOG.debug("Invoking processReconfig(), state: {}", self.getServerState());
                                self.processReconfig(rqv, null, null, false);
                                if (!rqv.equals(curQV)) {
                                    LOG.info("restarting leader election");
                                    self.shuttingDownLE = true;
                                    self.getElectionAlg().shutdown();
                                    break;
                                }
                            } else {
                                LOG.debug("Skip processReconfig(), state: {}", self.getServerState());
                            }
                        }
                    } catch (IOException e) {
                        LOG.error("Something went wrong while processing config received from {}", response.sid);
                    } catch (ConfigException e) {
                        LOG.error("Something went wrong while processing config received from {}", response.sid);
                    }
                }
            } else {
                LOG.info("Backward compatibility mode (before reconfig), server id: {}", response.sid);
            }
            /*
             * If it is from a non-voting server (such as an observer or
             * a non-voting follower), respond right away.
             */
            if(!validVoter(response.sid)) {
                Vote current = self.getCurrentVote();
                QuorumVerifier qv = self.getQuorumVerifier();
                ToSend notmsg = new ToSend(ToSend.mType.notification,
                        current.getId(),
                        current.getZxid(),
                        logicalclock.get(),
                        self.getPeerState(),
                        response.sid,
                        current.getPeerEpoch(),
                        qv.toString().getBytes());
                sendqueue.offer(notmsg);
            } else {
                // Receive new message
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Receive new notification message. My id = "
                            + self.getId());
                }
                // State of peer that sent this message
                QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                switch (rstate) {
                case 0:
                    ackstate = QuorumPeer.ServerState.LOOKING;
                    break;
                case 1:
                    ackstate = QuorumPeer.ServerState.FOLLOWING;
                    break;
                case 2:
                    ackstate = QuorumPeer.ServerState.LEADING;
                    break;
                case 3:
                    ackstate = QuorumPeer.ServerState.OBSERVING;
                    break;
                default:
                    continue;
                }
                n.leader = rleader;
                n.zxid = rzxid;
                n.electionEpoch = relectionEpoch;
                n.state = ackstate;
                n.sid = response.sid;
                n.peerEpoch = rpeerepoch;
                n.version = version;
                n.qv = rqv;
                /*
                 * Print notification info
                 */
                if(LOG.isInfoEnabled()){
                    printNotification(n);
                }
                /*
                 * If this server is looking, then send proposed leader
                 */
                if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                    recvqueue.offer(n);
                    /*
                     * Send a notification back if the peer that sent this
                     * message is also looking and its logical clock is
                     * lagging behind.
                     */
                    if((ackstate == QuorumPeer.ServerState.LOOKING)
                            && (n.electionEpoch < logicalclock.get())){
                        Vote v = getVote();
                        QuorumVerifier qv = self.getQuorumVerifier();
                        ToSend notmsg = new ToSend(ToSend.mType.notification,
                                v.getId(),
                                v.getZxid(),
                                logicalclock.get(),
                                self.getPeerState(),
                                response.sid,
                                v.getPeerEpoch(),
                                qv.toString().getBytes());
                        sendqueue.offer(notmsg);
                    }
                } else {
                    /*
                     * If this server is not looking, but the one that sent the ack
                     * is looking, then send back what it believes to be the leader.
                     */
                    Vote current = self.getCurrentVote();
                    if(ackstate == QuorumPeer.ServerState.LOOKING){
                        if (self.leader != null) {
                            if (leadingVoteSet != null) {
                                self.leader.setLeadingVoteSet(leadingVoteSet);
                                leadingVoteSet = null;
                            }
                            self.leader.reportLookingSid(response.sid);
                        }
                        if(LOG.isDebugEnabled()){
                            LOG.debug("Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
                                    self.getId(),
                                    response.sid,
                                    Long.toHexString(current.getZxid()),
                                    current.getId(),
                                    Long.toHexString(self.getQuorumVerifier().getVersion()));
                        }
                        QuorumVerifier qv = self.getQuorumVerifier();
                        ToSend notmsg = new ToSend(
                                ToSend.mType.notification,
                                current.getId(),
                                current.getZxid(),
                                current.getElectionEpoch(),
                                self.getPeerState(),
                                response.sid,
                                current.getPeerEpoch(),
                                qv.toString().getBytes());
                        sendqueue.offer(notmsg);
                    }
                }
            }
        } catch (InterruptedException e) {
            LOG.warn("Interrupted Exception while waiting for new message" +
                    e.toString());
        }
    }
    LOG.info("WorkerReceiver is down");
}
private boolean validVoter(long sid) {
    return self.getCurrentAndNextConfigVoters().contains(sid);
}
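Analysis:
1. Loop, polling messages from the receive queue of QuorumCnxManager.
2. Check that the message is at least 28 bytes long, because the current protocol and the two previous generations all send at least 28 bytes; shorter messages are logged and dropped.
3. Read the content according to the protocol version. For the latest version, the following are read in order: the sender's server state, the serverId of the proposed leader, the zxid of the proposed leader, the election epoch, the peer epoch of the proposed leader, the Notification version number, the length of the configData byte array (the sender's serialized QuorumVerifier), and the configData byte array itself.
4. Restore the configData byte array to a QuorumVerifier object and compare the sender's QuorumVerifier version with the local one. If the sender's version is higher and this server is in the LOOKING state, the reconfig flow (processReconfig) is started; if the received configuration also differs from the current one, leader election is restarted and the current message-receiving thread ends.
5. Check whether the server that sent the vote is in the voter list of this ensemble (current or next configuration).
6. If the sender is not a valid voter (for example an observer), this server's own current vote is wrapped into a ToSend object and put on the send queue sendqueue right away; the sender thread then returns it to the sender.
7. If the sender is a valid voter, the sender's server state that was read is converted to a ServerState, a Notification object is assembled, and the next step depends on this server's own state:
- (7.1) If this server is in the LOOKING state (still looking for a Leader), the Notification object is put on its own vote receive queue recvqueue. If the sender is also LOOKING and its election epoch is lower than this server's election epoch, this server's own vote is wrapped into a ToSend object and put on sendqueue, to be sent by the sender thread.
- (7.2) If this server is not in the LOOKING state (it may be LEADING, FOLLOWING, or OBSERVING), the Leader of the ensemble has already been elected. If the sender's state is LOOKING:
(7.2.1) Check whether this server is the Leader (self.leader != null). If it is, the sid of the LOOKING server is reported to the Leader object through reportLookingSid. The Leader checks whether more than half of the Followers have connected to it within the configured period (initLimit*tickTime); if they have not, the Leader gives up leadership and a new leader election starts.
(7.2.2) Since this server already knows the Leader, the Leader information is wrapped into a ToSend object and put on sendqueue, to be sent by the sender thread.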
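Steps 2 and 3 (the length check and the version-dependent parsing) can be sketched in isolation. This is a simplified illustration under stated assumptions: NotificationDecodeSketch and Decoded are hypothetical names, the config bytes themselves are not parsed, and the 28-byte branch assumes that the epoch is the upper 32 bits of the zxid, which is what ZxidUtils.getEpochFromZxid is understood to compute.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the parsing part of WorkerReceiver.run().
final class NotificationDecodeSketch {
    static final class Decoded {
        int state;
        long leader;
        long zxid;
        long electionEpoch;
        long peerEpoch;
        int version;           // stays 0 for the two legacy formats
        int configLength = -1; // -1 means the packet carries no config
    }

    static Decoded decode(ByteBuffer buf) {
        int capacity = buf.capacity();
        if (capacity < 28) {
            throw new IllegalArgumentException("short packet: " + capacity);
        }
        boolean compat28 = capacity == 28; // legacy format without peer epoch
        boolean compat40 = capacity == 40; // legacy format whose version is not read
        buf.clear();                       // position 0, limit = capacity
        Decoded d = new Decoded();
        d.state = buf.getInt();
        d.leader = buf.getLong();
        d.zxid = buf.getLong();
        d.electionEpoch = buf.getLong();
        if (compat28) {
            d.peerEpoch = d.zxid >> 32;    // assumption: epoch is the high 32 bits
        } else {
            d.peerEpoch = buf.getLong();
            if (!compat40) {
                d.version = buf.getInt();
            }
        }
        if (d.version > 0x1) {
            d.configLength = buf.getInt(); // config bytes would follow here
        }
        return d;
    }
}
```

A 28-byte packet therefore yields version 0 and a peer epoch derived from the zxid, while a packet in the current format yields the explicit peer epoch, the version, and the config length.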
The leader election logic in detail
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
        self.start_fle = Time.currentElapsedTime();
    }
    try {
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();
        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();
        int notTimeout = minNotificationInterval;
        synchronized(this){
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        LOG.info("New election. My id = " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        sendNotifications();
        SyncedLearnerTracker voteSet;
        /*
         * Loop in which we exchange notifications until we find a leader
         */
        while ((self.getPeerState() == ServerState.LOOKING) &&
                (!stop)){
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            Notification n = recvqueue.poll(notTimeout,
                    TimeUnit.MILLISECONDS);
            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if(n == null){
                if(manager.haveDelivered()){
                    sendNotifications();
                } else {
                    manager.connectAll();
                }
                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout*2;
                notTimeout = (tmpTimeOut < maxNotificationInterval?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    if (getInitLastLoggedZxid() == -1) {
                        LOG.debug("Ignoring notification as our zxid is -1");
                        break;
                    }
                    if (n.zxid == -1) {
                        LOG.debug("Ignoring notification from member with -1 zxid" + n.sid);
                        break;
                    }
                    // If notification > current, replace and send messages out
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if(LOG.isDebugEnabled()){
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }
                    if(LOG.isDebugEnabled()){
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }
                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                    voteSet = getVoteTracker(
                            recvset, new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch));
                    if (voteSet.hasAllQuorums()) {
                        // Verify if there is any change in the proposed leader
                        while((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null){
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)){
                                recvqueue.put(n);
                                break;
                            }
                        }
                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid, logicalclock.get(),
                                    proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if(n.electionEpoch == logicalclock.get()){
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        voteSet = getVoteTracker(recvset, new Vote(n.version,
                                n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (voteSet.hasAllQuorums() &&
                                checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            setPeerState(n.leader, voteSet);
                            Vote endVote = new Vote(n.leader,
                                    n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version, n.leader,
                            n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    voteSet = getVoteTracker(outofelection, new Vote(n.version,
                            n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    if (voteSet.hasAllQuorums() &&
                            checkLeader(outofelection, n.leader, n.electionEpoch)) {
                        synchronized(this){
                            logicalclock.set(n.electionEpoch);
                            setPeerState(n.leader, voteSet);
                        }
                        Vote endVote = new Vote(n.leader, n.zxid,
                                n.electionEpoch, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state
                            + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if(self.jmxLeaderElectionBean != null){
                MBeanRegistry.getInstance().unregister(
                        self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}",
                manager.getConnectionThreadCount());
    }
}
// Send the vote for the Leader this server proposes to every voting ZooKeeper server,
// that is, broadcast the current vote to the ensemble
private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch, qv.toString().getBytes());
        if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
                    Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get()) +
                    " (n.round), " + sid + " (recipient), " + self.getId() +
                    " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        sendqueue.offer(notmsg);
    }
}
// Compare two votes. The rules are:
// 1. The peer epoch comes first: the vote with the larger epoch is better
// 2. With equal peer epochs, the vote with the larger zxid (largest transaction id) is better
// 3. With equal peer epochs and equal zxids, the vote with the larger serverId (myid) is better
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }
    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
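The three comparison rules can be tried out with a small standalone sketch. The class and method names (VoteOrderSketch, newVoteWins) are hypothetical, and the weight check of the real method is left out on the assumption that every server has a non-zero weight.

```java
// Hypothetical stand-in for totalOrderPredicate without the weight check.
final class VoteOrderSketch {
    // Returns true when the new vote (id, zxid, epoch) beats the current one.
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // rule 1: higher peer epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // rule 2: higher zxid wins
        }
        return newId > curId;           // rule 3: higher server id wins
    }

    public static void main(String[] args) {
        // Higher epoch wins even though zxid and id are lower.
        System.out.println(newVoteWins(1, 10, 2, 3, 99, 1));
        // Same epoch and zxid: the larger server id decides.
        System.out.println(newVoteWins(3, 9, 1, 1, 9, 1));
    }
}
```

Note that an identical vote does not win against itself, so a server never re-broadcasts just because it received a copy of its own proposal.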
// Build a SyncedLearnerTracker (vote tally) and record an ack for every
// server whose vote equals the given vote
protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote vote) {
    SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
    voteSet.addQuorumVerifier(self.getQuorumVerifier());
    if (self.getLastSeenQuorumVerifier() != null
            && self.getLastSeenQuorumVerifier().getVersion() > self
                    .getQuorumVerifier().getVersion()) {
        voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
    }
    /*
     * First make the views consistent. Sometimes peers will have different
     * zxids for a server depending on timing.
     */
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }
    return voteSet;
}
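The counting idea behind getVoteTracker and hasAllQuorums can be sketched with a plain majority rule. This is an assumption-laden simplification: VoteTallySketch and SimpleVote are hypothetical names, and a simple majority over a fixed ensemble size stands in for the QuorumVerifier, which in ZooKeeper can also be weight-based.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the tally done by getVoteTracker + hasAllQuorums.
final class VoteTallySketch {
    // Minimal vote value with value-based equality.
    static final class SimpleVote {
        final long leader, zxid, electionEpoch, peerEpoch;

        SimpleVote(long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SimpleVote)) return false;
            SimpleVote v = (SimpleVote) o;
            return leader == v.leader && zxid == v.zxid
                    && electionEpoch == v.electionEpoch && peerEpoch == v.peerEpoch;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, zxid, electionEpoch, peerEpoch);
        }
    }

    // Count the servers whose recorded vote equals the candidate vote.
    static int countAcks(Map<Long, SimpleVote> votes, SimpleVote candidate) {
        int acks = 0;
        for (SimpleVote v : votes.values()) {
            if (candidate.equals(v)) acks++;
        }
        return acks;
    }

    // Assumed rule: strictly more than half of the ensemble must agree.
    static boolean hasMajority(Map<Long, SimpleVote> votes, SimpleVote candidate, int ensembleSize) {
        return countAcks(votes, candidate) > ensembleSize / 2;
    }
}
```

In a three-server ensemble, two matching votes are enough for the candidate to pass, while one matching and one differing vote are not.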
protected boolean checkLeader(
        Map<Long, Vote> votes,
        long leader,
        long electionEpoch){
    boolean predicate = true;
    /*
     * If everyone else thinks I'm the leader, I must be the leader.
     * The other two checks are just for the case in which I'm not the
     * leader. If I'm not the leader and I haven't received a message
     * from leader stating that it is leading, then predicate is false.
     */
    if(leader != self.getId()){
        if(votes.get(leader) == null) predicate = false;
        else if(votes.get(leader).getState() != ServerState.LEADING) predicate = false;
    } else if(logicalclock.get() != electionEpoch) {
        predicate = false;
    }
    return predicate;
}
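The decision table of checkLeader is compact enough to restate as a standalone sketch. The names here (CheckLeaderSketch, State, the myId and myLogicalClock parameters) are hypothetical; in the real method the server id and logical clock come from the enclosing objects, and the map holds Vote objects rather than bare states.

```java
import java.util.Map;

// Hypothetical stand-in for checkLeader with its inputs made explicit.
final class CheckLeaderSketch {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    // votes maps a server id to the state that server reported about itself.
    static boolean checkLeader(Map<Long, State> votes, long leader, long electionEpoch,
                               long myId, long myLogicalClock) {
        if (leader != myId) {
            // Another server is proposed: it must itself have reported LEADING.
            // A missing entry yields null, which is not LEADING.
            return votes.get(leader) == State.LEADING;
        }
        // This server is proposed: accept only a vote from its current round.
        return myLogicalClock == electionEpoch;
    }
}
```

So a quorum of followers pointing at server 3 is not sufficient on its own; server 3 must also have announced that it is leading before this server follows it.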
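As mentioned earlier, when the run method of QuorumPeer finds that the server state is LOOKING it performs leader election, and the method it calls is lookForLeader() of the Election interface. Here is how this method elects a Leader:
1. Register the JMX bean.
2. Create recvset, which collects the votes received in the current election round, and outofelection, which collects the votes reported by servers that have already left the election (FOLLOWING or LEADING).
3. Increment this server's election epoch (logicalclock) by one and update the cached proposal to a vote for itself.
4. Call sendNotifications(), which iterates over the voting ZooKeeper servers, wraps the current proposal together with each recipient's serverId into a ToSend object, and puts it on the send queue sendqueue; the message-sending thread then delivers it to the other ZooKeeper servers in the ensemble.
5. While this server's state is LOOKING and the stop flag is false, keep repeating the following steps:
- (5.1) The message-receiving thread WorkerReceiver described earlier puts the Notification objects of valid votes on the recvqueue queue; here a Notification object is polled from recvqueue.
- (5.2) If the Notification is null, no vote from another ZooKeeper server arrived within the timeout. The method then checks whether the messages sent to other servers have been delivered: if they have been delivered but no vote came back, the vote is sent again to the voting servers; if they have not been delivered, no connection to the other voting servers exists yet and connectAll() is called to connect. After that, the blocking time for polling recvqueue is doubled, up to the maximum maxNotificationInterval.
- (5.3) If a message from another ZooKeeper server was received, first verify that both the sender and the proposed Leader are in the voter list (current or next configuration), that is, verify that the vote is valid; then handle it according to the state field, which is the state of the sender.
- (5.4) If the sender's state is LOOKING:
(5.4.1) If this server's own zxid or the zxid in the vote is -1, the notification is ignored and the loop moves on to the next iteration.
(5.4.2) If the election epoch of the vote is greater than this server's election epoch, the local election round is outdated: set the local election epoch to the vote's election epoch, clear all votes received so far, and compare the received vote with this server's own information (serverId, zxid, peerEpoch). If the received vote is better, the cached proposal becomes the received vote; otherwise it becomes a vote for this server itself. The updated proposal is then broadcast to the other voting servers.
(5.4.3) If the election epoch of the vote is smaller than this server's election epoch, the outdated vote is ignored and the loop moves on to the next iteration.
(5.4.4) If the election epochs are equal, both sides are in the same election round: compare the received vote with the cached proposal, and if the received vote is better, update the cached proposal and broadcast it.
(5.4.5) Put the received vote into recvset, build a SyncedLearnerTracker from recvset and this server's current proposal to count the acks for that proposal, and check whether the proposal has a quorum.
(5.4.6) If it has a quorum, keep polling recvqueue with a timeout of finalizeWait (200 ms) to see whether the proposed Leader changes. If a vote arrives that is better than the cached proposal, it is put back on recvqueue and the loop moves on to the next iteration; a vote that is not better is discarded. If no vote arrives within finalizeWait, the cached proposal is the final Leader vote: the election ends, the server state is set accordingly, the pending votes are cleared, and the result is returned to QuorumPeer.
- (5.5) If the sender's state is OBSERVING, the vote is ignored, because OBSERVING servers do not take part in voting.
- (5.6) If the sender's state is FOLLOWING or LEADING, the ensemble has already elected a Leader and the vote carries the information of that Leader. This can happen, for example, when this server was blocked or slow during startup, so the other PARTICIPANT servers elected a Leader first; once started, this server votes for itself, broadcasts the vote, and receives the already-decided Leader from the other servers (compare with WorkerReceiver).
(5.6.1) If the election epochs are equal, both sides are in the same election round: the sender had already settled on a Leader before it received this server's vote (possible as soon as a majority agrees, whether by weight or by number of servers), and it answers this server's vote directly with the decided Leader (compare with WorkerReceiver). The received vote is put into recvset, a SyncedLearnerTracker is built from recvset and the received vote to count its acks, and then the quorum check and the Leader validity check (checkLeader) are run. If both pass, the election ends: the server state is set, the pending votes are cleared, and the result is returned to QuorumPeer.
(5.6.2) If the election epochs differ, or the quorum check fails, or the Leader is not valid, the vote is put into outofelection, a SyncedLearnerTracker is built from outofelection and the received vote, and the same two checks are run again. If they pass, the election epoch and the server state are set while holding the lock on the current FastLeaderElection object, the pending votes are cleared, and the result is returned to QuorumPeer; otherwise the loop moves on to the next iteration.
- (5.7) If the state is unknown, a warning is logged and the vote is ignored.
6. When lookForLeader() finishes, the JMX bean is unregistered.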
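The timeout growth in the empty-poll branch (doubling, capped at maxNotificationInterval) can be illustrated on its own. The names NotificationBackoffSketch, next, and schedule are hypothetical, and the values 200 ms and 60000 ms used below are assumed for illustration rather than taken from the listing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the "Exponential backoff" lines in lookForLeader().
final class NotificationBackoffSketch {
    // Same rule as in the listing: double the timeout, but never exceed max.
    static int next(int current, int max) {
        int doubled = current * 2;
        return doubled < max ? doubled : max;
    }

    // Timeout values after each of the given number of consecutive empty polls.
    static List<Integer> schedule(int min, int max, int emptyPolls) {
        List<Integer> out = new ArrayList<>();
        int timeout = min;
        for (int i = 0; i < emptyPolls; i++) {
            timeout = next(timeout, max);
            out.add(timeout);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed values: start at 200 ms, cap at 60000 ms.
        System.out.println(schedule(200, 60000, 10));
    }
}
```

With these assumed values the timeout reaches the cap after nine empty polls and stays there, so an isolated server settles into retrying roughly once a minute.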