Leader-Follower Synchronization After the Election
org.apache.zookeeper.server.quorum.QuorumPeer#run
The main responsibilities of this thread's run() method:
- the main leader-election flow: lookForLeader()
- data synchronization between follower and leader: follower.followLeader() and leader.lead()
- starting the LearnerHandler threads with which the leader receives and processes follower data (most of the data synchronization work happens there)
- maintaining the liveness-check (ping) mechanism between leader and followers
- syncing the leader's data before a follower starts serving
- the follower loop that receives and processes the data the leader sends
Let's walk through the code. Focus on the comments; non-essential parts have been elided.
@Override
public void run() {
//*************** JMX registration and the like
try {
/*
* Main loop
*/
while (running) {
// this server's current state
switch (getPeerState()) {
case LOOKING:
//**************** main election flow; see ZooKeeper Source Analysis, Part 2
case OBSERVING:
//**************** main flow for OBSERVER; not covered in depth here. Once you understand the Follower, this part is simple
case FOLLOWING:
// follower
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
// any exception while the server is running drops it back to LOOKING
setPeerState(ServerState.LOOKING);
}
break;
case LEADING:
LOG.info("LEADING");
// leader
try {
setLeader(makeLeader(logFactory));
// mainly starts the LearnerHandler threads
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
}
}
} finally {
//*****************
}
}
Summary:
The core logic: every peer starts out in LOOKING. Once the election completes, each peer takes on its own state and enters the corresponding case.
leader -> leader.lead();
follower -> follower.followLeader();
Let's analyze these two methods.
leader.lead()
void lead() throws IOException, InterruptedException {
self.end_fle = Time.currentElapsedTime();
long electionTimeTaken = self.end_fle - self.start_fle;
self.setElectionTimeTaken(electionTimeTaken);
LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken);
self.start_fle = 0;
self.end_fle = 0;
zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean);
try {
self.tick.set(0);
zk.loadData();
// the epoch of the current election round, plus the largest transaction id
leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());
// Start thread that waits for connection requests from new followers.
// this thread accepts connections from new followers; each new follower connection spawns a LearnerHandler thread to handle everything that follows
cnxAcceptor = new LearnerCnxAcceptor();
cnxAcceptor.start();
readyToStart = true;
// the leader proposes the new epoch; this call waits until a quorum has contributed, and a LearnerHandler will wake it up (clever)
long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
/**
 *
 * Like a transaction id in an RDBMS, the zxid identifies a single update proposal. To guarantee ordering it must be
 * monotonically increasing, so ZooKeeper uses a 64-bit number: the high 32 bits are the leader's epoch, starting
 * from 1 and incremented each time a new leader is elected; the low 32 bits are a counter within that epoch,
 * reset whenever the epoch changes. This keeps zxids globally increasing.
 *
 * Here the new leader's starting zxid is built from the epoch in the high 32 bits:
 *
 * (epoch << 32L) | (counter & 0xffffffffL)
 */
zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
synchronized(this){
lastProposed = zk.getZxid();
}
newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
null, null);
if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
LOG.info("NEWLEADER proposal has Zxid of "
+ Long.toHexString(newLeaderProposal.packet.getZxid()));
}
// the leader sends out its previous largest transaction id for the followers to confirm;
// once more than half agree, everyone is ready to start syncing the leader's data
waitForEpochAck(self.getId(), leaderStateSummary);
self.setCurrentEpoch(epoch);
// We have to get at least a majority of servers in sync with
// us. We do this by waiting for the NEWLEADER packet to get
// acknowledged
try {
waitForNewLeaderAck(self.getId(), zk.getZxid());
} catch (InterruptedException e) {
shutdown("Waiting for a quorum of followers, only synced with sids: [ "
+ getSidSetString(newLeaderProposal.ackSet) + " ]");
HashSet<Long> followerSet = new HashSet<Long>();
for (LearnerHandler f : learners)
followerSet.add(f.getSid());
if (self.getQuorumVerifier().containsQuorum(followerSet)) {
LOG.warn("Enough followers present. "
+ "Perhaps the initTicks need to be increased.");
}
Thread.sleep(self.tickTime);
self.tick.incrementAndGet();
return;
}
// initialization
startZkServer();
/**
* WARNING: do not use this for anything other than QA testing
* on a real cluster. Specifically to enable verification that quorum
* can handle the lower 32bit roll-over issue identified in
* ZOOKEEPER-1277. Without this option it would take a very long
* time (on order of a month say) to see the 4 billion writes
* necessary to cause the roll-over to occur.
*
* This field allows you to override the zxid of the server. Typically
* you'll want to set it to something like 0xfffffff0 and then
* start the quorum, run some operations and see the re-election.
*/
String initialZxid = System.getProperty("zookeeper.testingonly.initialZxid");
if (initialZxid != null) {
long zxid = Long.parseLong(initialZxid);
zk.setZxid((zk.getZxid() & 0xffffffff00000000L) | zxid);
}
if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) {
self.cnxnFactory.setZooKeeperServer(zk);
}
// Everything is a go, simply start counting the ticks
// WARNING: I couldn't find any wait statement on a synchronized
// block that would be notified by this notifyAll() call, so
// I commented it out
//synchronized (this) {
// notifyAll();
//}
// We ping twice a tick, so we only update the tick every other
// iteration
boolean tickSkip = true;
while (true) {
Thread.sleep(self.tickTime / 2);
if (!tickSkip) {
self.tick.incrementAndGet();
}
HashSet<Long> syncedSet = new HashSet<Long>();
// lock on the followers when we use it.
syncedSet.add(self.getId());
for (LearnerHandler f : getLearners()) {
// Synced set is used to check we have a supporting quorum, so only
// PARTICIPANT, not OBSERVER, learners should be used
if (f.synced() && f.getLearnerType() == LearnerType.PARTICIPANT) {
syncedSet.add(f.getSid());
}
f.ping();
}
// check leader running status
if (!this.isRunning()) {
shutdown("Unexpected internal error");
return;
}
if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
//if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {
// Lost quorum, shutdown
shutdown("Not sufficient followers synced, only synced with sids: [ "
+ getSidSetString(syncedSet) + " ]");
// make sure the order is the same!
// the leader goes to looking
return;
}
tickSkip = !tickSkip;
}
} finally {
zk.unregisterJMX(this);
}
}
As you can see, this is actually fairly simple:
- a LearnerCnxAcceptor accepts connections from all followers; each follower gets its own socket connection, and a LearnerHandler thread is created to exchange data with that follower
- a while loop keeps pinging the followers
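A quick aside before the acceptor: the zxid layout described in the comment inside lead() is plain bit arithmetic and easy to verify. A minimal standalone sketch (it mirrors the bit math of ZxidUtils, but the class and method names here are illustrative, not the real ones):
// Standalone demo of the zxid layout: high 32 bits = epoch,
// low 32 bits = per-epoch counter.
public class ZxidDemo {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
    static long epochOf(long zxid)   { return zxid >> 32L; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long first = makeZxid(5, 0);                  // first zxid of epoch 5
        System.out.println(Long.toHexString(first));  // 500000000
        System.out.println(epochOf(first + 3));       // 5 (same epoch)
        System.out.println(counterOf(first + 3));     // 3 (third update in it)
    }
}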
org.apache.zookeeper.server.quorum.Leader.LearnerCnxAcceptor#run
@Override
public void run() {
try {
while (!stop) {
try{
Socket s = ss.accept();
// start with the initLimit, once the ack is processed
// in LearnerHandler switch to the syncLimit
s.setSoTimeout(self.tickTime * self.initLimit);
s.setTcpNoDelay(nodelay);
BufferedInputStream is = new BufferedInputStream(
s.getInputStream());
// every server other than the leader connects to the leader; each such connection spawns a new LearnerHandler thread
LearnerHandler fh = new LearnerHandler(s, is, Leader.this);
fh.start();
} catch (SocketException e) {
if (stop) {
LOG.info("exception while shutting down acceptor: "
+ e);
// When Leader.shutdown() calls ss.close(),
// the call to accept throws an exception.
// We catch and set stop to true.
stop = true;
} else {
throw e;
}
} catch (SaslException e){
LOG.error("Exception while connecting to quorum learner", e);
}
}
} catch (Exception e) {
LOG.warn("Exception while accepting follower", e);
}
}
LearnerHandler main logic: run() is the core method of this class. It performs the startup-time data sync with the learner, and once the sync completes it handles the normal ongoing interaction.
@Override
public void run() {
try {
leader.addLearnerHandler(this);
tickOfNextAckDeadline = leader.self.tick.get()
+ leader.self.initLimit + leader.self.syncLimit;
ia = BinaryInputArchive.getArchive(bufferedInput);
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
oa = BinaryOutputArchive.getArchive(bufferedOutput);
QuorumPacket qp = new QuorumPacket();
ia.readRecord(qp, "packet");
if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() != Leader.OBSERVERINFO){
LOG.error("First packet " + qp.toString()
+ " is not FOLLOWERINFO or OBSERVERINFO!");
return;
}
byte learnerInfoData[] = qp.getData();
if (learnerInfoData != null) {
// 2. read the LearnerInfo sent by the learner and extract the sid (serverId)
if (learnerInfoData.length == 8) {
ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
this.sid = bbsid.getLong();
} else {
LearnerInfo li = new LearnerInfo();
ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
this.sid = li.getServerid();
this.version = li.getProtocolVersion();
}
} else {
this.sid = leader.followerCounter.getAndDecrement();
}
LOG.info("Follower sid: " + sid + " : info : "
+ leader.self.quorumPeers.get(sid));
if (qp.getType() == Leader.OBSERVERINFO) {
learnerType = LearnerType.OBSERVER;
}
// record this learner's latest epoch
long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
long peerLastZxid;
StateSummary ss = null;
long zxid = qp.getZxid();
// if the learner's epoch (high 32 bits) is higher than ours, update our own
long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
if (this.getVersion() < 0x10000) {
// we are going to have to extrapolate the epoch information
long epoch = ZxidUtils.getEpochFromZxid(zxid);
ss = new StateSummary(epoch, zxid);
// fake the message
leader.waitForEpochAck(this.getSid(), ss);
} else {
//
byte ver[] = new byte[4];
ByteBuffer.wrap(ver).putInt(0x10000);
// 3. send the leader's state, as a LEADERINFO packet
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
oa.writeRecord(newEpochPacket, "packet");
bufferedOutput.flush();
// 6. read the ack
QuorumPacket ackEpochPacket = new QuorumPacket();
ia.readRecord(ackEpochPacket, "packet");
// error out if the ack is not what we expect
if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
LOG.error(ackEpochPacket.toString()
+ " is not ACKEPOCH");
return;
}
ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
// likewise, wait here until a quorum has acked the epoch
leader.waitForEpochAck(this.getSid(), ss);
}
peerLastZxid = ss.getLastZxid();
// at this point the follower has completed its registration with the leader;
// next comes the follower's data sync. Keep in mind that this thread (this) serves exactly one follower
/* the default to send to the follower */
int packetToSend = Leader.SNAP;
long zxidToSend = 0;
long leaderLastZxid = 0;
/** the packets that the follower needs to get updates from **/
long updates = peerLastZxid;
/* we are sending the diff check if we have proposals in memory to be able to
* send a diff to the
*/
// take the read lock: during this window reads are allowed but writes are not (the leader temporarily cannot accept write requests)
ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
ReadLock rl = lock.readLock();
try {
rl.lock();
// fetch the min and max zxids of the leader's locally committed log;
// committed requests are cached in a linked list capped at 500 entries, i.e. the last 500 requests
// org.apache.zookeeper.server.ZKDatabase.addCommittedProposal(org.apache.zookeeper.server.Request)
final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
LOG.info("Synchronizing with Follower sid: " + sid
+" maxCommittedLog=0x"+Long.toHexString(maxCommittedLog)
+" minCommittedLog=0x"+Long.toHexString(minCommittedLog)
+" peerLastZxid=0x"+Long.toHexString(peerLastZxid));
// the committed-log history
LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();
// if the learner's (follower's) latest zxid equals the leader's current latest, there is no diff
if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
// Follower is already sync with us, send empty diff
LOG.info("leader and follower are in sync, zxid=0x{}",
Long.toHexString(peerLastZxid));
packetToSend = Leader.DIFF;
zxidToSend = peerLastZxid;
} else if (proposals.size() != 0) {
// not equal, so work out which side is newer
LOG.debug("proposal size is {}", proposals.size());
// if the learner's zxid falls within the leader's [minCommittedLog, maxCommittedLog] range
if ((maxCommittedLog >= peerLastZxid)
&& (minCommittedLog <= peerLastZxid)) {
LOG.debug("Sending proposals to follower");
// as we look through proposals, this variable keeps track of previous
// proposal Id.
long prevProposalZxid = minCommittedLog;
// Keep track of whether we are about to send the first packet.
// Before sending the first packet, we have to tell the learner
// whether to expect a trunc or a diff
boolean firstPacket=true;
// If we are here, we can use committedLog to sync with
// follower. Then we only need to decide whether to
// send trunc or not
packetToSend = Leader.DIFF;
zxidToSend = maxCommittedLog;
for (Proposal propose: proposals) {
// skip the proposals the peer already has
// no need to send what the peer already has
if (propose.packet.getZxid() <= peerLastZxid) {
prevProposalZxid = propose.packet.getZxid();
continue;
} else {
// If we are sending the first packet, figure out whether to trunc
// in case the follower has some proposals that the leader doesn't
// at the first packet, decide whether the learner must truncate first
if (firstPacket) {
firstPacket = false;
// Does the peer have some proposals that the leader hasn't seen yet
if (prevProposalZxid < peerLastZxid) { // the learner has proposals the leader has never seen (normally prevProposalZxid == peerLastZxid)
// send a trunc message before sending the diff
packetToSend = Leader.TRUNC; // tell the learner to roll back
zxidToSend = prevProposalZxid;
updates = zxidToSend;
}
}
queuePacket(propose.packet);
QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
null, null);
queuePacket(qcommit);
}
}
/**
 * Packets are sent as:
 * data
 * COMMIT
 * data
 * COMMIT
 * data
 * COMMIT
 * .....
 *
 */
} else if (peerLastZxid > maxCommittedLog) {
// the learner is ahead of the leader, so it must roll back too
LOG.debug("Sending TRUNC to follower zxidToSend=0x{} updates=0x{}",
Long.toHexString(maxCommittedLog),
Long.toHexString(updates));
packetToSend = Leader.TRUNC;
zxidToSend = maxCommittedLog;
updates = zxidToSend;
} else {
LOG.warn("Unhandled proposal scenario");
}
} else {
// just let the state transfer happen
LOG.debug("proposals is empty");
}
LOG.info("Sending " + Leader.getPacketType(packetToSend));
// also forward the leader's in-flight (not yet committed) requests
leaderLastZxid = leader.startForwarding(this, updates);
} finally {
rl.unlock();
}
//build the NEWLEADER packet; sending it tells the learner that everything it needs to sync has been queued
QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
ZxidUtils.makeZxid(newEpoch, 0), null, null);
if (getVersion() < 0x10000) {
oa.writeRecord(newLeaderQP, "packet");
} else {
//add it to the send queue
queuedPackets.add(newLeaderQP);
}
bufferedOutput.flush(); //sends the NEWLEADER message when getVersion() < 0x10000
//Need to set the zxidToSend to the latest zxid
if (packetToSend == Leader.SNAP) { //for a SNAP sync, use the latest zxid
zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
}
// write the sync type first, then send the data
oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
bufferedOutput.flush();
/* if we are not truncating or sending a diff just send a snapshot */
if (packetToSend == Leader.SNAP) { //sending SNAP tells the learner to sync via a full snapshot
LOG.info("Sending snapshot last zxid of peer is 0x"
+ Long.toHexString(peerLastZxid) + " "
+ " zxid of leader is 0x"
+ Long.toHexString(leaderLastZxid)
+ "sent zxid of db as 0x"
+ Long.toHexString(zxidToSend));
// Dump data to peer
leader.zk.getZKDatabase().serializeSnapshot(oa); //SNAP recovery simply ships the serialized contents of the current db
oa.writeString("BenWasHere", "signature"); //有特定的签名
}
bufferedOutput.flush();
// Start sending packets
// the packets to send are already queued above; start a thread to drain the queue at its own pace
// the thread only exits when it pulls the special terminating packet from the queue,
// and that packet is only enqueued by shutdown()
new Thread() {
public void run() {
Thread.currentThread().setName(
"Sender-" + sock.getRemoteSocketAddress());
try {
sendPackets(); //keep sending packets until proposalOfDeath is received
} catch (InterruptedException e) {
LOG.warn("Unexpected interruption",e);
}
}
}.start(); //start the sender thread that ships the sync messages
/*
* Have to wait for the first ACK, wait until
* the leader is ready, and only then we can
* start processing messages.
*/
qp = new QuorumPacket();
ia.readRecord(qp, "packet");
if(qp.getType() != Leader.ACK){
LOG.error("Next packet was supposed to be an ACK");
return;
}
LOG.info("Received NEWLEADER-ACK message from " + getSid());
leader.waitForNewLeaderAck(getSid(), qp.getZxid()); //wait until a quorum of participants have acked
// past this point, initialization between the leader and this learner is complete
// normal synchronization begins
syncLimitCheck.start();
// now that the ack has been processed expect the syncLimit
// set the sync timeout
sock.setSoTimeout(leader.self.tickTime * leader.self.syncLimit); //read timeout during the request phase is tickTime * syncLimit
/*
* Wait until leader starts up
*/
synchronized(leader.zk){
while(!leader.zk.isRunning() && !this.isInterrupted()){
leader.zk.wait(20);
}
}
// Mutation packets will be queued during the serialize,
// so we need to mark when the peer can actually start
// using the data
// sending UPTODATE means a quorum has acked NEWLEADER; the learner can start accepting requests
queuedPackets.add(new QuorumPacket(Leader.UPTODATE, -1, null, null));
while (true) { //normal interaction: handle the learner's requests, etc.
// keep reading the QuorumPackets the learner sends over ia
qp = new QuorumPacket();
ia.readRecord(qp, "packet");
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
if (qp.getType() == Leader.PING) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
if (LOG.isTraceEnabled()) {
ZooTrace.logQuorumPacket(LOG, traceMask, 'i', qp);
}
tickOfNextAckDeadline = leader.self.tick.get() + leader.self.syncLimit;
ByteBuffer bb;
long sessionId;
int cxid;
int type;
// handle requests from the other server
switch (qp.getType()) {
case Leader.ACK:
// received a sync ack
if (this.learnerType == LearnerType.OBSERVER) {
if (LOG.isDebugEnabled()) {
LOG.debug("Received ACK from Observer " + this.sid);
}
}
syncLimitCheck.updateAck(qp.getZxid());
// commit the local request
leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
break;
case Leader.PING:
// Process the touches
ByteArrayInputStream bis = new ByteArrayInputStream(qp
.getData());
DataInputStream dis = new DataInputStream(bis);
while (dis.available() > 0) {
long sess = dis.readLong();
int to = dis.readInt();
leader.zk.touch(sess, to);
}
break;
case Leader.REVALIDATE:
// revalidate (reactivate) the session
bis = new ByteArrayInputStream(qp.getData());
dis = new DataInputStream(bis);
long id = dis.readLong();
int to = dis.readInt();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
dos.writeLong(id);
boolean valid = leader.zk.touch(id, to);
if (valid) {
try {
//set the session owner
// as the follower that
// owns the session
leader.zk.setOwner(id, this);
} catch (SessionExpiredException e) {
LOG.error("Somehow session " + Long.toHexString(id) + " expired right after being renewed! (impossible)", e);
}
}
if (LOG.isTraceEnabled()) {
ZooTrace.logTraceMessage(LOG,
ZooTrace.SESSION_TRACE_MASK,
"Session 0x" + Long.toHexString(id)
+ " is valid: "+ valid);
}
dos.writeBoolean(valid);
qp.setData(bos.toByteArray());
// re-queue this qp so it is sent back to the learner
queuedPackets.add(qp);
break;
case Leader.REQUEST:
// one question here: how does the learner, as a learner, send this request out? presumably via the sync port in QuorumServer
bb = ByteBuffer.wrap(qp.getData());
sessionId = bb.getLong();
cxid = bb.getInt();
type = bb.getInt();
bb = bb.slice();
Request si;
if(type == OpCode.sync){
// a sync request
si = new LearnerSyncRequest(this, sessionId, cxid, type, bb, qp.getAuthinfo());
} else {
// an ordinary request
si = new Request(null, sessionId, cxid, type, bb, qp.getAuthinfo());
}
si.setOwner(this);
// submit the request locally, much like standalone mode; FinalRequestProcessor will handle it
leader.zk.submitRequest(si);
break;
default:
LOG.warn("unexpected quorum packet, type: {}", packetToString(qp));
break;
}
}
} catch (IOException e) {
if (sock != null && !sock.isClosed()) {
LOG.error("Unexpected exception causing shutdown while sock "
+ "still open", e);
//close the socket to make sure the
//other side can see it being close
try {
sock.close();
} catch(IOException ie) {
// do nothing
}
}
} catch (InterruptedException e) {
LOG.error("Unexpected exception causing shutdown", e);
} finally {
LOG.warn("******* GOODBYE "
+ (sock != null ? sock.getRemoteSocketAddress() : "<null>")
+ " ********");
// not an ordinary shutdown
shutdown();
}
}
Summary
- inside LearnerHandler, the leader first performs the startup data sync with the learner
- it then starts a dedicated thread whose only job is to send data to the follower (including the data for this sync); see the sketch after this list
- meanwhile the handler itself loops, receiving and processing the data the follower sends back
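That sender thread is a classic queue-plus-poison-pill pattern: packets pile up in queuedPackets, one dedicated thread drains them, and shutdown() enqueues a special packet (proposalOfDeath) that tells the thread to exit. A self-contained sketch of the pattern, with simplified stand-in types rather than ZooKeeper's:
import java.util.concurrent.LinkedBlockingQueue;

public class SenderDemo {
    // poison pill, compared by reference just like ZooKeeper's proposalOfDeath
    static final String PROPOSAL_OF_DEATH = new String("proposalOfDeath");
    static final LinkedBlockingQueue<String> queuedPackets = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        Thread sender = new Thread(() -> {
            try {
                while (true) {
                    String p = queuedPackets.take();    // blocks until a packet arrives
                    if (p == PROPOSAL_OF_DEATH) break;  // enqueued only on shutdown
                    System.out.println("send " + p);    // stands in for oa.writeRecord(...)
                }
            } catch (InterruptedException ignored) {
            }
        }, "Sender-demo");
        sender.start();
        queuedPackets.add("PROPOSAL zxid=0x1");
        queuedPackets.add("COMMIT zxid=0x1");
        queuedPackets.add(PROPOSAL_OF_DEATH);           // graceful stop
        sender.join();
    }
}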
The interesting part: data synchronization
So ZooKeeper's data effectively lives in three places: the snapshot file holds the full data set; the most recent 500 committed records are cached in a linked list; and some proposals have been received by the leader but have not yet taken effect.
How the sync mode is computed
This is the read-locked section of LearnerHandler.run() shown above. With peerLastZxid as the learner's last zxid and [minCommittedLog, maxCommittedLog] as the window covered by the leader's in-memory committed log, the decision is:
- peerLastZxid equals the leader's last processed zxid: send an empty DIFF, since the learner is already in sync.
- peerLastZxid falls within [minCommittedLog, maxCommittedLog]: send a DIFF built from the cached proposals (each PROPOSAL followed by its COMMIT); if the learner holds proposals the leader has never seen, a TRUNC is sent first to roll it back.
- peerLastZxid is greater than maxCommittedLog: send a TRUNC down to maxCommittedLog.
- otherwise (the learner is too far behind the cache, or the cache is empty): fall back to a full SNAP transfer.
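To make the branch structure easier to scan, here is a minimal sketch of just that decision. The enum, method name, and signature are mine for illustration; they are not ZooKeeper identifiers:
public class SyncDecisionDemo {
    enum SyncMode { DIFF, TRUNC, SNAP }

    // mirrors the branch order of the real code shown above
    static SyncMode chooseSync(long peerLastZxid, long lastProcessedZxid,
                               long minCommittedLog, long maxCommittedLog,
                               boolean committedLogEmpty) {
        if (peerLastZxid == lastProcessedZxid) {
            return SyncMode.DIFF;   // already in sync: empty diff
        }
        if (committedLogEmpty) {
            return SyncMode.SNAP;   // nothing cached to replay
        }
        if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return SyncMode.DIFF;   // replay cached proposals (maybe after a TRUNC)
        }
        if (peerLastZxid > maxCommittedLog) {
            return SyncMode.TRUNC;  // learner is ahead: roll it back
        }
        return SyncMode.SNAP;       // too far behind the cache: full snapshot
    }

    public static void main(String[] args) {
        // leader: lastProcessedZxid = 0x120, committed log covers [0x100, 0x120]
        System.out.println(chooseSync(0x120, 0x120, 0x100, 0x120, false)); // DIFF
        System.out.println(chooseSync(0x110, 0x120, 0x100, 0x120, false)); // DIFF
        System.out.println(chooseSync(0x130, 0x120, 0x100, 0x120, false)); // TRUNC
        System.out.println(chooseSync(0x050, 0x120, 0x100, 0x120, false)); // SNAP
    }
}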
follower.followLeader()
void followLeader() throws InterruptedException {
//**************** JMX registration, not important
try {
QuorumServer leaderServer = findLeader();
try {
connectToLeader(leaderServer.addr, leaderServer.hostname); // connect to the leader
long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO); // send FOLLOWERINFO (the registration flow)
//check to see if the leader zxid is lower than ours
//this should never happen but is just a safety check
long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
if (newEpoch < self.getAcceptedEpoch()) {
LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
+ " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
throw new IOException("Error: Epoch of leader is lower");
}
syncWithLeader(newEpochZxid); // completes the data sync and server initialization; requests can now be handled
QuorumPacket qp = new QuorumPacket();
while (this.isRunning()) {
readPacket(qp);
processPacket(qp);
}
} catch (Exception e) {
LOG.warn("Exception when following the leader", e);
try {
sock.close();
} catch (IOException e1) {
e1.printStackTrace();
}
// clear pending revalidations
pendingRevalidations.clear();
}
} finally {
zk.unregisterJMX((Learner)this);
}
}
The follower-side logic for syncing from the leader at startup
/**
* Finally, synchronize our history with the Leader.
* @param newLeaderZxid
* @throws IOException
* @throws InterruptedException
*/
protected void syncWithLeader(long newLeaderZxid) throws IOException, InterruptedException{
QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
// In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot
// For SNAP and TRUNC the snapshot is needed to save that history
boolean snapshotNeeded = true;
readPacket(qp);
LinkedList<Long> packetsCommitted = new LinkedList<Long>();
LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>();
synchronized (zk) {
if (qp.getType() == Leader.DIFF) {
LOG.info("Getting a diff from the leader 0x{}", Long.toHexString(qp.getZxid()));
snapshotNeeded = false;
}
else if (qp.getType() == Leader.SNAP) {
LOG.info("Getting a snapshot from leader 0x" + Long.toHexString(qp.getZxid()));
// The leader is going to dump the database
// clear our own database and read
zk.getZKDatabase().clear();
zk.getZKDatabase().deserializeSnapshot(leaderIs);
String signature = leaderIs.readString("signature");
if (!signature.equals("BenWasHere")) {
LOG.error("Missing signature. Got " + signature);
throw new IOException("Missing signature");
}
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
} else if (qp.getType() == Leader.TRUNC) {
//we need to truncate the log to the lastzxid of the leader
LOG.warn("Truncating log to get in sync with the leader 0x"
+ Long.toHexString(qp.getZxid()));
boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
if (!truncated) {
// not able to truncate the log
LOG.error("Not able to truncate the log "
+ Long.toHexString(qp.getZxid()));
System.exit(13);
}
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
}
else {
LOG.error("Got unexpected packet from leader "
+ qp.getType() + " exiting ... " );
System.exit(13);
}
zk.createSessionTracker();
long lastQueued = 0;
// in Zab V1.0 (ZK 3.4+) we might take a snapshot when we get the NEWLEADER message, but in pre V1.0
// we take the snapshot on the UPDATE message, since Zab V1.0 also gets the UPDATE (after the NEWLEADER)
// we need to make sure that we don't take the snapshot twice.
boolean isPreZAB1_0 = true;
//If we are not going to take the snapshot be sure the transactions are not applied in memory
// but written out to the transaction log
boolean writeToTxnLog = !snapshotNeeded;
// we are now going to start getting transactions to apply followed by an UPTODATE
outerLoop:
while (self.isRunning()) {
readPacket(qp);
switch(qp.getType()) {
case Leader.PROPOSAL:
PacketInFlight pif = new PacketInFlight();
pif.hdr = new TxnHeader();
pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
if (pif.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(pif.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
System.out.println("=========提议");
lastQueued = pif.hdr.getZxid();
// during the vote phase, proposals are queued in this list first; once the commit command arrives they can be committed
packetsNotCommitted.add(pif);
break;
case Leader.COMMIT:
if (!writeToTxnLog) {
pif = packetsNotCommitted.peekFirst();
if (pif.hdr.getZxid() != qp.getZxid()) {
LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
} else {
zk.processTxn(pif.hdr, pif.rec);
packetsNotCommitted.remove();
}
} else {
// the commit command has arrived, so record the zxid in packetsCommitted
packetsCommitted.add(qp.getZxid());
}
break;
case Leader.INFORM:
/*
* Only observer get this type of packet. We treat this
* as receiving PROPOSAL and COMMMIT.
*/
PacketInFlight packet = new PacketInFlight();
packet.hdr = new TxnHeader();
packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
// Log warning message if txn comes out-of-order
if (packet.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(packet.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = packet.hdr.getZxid();
if (!writeToTxnLog) {
// Apply to db directly if we haven't taken the snapshot
zk.processTxn(packet.hdr, packet.rec);
} else {
packetsNotCommitted.add(packet);
packetsCommitted.add(qp.getZxid());
}
break;
case Leader.UPTODATE:
if (isPreZAB1_0) {
zk.takeSnapshot();
self.setCurrentEpoch(newEpoch);
}
self.cnxnFactory.setZooKeeperServer(zk);
break outerLoop;
case Leader.NEWLEADER: // Getting NEWLEADER here instead of in discovery
// means this is Zab 1.0
// Create updatingEpoch file and remove it after current
// epoch is set. QuorumPeer.loadDataBase() uses this file to
// detect the case where the server was terminated after
// taking a snapshot but before setting the current epoch.
File updating = new File(self.getTxnFactory().getSnapDir(),
QuorumPeer.UPDATING_EPOCH_FILENAME);
if (!updating.exists() && !updating.createNewFile()) {
throw new IOException("Failed to create " +
updating.toString());
}
if (snapshotNeeded) {
zk.takeSnapshot();
}
self.setCurrentEpoch(newEpoch);
if (!updating.delete()) {
throw new IOException("Failed to delete " +
updating.toString());
}
writeToTxnLog = true; //Anything after this needs to go to the transaction log, not applied directly in memory
isPreZAB1_0 = false;
writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
break;
}
}
}
// whatever the packet was, an ack is returned
ack.setZxid(ZxidUtils.makeZxid(newEpoch, 0));
writePacket(ack, true);
sock.setSoTimeout(self.tickTime * self.syncLimit);
// server initialization, same as standalone mode from here on
zk.startup();
/*
* Update the election vote here to ensure that all members of the
* ensemble report the same vote to new servers that start up and
* send leader election notifications to the ensemble.
*
* @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
*/
self.updateElectionVote(newEpoch);
// We need to log the stuff that came in between the snapshot and the uptodate
if (zk instanceof FollowerZooKeeperServer) {
FollowerZooKeeperServer fzk = (FollowerZooKeeperServer)zk;
for(PacketInFlight p: packetsNotCommitted) {
fzk.logRequest(p.hdr, p.rec);
}
for(Long zxid: packetsCommitted) {
fzk.commit(zxid);
}
} else if (zk instanceof ObserverZooKeeperServer) {
// Similar to follower, we need to log requests between the snapshot
// and UPTODATE
ObserverZooKeeperServer ozk = (ObserverZooKeeperServer) zk;
for (PacketInFlight p : packetsNotCommitted) {
Long zxid = packetsCommitted.peekFirst();
if (p.hdr.getZxid() != zxid) {
// log warning message if there is no matching commit
// old leader send outstanding proposal to observer
LOG.warn("Committing " + Long.toHexString(zxid)
+ ", but next proposal is "
+ Long.toHexString(p.hdr.getZxid()));
continue;
}
packetsCommitted.remove();
Request request = new Request(null, p.hdr.getClientId(),
p.hdr.getCxid(), p.hdr.getType(), null, null);
request.txn = p.rec;
request.hdr = p.hdr;
ozk.commitRequest(request);
}
} else {
// New server type need to handle in-flight packets
throw new UnsupportedOperationException("Unknown server type");
}
}
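The PROPOSAL/COMMIT pairing above is worth a tiny model: proposals queue up in packetsNotCommitted until a COMMIT carrying the matching zxid arrives, and only then are they applied in order. A self-contained toy version (the names are mine, not ZooKeeper's):
import java.util.ArrayDeque;
import java.util.Deque;

public class ProposalCommitDemo {
    static final class Txn {
        final long zxid;
        final String op;
        Txn(long zxid, String op) { this.zxid = zxid; this.op = op; }
    }

    public static void main(String[] args) {
        Deque<Txn> packetsNotCommitted = new ArrayDeque<>();
        // PROPOSAL packets arrive first and are queued...
        packetsNotCommitted.add(new Txn(0x500000001L, "create /a"));
        packetsNotCommitted.add(new Txn(0x500000002L, "setData /a"));
        // ...then each COMMIT releases the head if its zxid matches
        for (long commitZxid : new long[] { 0x500000001L, 0x500000002L }) {
            Txn head = packetsNotCommitted.peekFirst();
            if (head != null && head.zxid == commitZxid) {
                System.out.println("apply " + head.op); // zk.processTxn(hdr, rec) in the real code
                packetsNotCommitted.removeFirst();
            } else {
                System.out.println("commit 0x" + Long.toHexString(commitZxid) + " out of order");
            }
        }
    }
}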
Post-sync server startup (the request-processor chain)
public synchronized void startup() {
if (sessionTracker == null) {
createSessionTracker();
}
startSessionTracker();
// this part is important: it wires up the request processors, both the pre-processors and the post-processors
// note: in cluster mode every learner server calls this method, but FollowerZooKeeperServer and ObserverZooKeeperServer override it
// leader: sets up the chain and starts three threads
// PrepRequestProcessor.next = ProposalRequestProcessor.next = CommitProcessor.next = ToBeAppliedRequestProcessor.next = FinalRequestProcessor
// PrepRequestProcessor.run
// CommitProcessor.run
// SyncRequestProcessor.run
// follower: sets up the chain and starts three threads
// FollowerRequestProcessor.next = CommitProcessor.next = FinalRequestProcessor
// SyncRequestProcessor.next = SendAckRequestProcessor
// FollowerRequestProcessor.run
// CommitProcessor.run
// SyncRequestProcessor.run
// observer: sets up the chain and starts threads conditionally
// ObserverRequestProcessor.next = CommitProcessor.next = FinalRequestProcessor
// ObserverRequestProcessor.run
// CommitProcessor.run
// if (syncRequestProcessorEnabled) {
//     SyncRequestProcessor.run
// }
setupRequestProcessors();
registerJMX();
setState(State.RUNNING);
notifyAll();
}
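The processor chains listed in the comments above are a plain chain of responsibility: each processor does its piece of work and hands the request to its next. A minimal self-contained sketch of the follower wiring, using simplified stand-in types rather than ZooKeeper's real processor classes:
public class FollowerChainDemo {
    interface RequestProcessor {
        void processRequest(String request);
    }

    // each stage logs its name, then forwards to the next stage (null ends the chain)
    static final class Stage implements RequestProcessor {
        private final String name;
        private final RequestProcessor next;
        Stage(String name, RequestProcessor next) { this.name = name; this.next = next; }
        public void processRequest(String request) {
            System.out.println(name + " handling: " + request);
            if (next != null) next.processRequest(request);
        }
    }

    public static void main(String[] args) {
        // FollowerRequestProcessor.next = CommitProcessor.next = FinalRequestProcessor
        RequestProcessor first =
                new Stage("FollowerRequestProcessor",
                        new Stage("CommitProcessor",
                                new Stage("FinalRequestProcessor", null)));
        first.processRequest("create /foo");
    }
}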
Normal client-cluster interaction after startup
step 1: the client sends data directly to the leader
step 2: how a follower handles data a client sends to it
step 3: the leader accepts the data and, once a quorum has acked the proposal, sends it on to the followers
step 4: the overall data flow
A final note:
Reading the source cold can be tiring. I suggest pulling my annotated copy and reading it alongside this article. Source code is dry; take it one step at a time.
GitHub: https://github.com/WangTingYeYe/zookeeper-3.4.13-source