ZooKeeper Quick Start Study Notes (Part 2)

Latest recommended article published on 2024-06-15 00:00:00

猿灰灰

Category: Technology-Stack. Tags: java, backend

Article link: https://blog.csdn.net/qq_45408390/article/details/123951731

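Chapter 5 Interview Questions

5.1 Election Mechanism

ZooKeeper uses a majority rule: a candidate is elected once it receives votes from more than half of the servers.

Election rule on first startup: once a candidate holds more than half of the votes, the server with the larger server id wins.
Election rule when it is not the first startup:
- The vote with the larger EPOCH wins outright.
- If the EPOCHs are equal, the larger transaction id (zxid) wins.
- If the transaction ids are also equal, the larger server id wins.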
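
The three tie-breaking rules read naturally as a lexicographic comparison. The following is a minimal sketch rather than ZooKeeper's own code: the class `VoteOrder`, the nested `Vote` type, and the method `wins` are hypothetical names introduced only for illustration.

```java
/** Illustrative only: these names are hypothetical and not part of ZooKeeper. */
final class VoteOrder {
    /** A simplified vote: epoch, last transaction id, and server id. */
    static final class Vote {
        final long epoch;
        final long zxid;
        final long serverId;

        Vote(long epoch, long zxid, long serverId) {
            this.epoch = epoch;
            this.zxid = zxid;
            this.serverId = serverId;
        }
    }

    /** Returns true when the candidate vote beats the current vote. */
    static boolean wins(Vote candidate, Vote current) {
        if (candidate.epoch != current.epoch) {
            return candidate.epoch > current.epoch;      // rule 1: larger EPOCH
        }
        if (candidate.zxid != current.zxid) {
            return candidate.zxid > current.zxid;        // rule 2: larger zxid
        }
        return candidate.serverId > current.serverId;    // rule 3: larger server id
    }

    public static void main(String[] args) {
        Vote newerEpoch = new Vote(2, 100, 1);
        Vote olderEpoch = new Vote(1, 900, 3);
        System.out.println(wins(newerEpoch, olderEpoch));
    }
}
```

In this sketch a vote from a newer epoch wins even against a vote with a larger zxid and a larger server id, because the rules are applied strictly in order.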

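5.2 How Many ZooKeeper Servers Should a Production Cluster Have

Install an odd number of servers.

Rules of thumb from production:

10 servers: 3 ZooKeeper nodes
20 servers: 5 ZooKeeper nodes
100 servers: 11 ZooKeeper nodes
200 servers: 11 ZooKeeper nodes

More ZooKeeper nodes improve reliability, but they also increase communication latency.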
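
One way to see why odd sizes are recommended is to compute how many failures a majority quorum can tolerate. The sketch below assumes a simple majority quorum; `QuorumMath`, `quorumSize`, and `toleratedFailures` are hypothetical names used only for this illustration.

```java
/** Illustrative arithmetic for majority quorums; names are hypothetical. */
final class QuorumMath {
    /** Smallest number of servers that forms a strict majority of n. */
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    /** Number of servers that may fail while a majority is still alive. */
    static int toleratedFailures(int n) {
        return n - quorumSize(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " servers: quorum " + quorumSize(n)
                    + ", tolerated failures " + toleratedFailures(n));
        }
    }
}
```

Under this assumption a 4-node ensemble tolerates one failure, the same as a 3-node ensemble, so the extra even node adds latency without adding fault tolerance.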

Common commands: ls, get, create, delete

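Chapter 6 Source Code Analysis

6.1 Algorithm Basics

6.1.1 The Byzantine Generals Problem

Question: how does ZooKeeper guarantee data consistency?

The Byzantine Generals Problem is an agreement problem: the generals of the Byzantine army must unanimously decide whether to attack an enemy force. The generals are geographically separated, and some of them are traitors. A traitor may act arbitrarily to achieve any of the following: trick some generals into attacking; force a decision that not all generals agree with, for example bringing about an attack when the generals do not want one; or confuse some generals so that they cannot reach a decision. If the traitors achieve any of these goals, any attack is doomed to fail; only a fully coordinated effort can win.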

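6.1.2 The Paxos Algorithm

Paxos is a message-passing consensus algorithm with a high degree of fault tolerance.

The problem Paxos solves: how to reach agreement on a value quickly and correctly in a distributed system, while guaranteeing that no failure can break the consistency of the whole system.

Description of the Paxos algorithm

In a Paxos system, every node is assigned one or more of the roles Proposer, Acceptor, and Learner. (Note: a single node may hold several roles at once.)
A complete Paxos round has three phases:
- Prepare phase
  - The Proposer sends a Prepare request to multiple Acceptors.
  - Each Acceptor answers the Prepare request with a Promise.
- Accept phase
  - After receiving Promises from a majority of Acceptors, the Proposer sends a Propose request to the Acceptors.
  - Each Acceptor handles the Propose request with an Accept.
- Learn phase: the Proposer sends the resulting decision to all Learners.

Paxos algorithm flow

Prepare: the Proposer generates a globally unique, increasing Proposal ID and sends a Prepare request to all Acceptors. The request does not need to carry the proposal content, only the Proposal ID.
Promise: after receiving a Prepare request, an Acceptor makes "two promises and one reply".
1. It will no longer accept Prepare requests whose Proposal ID is less than or equal to (note: <=) that of the current request.
2. It will no longer accept Propose requests (the requests sent in the Accept phase) whose Proposal ID is less than (note: <) that of the current request.
3. Without violating its earlier promises, it replies with the Value and Proposal ID of the highest-numbered proposal it has already accepted, or with an empty value if there is none.
Propose: after receiving Promise replies from a majority of Acceptors, the Proposer takes the Value of the proposal with the largest Proposal ID among the replies as the value to propose. If every reply carries an empty Value, it may choose the Value freely. It then sends a Propose request carrying the current Proposal ID to all Acceptors.
Accept: after receiving a Propose request, an Acceptor accepts and persists the current Proposal ID and Value, provided this does not violate its earlier promises.
Learn: once the Proposer has received Accepts from a majority of Acceptors, the decision is made, and it sends the decision to all Learners.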
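
The "two promises and one reply" can be made concrete with a small Acceptor sketch. This is a simplified single-value illustration under stated assumptions: persistence and networking are left out, and `PaxosAcceptor`, `Promise`, `onPrepare`, and `onPropose` are hypothetical names, not taken from any library.

```java
/** Minimal single-decree Paxos acceptor sketch; names are hypothetical. */
final class PaxosAcceptor {
    /** The reply to a Prepare request: the highest proposal accepted so far, if any. */
    static final class Promise {
        final long acceptedId;       // -1 when nothing has been accepted yet
        final String acceptedValue;  // null when nothing has been accepted yet

        Promise(long acceptedId, String acceptedValue) {
            this.acceptedId = acceptedId;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedId = -1;       // highest Proposal ID promised so far
    private long acceptedId = -1;       // Proposal ID of the accepted proposal
    private String acceptedValue = null;

    /** Prepare phase: rejects (returns null) when id <= the promised id. */
    Promise onPrepare(long id) {
        if (id <= promisedId) {
            return null;
        }
        promisedId = id;
        return new Promise(acceptedId, acceptedValue);
    }

    /** Accept phase: rejects when id < the promised id, otherwise accepts. */
    boolean onPropose(long id, String value) {
        if (id < promisedId) {
            return false;
        }
        promisedId = id;
        acceptedId = id;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptor a3 = new PaxosAcceptor();
        a3.onPrepare(1);
        a3.onPropose(1, "10%");
        Promise reply = a3.onPrepare(2);
        System.out.println(reply.acceptedId + " " + reply.acceptedValue);
    }
}
```

In the usage above, the acceptor has already accepted proposal 1, so its promise to proposal 2 reports that earlier proposal back, which is exactly the "one reply" a later Proposer must honor.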

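Paxos flow: case 1

Five council members, A1, A2, A3, A4, and A5, vote on a tax rate.

A1 sends the Prepare request for Proposal 1 and waits for Promises;
A2-A5 reply with Promises;
A1 issues the Proposal "tax rate 10%" as soon as it has received two replies (together with its own vote, a majority of the five);
A2-A5 reply with Accept;
The Proposal passes: the tax rate is 10%.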

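Paxos flow: case 2

Now assume that while A1 is making its proposal, A5 decides to set the tax rate to 20%.

A1 and A5 send Prepare requests at the same time (numbered 1 and 2 respectively).
A2 promises A1 and A4 promises A5, so A3's behavior decides the outcome.
Scenario 1: A3 receives A1's message first and promises A1.
A1 issues Proposal (1, 10%), which A2 and A3 accept.
A3 then receives A5's message, replies with the proposal it has already accepted from A1, (1, 10%), and promises A5.
Because A3's reply carries an accepted value, the Propose rule above obliges A5 to adopt it: A5 issues Proposal (2, 10%) rather than its own 20%, and A3 and A4 accept. A1 and A5 then both broadcast the decision, which is the same value.

A drawback of Paxos: under complex network conditions, a distributed system running Paxos may take a long time to converge and can even fall into livelock.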

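Paxos flow: case 3

Again assume that while A1 is making its proposal, A5 decides to set the tax rate to 20%.

A1 and A5 send Prepare requests at the same time (numbered 1 and 2 respectively).
A2 promises A1 and A4 promises A5, so A3's behavior decides the outcome.
Scenario 2: A3 receives A1's message first and promises A1, then immediately receives A5's message and promises A5.
A1 issues Proposal (1, 10%) but does not get enough responses, so it sends a new Prepare (number 3), and A3 promises A1 again.
A5 issues Proposal (2, 20%) but does not get enough responses, so it sends a new Prepare (number 4), and A3 promises A5 again.
…

The cause is that the system has more than one Proposer: several Proposers compete for the Acceptors, so agreement is delayed indefinitely. An improved Paxos was proposed for this situation: elect one node as the Leader, and allow only the Leader to issue proposals. With a single Proposer per Paxos round, livelock cannot occur, and only the first case in the examples above remains.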

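6.1.3 The ZAB Protocol

Introduction to ZAB

ZAB borrows from Paxos and is an atomic broadcast protocol with crash recovery, designed specifically for ZooKeeper. Based on this protocol, ZooKeeper lets only one server (the Leader) handle external write (transaction) requests; the Leader then synchronizes the data to the Follower nodes. In other words, only the single Leader can issue proposals.

Contents of the ZAB protocol

ZAB has two basic modes: message broadcast and crash recovery.

Message broadcast

A client issues a write request.
The Leader turns the request into a transaction Proposal and assigns each Proposal a globally unique ID, the zxid.
The Leader keeps a separate queue for each Follower, places the Proposals to broadcast into that queue in order, and sends them in FIFO order.
When a Follower receives a Proposal, it first writes it to local disk as a transaction log entry, and after a successful write it returns an Ack to the Leader.
Once the Leader has received Acks from more than half of the Followers, it considers the Proposal delivered and can send the commit message.
The Leader broadcasts the commit message to all Followers and commits the transaction itself. A Follower that receives the commit message commits the transaction it logged earlier.
The core idea of ZooKeeper's use of ZAB is that once any server has committed a Proposal, every server must eventually commit that Proposal correctly.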
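
The Ack-counting step of the broadcast can be sketched as below. This is only an illustration under assumptions: it treats the Leader's own log write as one acknowledgement and requires a strict majority of the whole ensemble, and `ProposalAcks` and `ack` are hypothetical names rather than ZooKeeper classes.

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks acknowledgements for one Proposal; names are hypothetical. */
final class ProposalAcks {
    private final int ensembleSize;
    private final Set<Long> ackers = new HashSet<>();

    /** Assumption of this sketch: the Leader counts itself as the first ack. */
    ProposalAcks(int ensembleSize, long leaderId) {
        this.ensembleSize = ensembleSize;
        ackers.add(leaderId);
    }

    /** Records an ack and returns true once a strict majority has acked. */
    boolean ack(long serverId) {
        ackers.add(serverId);   // a Set ignores duplicate acks from one server
        return ackers.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        ProposalAcks acks = new ProposalAcks(5, 1L);
        System.out.println(acks.ack(2L));  // 2 of 5: not yet committable
        System.out.println(acks.ack(3L));  // 3 of 5: commit can be broadcast
    }
}
```

The set-based bookkeeping is a design choice of the sketch: a retransmitted Ack from the same Follower must not push the count over the threshold.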

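ZAB's handling of a transaction request resembles a two-phase commit:

(1) Broadcast the transaction

(2) Broadcast the commit

With this two-phase model, a Leader crash can leave the data inconsistent, for example:

(1) The Leader crashes right after issuing transaction Proposal1, and no Follower has Proposal1.

(2) The Leader crashes after receiving Acks from more than half of the Followers, before it can send Commit to the Followers.

How is this solved? ZAB introduces the crash recovery mode.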

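Crash recovery: failure assumptions

When the Leader crashes, or loses contact with more than half of the Followers because of network problems, the cluster enters crash recovery mode.

Assume two kinds of server failure:
- The Leader crashes right after proposing a transaction.
- A transaction has been committed on the Leader and more than half of the Followers have replied with Ack, but the Leader crashes before the Commit message is sent.
Crash recovery in ZAB must satisfy two requirements:
- A Proposal that the Leader has already committed must eventually be committed by all Followers. (A proposal that was decided must be carried out by the Followers.)
- A Proposal that the Leader has proposed but not committed must be discarded. (Proposals that never took effect are dropped.)

Crash recovery consists of two parts: Leader election and data recovery.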

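Crash recovery: Leader election

Leader election: given the requirements above, ZAB must guarantee that the elected Leader satisfies the following conditions:

(1) The new Leader must not contain uncommitted Proposals. That is, the new Leader must be a Follower node that has committed its Proposals.

(2) The new Leader holds the largest zxid. The benefit is that the Leader does not have to work out which Proposals to commit and which to discard.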

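Crash recovery: data recovery

How ZAB synchronizes data:

(1) After the election, and before it starts its real work (accepting transaction requests and issuing new Proposals), the Leader first confirms whether every Proposal in its transaction log has been committed by more than half of the servers in the cluster.

(2) The Leader must make sure that every Follower receives the Proposal of every transaction and applies all committed Proposals to its in-memory data. Only after a Follower has synchronized all missing Proposals from the Leader and applied them to memory does the Leader add that Follower to the list of Followers that are actually available.

Further reading: how are Proposals that must be discarded handled during ZAB data synchronization?

In ZAB, the transaction number zxid is a 64-bit number. The low 32 bits are a simple monotonically increasing counter: for every client transaction request, the Leader increments the counter by 1 when it creates the new Proposal. The high 32 bits hold the epoch number of the Leader's term.

The epoch number can be understood as the era, or term, the cluster is currently in. It increases by 1 after every Leader change, so when an old Leader recovers from a crash, the Followers no longer obey it, because Followers only follow commands from the Leader with the highest epoch.

Whenever a new Leader is elected, it takes the zxid of the highest-numbered Proposal in its local transaction log, extracts the epoch from that zxid, and adds 1. The result becomes the new epoch, the low 32 bits are reset to zero, and zxid generation restarts from 0.

By using the epoch number to tell Leader terms apart, ZAB prevents different Leaders from mistakenly issuing different Proposals under the same zxid.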
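
The 64-bit layout described above can be expressed with a few bit operations. The sketch below follows the text (high 32 bits epoch, low 32 bits counter); the class `Zxid` and its method names are hypothetical and introduced only for illustration.

```java
/** Illustrative zxid helpers following the layout in the text; names are hypothetical. */
final class Zxid {
    /** High 32 bits: the epoch of the Leader's term. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits: the per-epoch transaction counter. */
    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    /** Packs an epoch and a counter into one zxid. */
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    /** What a newly elected Leader starts from: epoch + 1, counter reset to 0. */
    static long firstOfNextEpoch(long lastLoggedZxid) {
        return make(epochOf(lastLoggedZxid) + 1, 0);
    }

    public static void main(String[] args) {
        long last = make(3, 7);
        System.out.println(Long.toHexString(last));
        System.out.println(Long.toHexString(firstOfNextEpoch(last)));
    }
}
```

Because the epoch occupies the high bits, any zxid from a newer epoch compares greater than every zxid from an older epoch, which is what lets Followers ignore a stale Leader.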

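Given this strategy, suppose a server that still holds uncommitted Proposals from the previous Leader's term starts up. When it joins the cluster and connects to the Leader as a Follower, the Leader compares the last Proposal committed on its own side with the Follower's Proposals. The outcome is always that the Leader asks the Follower to roll back to the latest Proposal that was actually committed by more than half of the machines in the cluster.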

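6.1.4 The CAP Theorem

The CAP theorem states that a distributed system cannot satisfy all three of the following requirements at the same time:

Consistency (C)
Availability (A)
Partition tolerance (P)

At most two of these three basic requirements can be met together. Because P is mandatory, the choice is usually between CP and AP.

Consistency (C): in a distributed environment, consistency means that data stays consistent across multiple replicas. Under this requirement, after a system in a consistent state performs an update, its data must still be in a consistent state.
Availability (A): the service must always be available, and every user request must return a result within a bounded time.
Partition tolerance (P): when any network partition occurs, the distributed system must still be able to provide a service that satisfies consistency and availability, unless the entire network has failed.

ZooKeeper guarantees CP

ZooKeeper cannot guarantee availability for every request. (Note: under extreme conditions ZooKeeper may drop some requests, and the consumer program has to retry to get a result.) In that sense ZooKeeper does not guarantee service availability.
The whole cluster is unavailable while a Leader election is in progress.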

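6.2 Source Code Details

6.2.1 Supporting Source Code

6.2.1.1 Persistence Source Code

The Leader and the Followers each keep one copy of the data in memory and one on disk, so the in-memory data has to be persisted to disk.
The classes in the org.apache.zookeeper.server.persistence package are the persistence-related code.

The data is stored under the path configured when ZooKeeper was installed (/opt/module/zookeeper-3.5.7/zkData/version-2).

Snapshot source code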

package org.apache.zookeeper.server.persistence;

public interface SnapShot {
    // Deserialize a snapshot
    long deserialize(DataTree dt, Map<Long, Integer> sessions)
        throws IOException;

    // Serialize a snapshot
    void serialize(DataTree dt, Map<Long, Integer> sessions,
                   File name)
        throws IOException;

    /**
     * find the most recent snapshot file
     */
    File findMostRecentSnapshot() throws IOException;

    // Release resources
    void close() throws IOException;
}

Transaction log source code

public interface TxnLog {
    // Set the server statistics
    void setServerStats(ServerStats serverStats);

    // Roll the log
    void rollLog() throws IOException;
    // Append a transaction
    boolean append(TxnHeader hdr, Record r) throws IOException;
    // Read transactions starting from the given zxid
    TxnIterator read(long zxid) throws IOException;

    // Get the zxid of the last logged transaction
    long getLastLoggedZxid() throws IOException;

    // Truncate the log
    boolean truncate(long zxid) throws IOException;

    // Get the DbId
    long getDbId() throws IOException;

    // Commit
    void commit() throws IOException;
    // Elapsed time of the log sync
    long getTxnLogSyncElapsedTime();

    // Close the log
    void close() throws IOException;
    // Iterator interface for reading the log
    public interface TxnIterator {
        // Get the transaction header
        TxnHeader getHeader();

        // Get the transaction body
        Record getTxn();

        // Advance to the next record
        boolean next() throws IOException;

        // Release resources
        void close() throws IOException;

        // Get the storage size
        long getStorageSize() throws IOException;
    }
}

Core classes that handle persistence

6.2.1.2 Serialization Source Code

The zookeeper-jute module contains ZooKeeper's serialization source code.

Serialization and deserialization source code

@InterfaceAudience.Public
public interface Record {
    public void serialize(OutputArchive archive, String tag)
        throws IOException;
    public void deserialize(InputArchive archive, String tag)
        throws IOException;
}

Iteration

public interface Index {
    public boolean done();
    public void incr();
}

Data types supported for serialization

public interface OutputArchive {
    public void writeByte(byte b, String tag) throws IOException;
    public void writeBool(boolean b, String tag) throws IOException;
    public void writeInt(int i, String tag) throws IOException;
    public void writeLong(long l, String tag) throws IOException;
    public void writeFloat(float f, String tag) throws IOException;
    public void writeDouble(double d, String tag) throws IOException;
    public void writeString(String s, String tag) throws IOException;
    public void writeBuffer(byte buf[], String tag)
        throws IOException;
    public void writeRecord(Record r, String tag) throws IOException;
    public void startRecord(Record r, String tag) throws IOException;
    public void endRecord(Record r, String tag) throws IOException;
    public void startVector(List<?> v, String tag) throws IOException;
    public void endVector(List<?> v, String tag) throws IOException;
    public void startMap(TreeMap<?,?> v, String tag) throws IOException;
    public void endMap(TreeMap<?,?> v, String tag) throws IOException;

}

Data types supported for deserialization

public interface InputArchive {
    public byte readByte(String tag) throws IOException;
    public boolean readBool(String tag) throws IOException;
    public int readInt(String tag) throws IOException;
    public long readLong(String tag) throws IOException;
    public float readFloat(String tag) throws IOException;
    public double readDouble(String tag) throws IOException;
    public String readString(String tag) throws IOException;
    public byte[] readBuffer(String tag) throws IOException;
    public void readRecord(Record r, String tag) throws IOException;
    public void startRecord(String tag) throws IOException;
    public void endRecord(String tag) throws IOException;
    public Index startVector(String tag) throws IOException;
    public void endVector(String tag) throws IOException;
    public Index startMap(String tag) throws IOException;
    public void endMap(String tag) throws IOException;
}
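
The write and read methods of the two interfaces mirror each other: whatever is written must be read back in the same order and with the same type. The sketch below shows that symmetry with the JDK's `DataOutputStream` and `DataInputStream` instead of jute, so it is only an analogy; `ArchiveRoundTrip`, `write`, and `read` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Round trip with JDK data streams, as an analogy to jute archives; names are hypothetical. */
final class ArchiveRoundTrip {
    /** Writes an int followed by a string, in that order. */
    static byte[] write(int version, String path) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            out.writeUTF(path);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the fields back in exactly the order they were written. */
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            String path = in.readUTF();
            return version + ":" + path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(2, "/zk")));
    }
}
```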

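6.2.2 ZK Server Initialization Source Code Analysis

6.2.2.1 ZK Server Startup Script Analysis

The ZooKeeper service is started with the command zkServer.sh start

The important parts of zkServer.sh are annotated below.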

. "$ZOOBINDIR"/zkEnv.sh  # source zkEnv.sh to pick up its environment variables (ZOOCFG="zoo.cfg")

ZOOMAIN="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=$JMXLOCALONLY org.apache.zookeeper.server.quorum.QuorumPeerMain"  # QuorumPeerMain is ZooKeeper's main class

case $1 in
start)
...
nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
 "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
 -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
 -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
...

What zkServer.sh start actually executes:

nohup "$JAVA"
+ a set of JVM arguments
+ $ZOOMAIN (org.apache.zookeeper.server.quorum.QuorumPeerMain)
+ "$ZOOCFG" # (ZOOCFG="zoo.cfg" in zkEnv.sh)

So the entry point of the program is the QuorumPeerMain.java class.

6.2.2.2 ZK Server Startup Entry Point & Parsing zoo.cfg and myid

Server startup entry point: QuorumPeerMain.java

main method

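public class QuorumPeerMain {
    ...
    public static void main(String[] args) {
        // Create a ZooKeeper node (peer)
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            // Initialize the node and run it; args carries the zoo.cfg passed on the command line
            main.initializeAndRun(args);
        } // catch blocks omitted in this excerpt
    }
    ...
}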

main.initializeAndRun(args)

protected void initializeAndRun(String[] args)
    throws ConfigException, IOException, AdminServerException
{
    // Holds the ZooKeeper configuration
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        // 1 Parse the arguments: zoo.cfg and myid
        config.parse(args[0]);
    }
    // 2 Start the scheduled task that deletes expired snapshots (disabled by default)
    // Start and schedule the purge task
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(
            config.getDataDir(), config.getDataLogDir(),
            config.getSnapRetainCount(), config.getPurgeInterval());
    purgeMgr.start();
    if (args.length == 1 && config.isDistributed()) {
        // 3 Start the cluster (initialize communication)
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running "
                 + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        ZooKeeperServerMain.main(args);
    }
}

Parsing zoo.cfg and myid

initializeAndRun(args) calls config.parse(args[0]) to parse zoo.cfg and myid.

parse method

public void parse(String path) throws ConfigException {
    LOG.info("Reading configuration from: " + path);

    try {
        // Validate the file path and check that the file exists
        File configFile = (new VerifyingFileFactory.Builder(LOG)
                           .warnForRelativePath()
                           .failForNonExistingPath()
                           .build()).create(path);

        Properties cfg = new Properties();
        FileInputStream in = new FileInputStream(configFile);
        try {
            // Load the configuration file
            cfg.load(in);
            configFileStr = path;
        } finally {
            in.close();
        }
        // Parse the configuration properties
        parseProperties(cfg);
    } catch (IOException e) {
      ...
    }   
}

parseProperties(cfg) parses the configuration file

public void parseProperties(Properties zkProp){
    for (Entry<Object, Object> entry : zkProp.entrySet()) {
        String key = entry.getKey().toString().trim();
        String value = entry.getValue().toString().trim();
        if (key.equals("dataDir")) {
            dataDir = vff.create(value);
        } else if (key.equals("dataLogDir")) {
            dataLogDir = vff.create(value);
        } else if (key.equals("clientPort")) {
            clientPort = Integer.parseInt(value);
        } else if (key.equals("localSessionsEnabled")) {
            localSessionsEnabled = Boolean.parseBoolean(value);
        } else if (key.equals("localSessionsUpgradingEnabled")) {
            localSessionsUpgradingEnabled = Boolean.parseBoolean(value);
        }
        ...
    }
    ....
    if (dynamicConfigFileStr == null) {
        // This call is where myid gets parsed
        setupQuorumPeerConfig(zkProp, true);
        if (isDistributed() && isReconfigEnabled()) {
            // we don't backup static config for standalone mode.
            // we also don't backup if reconfig feature is disabled.
            backupOldConfig();
        }
    }
}
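
Since zoo.cfg is an ordinary Java properties file, the key/value loop above can be tried out in isolation. The sketch below is a hedged illustration: the host names in the sample text are made up, and `ZooCfgDemo`, `load`, and `clientPort` are hypothetical helper names; only `java.util.Properties` itself is the real API that `parse` relies on.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Loads zoo.cfg-style text with java.util.Properties; helper names are hypothetical. */
final class ZooCfgDemo {
    /** Parses properties text the same way cfg.load(in) does for a file. */
    static Properties load(String text) {
        try {
            Properties cfg = new Properties();
            cfg.load(new StringReader(text));
            return cfg;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Mirrors the clientPort branch: trim the value, then parse it as an int. */
    static int clientPort(Properties cfg) {
        return Integer.parseInt(cfg.getProperty("clientPort").trim());
    }

    public static void main(String[] args) {
        // Sample content; the server host names are placeholders.
        String sample = String.join("\n",
                "tickTime=2000",
                "dataDir=/opt/module/zookeeper-3.5.7/zkData",
                "clientPort=2181",
                "server.1=host1:2888:3888",
                "server.2=host2:2888:3888");
        Properties cfg = load(sample);
        System.out.println(clientPort(cfg));
        System.out.println(cfg.getProperty("server.1"));
    }
}
```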

setupQuorumPeerConfig method

void setupQuorumPeerConfig(Properties prop, boolean configBackwardCompatibilityMode)
    throws IOException, ConfigException {
    quorumVerifier = parseDynamicConfig(prop, electionAlg, true, configBackwardCompatibilityMode);
    // Set up myid
    setupMyId();
    setupClientPort();
    setupPeerType();
    checkValidity();
}

setupMyId method

private void setupMyId() throws IOException {
    // Read the myid file under dataDir; this is the myid file created when the cluster was set up
    File myIdFile = new File(dataDir, "myid");
    // standalone server doesn't need myid file.
    if (!myIdFile.isFile()) {
        return;
    }
    BufferedReader br = new BufferedReader(new FileReader(myIdFile));
    String myIdString;
    try {
        // Read the first line of the file
        myIdString = br.readLine();
    } finally {
        br.close();
    }
    try {
        // Parse the id from the myid file and assign it to serverId
        serverId = Long.parseLong(myIdString);
        MDC.put("myid", myIdString);
    } catch (NumberFormatException e) {
        throw new IllegalArgumentException("serverid " + myIdString
                                           + " is not a number");
    }
}

6.2.2.3 Deleting Expired Snapshots

A scheduled task can be started to delete expired snapshots. This feature is disabled by default.

In the initializeAndRun method:

// 2 Start the scheduled task that deletes expired snapshots (disabled by default)
// config.getSnapRetainCount() = 3: the minimum number of snapshots to retain
// config.getPurgeInterval() = 0: the default 0 means disabled
// Start and schedule the purge task
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(
        config.getDataDir(), config.getDataLogDir(),
        config.getSnapRetainCount(), config.getPurgeInterval());
purgeMgr.start();


protected int snapRetainCount = 3;
protected int purgeInterval = 0;
start() method

public void start() {
    if (PurgeTaskStatus.STARTED == purgeTaskStatus) {
        LOG.warn("Purge task is already running.");
        return;
    }
    // By default purgeInterval = 0, so the task is disabled and the method returns here
    // Don't schedule the purge task with zero or negative purge interval.
    if (purgeInterval <= 0) {
        LOG.info("Purge task is not scheduled.");
        return;
    }
    // Create a timer
    timer = new Timer("PurgeTask", true);
    // Create the snapshot purge task
    // PurgeTask extends TimerTask, which implements Runnable; the timer runs it on its own thread
    TimerTask task = new PurgeTask(dataLogDir, snapDir, snapRetainCount);
    // If purgeInterval is 1, the task runs once an hour and deletes any expired snapshots it finds
    timer.scheduleAtFixedRate(task, 0, TimeUnit.HOURS.toMillis(purgeInterval));
    purgeTaskStatus = PurgeTaskStatus.STARTED;
}

The run method of PurgeTask performs the snapshot cleanup

@Override
public void run() {
    LOG.info("Purge task started.");
    try {
        // Purge the expired data
        PurgeTxnLog.purge(logsDir, snapsDir, snapRetainCount);
    } catch (Exception e) {
        LOG.error("Error occurred while purging.", e);
    }
    LOG.info("Purge task completed.");
}

public static void purge(File dataDir, File snapDir, int num) throws IOException {
    if (num < 3) {
        throw new IllegalArgumentException(COUNT_ERR_MSG);
    }

    FileTxnSnapLog txnLog = new FileTxnSnapLog(dataDir, snapDir);

    List<File> snaps = txnLog.findNRecentSnapshots(num);
    int numSnaps = snaps.size();
    if (numSnaps > 0) {
        purgeOlderSnapshots(txnLog, snaps.get(numSnaps - 1));
    }
}
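
The retention rule in purge (keep the most recent `num` snapshots, never fewer than 3) can be illustrated on plain file names. This sketch assumes snapshot files are named `snapshot.<zxid in hex>` and ignores the transaction logs that the real purge also handles; `SnapshotRetention`, `zxidOf`, and `toDelete` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Picks which snapshot files fall outside the retention window; names are hypothetical. */
final class SnapshotRetention {
    /** Extracts the zxid from a name of the assumed form snapshot.<hex zxid>. */
    static long zxidOf(String fileName) {
        return Long.parseLong(fileName.substring(fileName.indexOf('.') + 1), 16);
    }

    /** Returns the snapshots older than the newest `retain` ones. */
    static List<String> toDelete(List<String> fileNames, int retain) {
        if (retain < 3) {
            // Same lower bound as the num < 3 check in purge
            throw new IllegalArgumentException("retain count must be at least 3");
        }
        List<String> newestFirst = new ArrayList<>(fileNames);
        Comparator<String> byZxid = Comparator.comparingLong(SnapshotRetention::zxidOf);
        newestFirst.sort(byZxid.reversed());
        if (newestFirst.size() <= retain) {
            return new ArrayList<>();
        }
        return new ArrayList<>(newestFirst.subList(retain, newestFirst.size()));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("snapshot.0", "snapshot.a", "snapshot.14", "snapshot.1f");
        System.out.println(toDelete(names, 3));
    }
}
```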

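6.2.2.4 Initializing the Communication Component & Starting ZK

Initializing the communication component

initializeAndRun calls runFromConfig(config):

// 3 Start the cluster (initialize communication)
runFromConfig(config);

The communication layer defaults to NIO (Netty is also supported).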

public void runFromConfig(QuorumPeerConfig config)
    throws IOException, AdminServerException
{
    … …
        LOG.info("Starting quorum peer");
    try {
        ServerCnxnFactory cnxnFactory = null;
        ServerCnxnFactory secureCnxnFactory = null;
        // Initialize the communication component; NIO is the default
        if (config.getClientPortAddress() != null) {
            cnxnFactory = ServerCnxnFactory.createFactory();
            cnxnFactory.configure(config.getClientPortAddress(),
                                  config.getMaxClientCnxns(), false);
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                                        config.getMaxClientCnxns(), true);
        }
        // Copy the parsed settings onto this ZooKeeper node
        quorumPeer = getQuorumPeer();
       ...
        quorumPeer.setTickTime(config.getTickTime());
        quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
        quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
        quorumPeer.setInitLimit(config.getInitLimit());
        quorumPeer.setSyncLimit(config.getSyncLimit());
        quorumPeer.setConfigFileName(config.getConfigFilename());
        // Manages ZooKeeper data storage
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
        if (config.getLastSeenQuorumVerifier()!=null) {
            quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
        }
        quorumPeer.initConfigInZKDatabase();
        // Manages ZooKeeper communication
        quorumPeer.setCnxnFactory(cnxnFactory);
        quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
        quorumPeer.setSslQuorum(config.isSslQuorum());
        quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
        quorumPeer.setLearnerType(config.getPeerType());
        quorumPeer.setSyncEnabled(config.getSyncEnabled());
        quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
        if (config.sslQuorumReloadCertFiles) {
            quorumPeer.getX509Util().enableCertFileReloading();
        }
        … …
            quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
        quorumPeer.initialize();

        // Start ZooKeeper
        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    }
}

ServerCnxnFactory.createFactory() method

static public ServerCnxnFactory createFactory() throws IOException {
    String serverCnxnFactoryName =
        System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (serverCnxnFactoryName == null) {
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory)
            Class.forName(serverCnxnFactoryName)
            .getDeclaredConstructor().newInstance();
        LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
        return serverCnxnFactory;
    } catch (Exception e) {
        IOException ioe = new IOException("Couldn't instantiate "
                                          + serverCnxnFactoryName);
        ioe.initCause(e);
        throw ioe;
    }
}

public static final String ZOOKEEPER_SERVER_CNXN_FACTORY = "zookeeper.serverCnxnFactory";


/**
 * The default is NIO
 **/
From zookeeperAdmin.md:
* *serverCnxnFactory* :
(Java system property: **zookeeper.serverCnxnFactory**)
Specifies ServerCnxnFactory implementation.
This should be set to `NettyServerCnxnFactory` in order to use TLS based server communication.
Default is `NIOServerCnxnFactory`.
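
The selection pattern used by createFactory (read a class name from a system property, fall back to a default, instantiate reflectively) can be tried out without ZooKeeper. In the sketch below everything is a stand-in: `FactoryLoader`, `ConnFactory`, `NioFactory`, `NettyFactory`, and the property name `demo.cnxnFactory` are hypothetical and only imitate the shape of the real code.

```java
/** Reflection-based factory selection, imitating createFactory; all names are hypothetical. */
final class FactoryLoader {
    interface ConnFactory {
        String name();
    }

    static final class NioFactory implements ConnFactory {
        public String name() { return "nio"; }
    }

    static final class NettyFactory implements ConnFactory {
        public String name() { return "netty"; }
    }

    /** Uses the class named by the system property, or NioFactory when it is unset. */
    static ConnFactory create(String propertyName) {
        String className = System.getProperty(propertyName);
        if (className == null) {
            className = NioFactory.class.getName();
        }
        try {
            return (ConnFactory) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Couldn't instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create("demo.cnxnFactory").name());
        System.setProperty("demo.cnxnFactory", NettyFactory.class.getName());
        System.out.println(create("demo.cnxnFactory").name());
    }
}
```

The appeal of this design is that the implementation can be switched with a JVM flag and no code change, which is how the documentation quoted above describes enabling the Netty factory.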

Initializing the NIO server socket (not started yet)

public void runFromConfig(QuorumPeerConfig config) {
    ...
    if (config.getClientPortAddress() != null) {
        cnxnFactory = ServerCnxnFactory.createFactory();
        cnxnFactory.configure(config.getClientPortAddress(),
                              config.getMaxClientCnxns(),
                              false);
    }
}


@Override
public void configure(InetSocketAddress addr, int maxcc, boolean secure) throws IOException {
    ...
    // Initialize the NIO server socket and bind port 2181 so that it can accept client requests
    this.ss = ServerSocketChannel.open();
    ss.socket().setReuseAddress(true);
    LOG.info("binding to port " + addr);
    // Bind port 2181
    ss.socket().bind(addr);
    ss.configureBlocking(false);
    acceptThread = new AcceptThread(ss, addr, selectorThreads);
}
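
The same open, bind, and non-blocking sequence can be reproduced with the JDK alone. The sketch below binds to port 0 (an ephemeral port chosen by the operating system) instead of 2181 so that it does not clash with a running server; `NioBindDemo` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Opens a non-blocking server socket the way configure() does; names are hypothetical. */
final class NioBindDemo {
    static ServerSocketChannel openNonBlocking(int port) throws IOException {
        ServerSocketChannel ss = ServerSocketChannel.open();
        ss.socket().setReuseAddress(true);
        ss.socket().bind(new InetSocketAddress(port));
        ss.configureBlocking(false);
        return ss;
    }

    /** In non-blocking mode accept() returns null when no client is waiting. */
    static boolean idleAcceptReturnsNull() {
        try (ServerSocketChannel ss = openNonBlocking(0)) {
            SocketChannel client = ss.accept();
            return client == null && !ss.isBlocking();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idleAcceptReturnsNull());
    }
}
```

Binding alone does not serve any client: as in the excerpt above, something still has to run an accept loop, which is why the heading says the socket is initialized but not yet started.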

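6.2.3 ZK Server Data Loading Source Code Analysis

The ZooKeeper data model is a tree, the DataTree; each node in it is a DataNode.
The DataTrees across the cluster are kept in sync at all times.
Every node in a ZooKeeper cluster holds a complete copy of the data both in memory and on disk.
1. In-memory data: DataTree
2. On-disk data: snapshot files + transaction logs (edit logs)

Restoring data on a cold start

initializeAndRun() --> runFromConfig(config) --> quorumPeer.start() --> loadDataBase() --> zkDb.loadDataBase() --> snapLog.restore() -->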
private void loadDataBase() {
    try {
        // Load the on-disk data into memory and rebuild the DataTree.
        // ZooKeeper operations fall into two kinds: transactional and non-transactional.
        // Transactional, e.g. zk.create(): each one is assigned a globally unique 64-bit zxid
        //   (high 32 bits: epoch, the term number of a leader; low 32 bits: txid, the transaction id).
        // Non-transactional, e.g. zk.getData().
        // Recovery steps:
        // (1) Restore most of the data from the snapshot file and obtain a lastProcessedZxid.
        // (2) Replay the transaction log up to its last entry, updating lastProcessedZxid.
        // (3) The resulting DataTree and lastProcessedZxid mark the end of recovery.
        zkDb.loadDataBase();
        ...
    } // catch block omitted in this excerpt
}
// =============================================================================
public long loadDataBase() throws IOException {
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
                                commitProposalPlaybackListener);
    initialized = true;
    return zxid;
}
// =============================================================================
public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
    // Restore the snapshot file into the DataTree
    long deserializeResult = snapLog.deserialize(dt, sessions);
    FileTxnLog txnLog = new FileTxnLog(dataDir);
    RestoreFinalizer finalizer = () -> {
        // Replay the transaction log into the DataTree
        long highestZxid = fastForwardFromEdits(dt, sessions, listener);
        return highestZxid;
    };
    ...
    return finalizer.run();
}

Restoring snapshot data into the DataTree

restore() --> snapLog.deserialize(dt, sessions); --> deserialize(dt, sessions, ia) --> SerializeUtils.deserializeSnapshot(dt,ia,sessions) --> dt.deserialize(ia, "tree") -->

public long deserialize(DataTree dt, Map<Long, Integer> sessions)
    throws IOException {
    ...
    // Go through the snapshot files one by one
    for (int i = 0, snapListSize = snapList.size(); i < snapListSize; i++) {
        snap = snapList.get(i);
        LOG.info("Reading snapshot " + snap);
        // Prepare the streams for deserialization
        try (InputStream snapIS = new BufferedInputStream(new FileInputStream(snap));
             CheckedInputStream crcIn = new CheckedInputStream(snapIS, new Adler32())) {
            InputArchive ia = BinaryInputArchive.getArchive(crcIn);
            // Deserialize and restore the data into the DataTree
            deserialize(dt, sessions, ia);
            long checkSum = crcIn.getChecksum().getValue();
            long val = ia.readLong("val");
            if (val != checkSum) {
                throw new IOException("CRC corruption in snapshot :  " + snap);
            }
            foundValid = true;
            break;
        } catch (IOException e) {
            LOG.warn("problem reading snap file " + snap, e);
        }
    }
   ...
}

public void deserialize(DataTree dt, Map<Long, Integer> sessions,
   ....
    // Restore the snapshot data into the DataTree
    SerializeUtils.deserializeSnapshot(dt,ia,sessions);
}

public static void deserializeSnapshot(DataTree dt,InputArchive ia,
 Map<Long, Integer> sessions) throws IOException {
...
// Restore the snapshot data into the DataTree
 dt.deserialize(ia, "tree");
}

                        
public void deserialize(InputArchive ia, String tag) throws IOException {
    aclCache.deserialize(ia);
    nodes.clear();
    pTrie.clear();
    String path = ia.readString("path");
    // Restore every DataNode from the snapshot into the DataTree
    while (!"/".equals(path)) {
        // Create one node object per iteration
        DataNode node = new DataNode();
        ia.readRecord(node, "node");
        // Put the DataNode back into the DataTree
        nodes.put(path, node);
        synchronized (node) {
            aclCache.addUsage(node.acl);
        }
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash == -1) {
            root = node;
        } else {
            // Look up the parent node
            String parentPath = path.substring(0, lastSlash);
            DataNode parent = nodes.get(parentPath);
            if (parent == null) {
                throw new IOException("Invalid Datatree, unable to find " +
                                      "parent " + parentPath + " of path " + path);
            }
            // Register this node as a child of its parent
            parent.addChild(path.substring(lastSlash + 1));
            
            // Handle ephemeral, container, and TTL nodes
            long eowner = node.stat.getEphemeralOwner();
            EphemeralType ephemeralType = EphemeralType.get(eowner);
            if (ephemeralType == EphemeralType.CONTAINER) {
                containers.add(path);
            } else if (ephemeralType == EphemeralType.TTL) {
                ttls.add(path);
            } else if (eowner != 0) {
                HashSet<String> list = ephemerals.get(eowner);
                if (list == null) {
                    list = new HashSet<String>();
                    ephemerals.put(eowner, list);
                }
                list.add(path);
            }
        }
        path = ia.readString("path");
    }
    nodes.put("/", root);
    // we are done with deserializing the
    // the datatree
    // update the quotas - create path trie
    // and also update the stat nodes
    setupQuota();

    aclCache.purgeUnused();
}

Replaying the transaction log into the DataTree

Back in the restore method of FileTxnSnapLog.java:

public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
    ...
        RestoreFinalizer finalizer = () -> {
        // Replay the transaction log into the DataTree
        long highestZxid = fastForwardFromEdits(dt, sessions, listener);
        return highestZxid;
    };
}

//======================================================================================
 public long fastForwardFromEdits(DataTree dt, Map<Long, Integer> sessions,
                                     PlayBackListener listener) throws IOException {
        // The snapshot has already restored most of the data; continue from the snapshot's zxid + 1
        TxnIterator itr = txnLog.read(dt.lastProcessedZxid+1);
        // Largest zxid in the snapshot; it keeps advancing as log entries are replayed
        long highestZxid = dt.lastProcessedZxid;
        TxnHeader hdr;
        try {
            // Starting after lastProcessedZxid, keep replaying the transactions that are not yet restored
            while (true) {
                // iterator points to
                // the first valid txn when initialized
                hdr = itr.getHeader();
                if (hdr == null) {
                    //empty logs
                    return dt.lastProcessedZxid;
                }
                if (hdr.getZxid() < highestZxid && highestZxid != 0) {
                    LOG.error("{}(highestZxid) > {}(next log) for type {}",
                            highestZxid, hdr.getZxid(), hdr.getType());
                } else {
                    highestZxid = hdr.getZxid();
                }
                try {
                    // Apply the log entry to the DataTree; highestZxid advances with each replayed transaction
                    processTransaction(hdr,dt,sessions, itr.getTxn());
                } catch(KeeperException.NoNodeException e) {
                   throw new IOException("Failed to process transaction type: " +
                         hdr.getType() + " error: " + e.getMessage(), e);
                }
                listener.onTxnLoaded(hdr, itr.getTxn());
                if (!itr.next())
                    break;
            }
        } finally {
            if (itr != null) {
                itr.close();
            }
        }
        return highestZxid;
 }
// =============================================================================================
public void processTransaction(TxnHeader hdr,DataTree dt,
            Map<Long, Integer> sessions, Record txn)
        throws KeeperException.NoNodeException {
// Handles node creation, node deletion, and every other transaction type
            rc = dt.processTxn(hdr, txn);
}
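
The two-stage recovery (snapshot first, then the log entries after the snapshot's zxid) can be sketched with a plain map standing in for the DataTree. This is a simplified model under assumptions: a transaction is reduced to "set or delete a path", and `ReplayDemo`, `Txn`, and `restore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Snapshot-plus-log recovery on a map that stands in for the DataTree; names are hypothetical. */
final class ReplayDemo {
    /** A simplified transaction: data == null means "delete the path". */
    static final class Txn {
        final long zxid;
        final String path;
        final String data;

        Txn(long zxid, String path, String data) {
            this.zxid = zxid;
            this.path = path;
            this.data = data;
        }
    }

    /** Applies every log entry newer than the snapshot and returns the highest zxid seen. */
    static long restore(Map<String, String> tree, long snapshotZxid, List<Txn> log) {
        long highestZxid = snapshotZxid;
        for (Txn txn : log) {
            if (txn.zxid <= snapshotZxid) {
                continue;   // already reflected in the snapshot
            }
            if (txn.data == null) {
                tree.remove(txn.path);
            } else {
                tree.put(txn.path, txn.data);
            }
            highestZxid = Math.max(highestZxid, txn.zxid);
        }
        return highestZxid;
    }

    public static void main(String[] args) {
        Map<String, String> tree = new HashMap<>();
        tree.put("/a", "1");                       // state loaded from a snapshot at zxid 5
        List<Txn> log = new ArrayList<>();
        log.add(new Txn(4, "/a", "old"));          // older than the snapshot, skipped
        log.add(new Txn(6, "/b", "2"));
        log.add(new Txn(7, "/a", null));
        System.out.println(restore(tree, 5, log) + " " + tree);
    }
}
```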

6.2.4 ZK Election Source Code Analysis

6.2.4.1 Election Preparation Source Code Analysis

QuorumPeer.java

@Override
public synchronized void start() {
    ...
    // Prepare for the election
    startLeaderElection();
    super.start();
}

synchronized public void startLeaderElection() {
    try {
        if (getPeerState() == ServerState.LOOKING) {
            // Create the vote
            // (1) A vote consists of: epoch (the leader's term number), zxid (the number of a transaction executed during some leader's term), myid (serverid)
            // (2) At the start of an election every server votes for itself
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    }
    ...
    // Create the election algorithm instance
    this.electionAlg = createElectionAlgorithm(electionType);
}
    
protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;

    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
            //...
        case 3:
            // 1 Create the QuorumCnxManager, which handles all network communication during the election
            QuorumCnxManager qcm = createCnxnManager();
            QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
            if (oldQcm != null) {
                LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
                oldQcm.halt();
            }
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                // 2 Start the listener thread
                listener.start();
                // 3 Get ready to start the election
                FastLeaderElection fle = new FastLeaderElection(this, qcm);
                fle.start();
                le = fle;
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
    }
    return le;
}

// Initialization of the network communication component
public QuorumCnxManager(QuorumPeer self,
                        final long mySid,
                        Map<Long,QuorumPeer.QuorumServer> view,
                        QuorumAuthServer authServer,
                        QuorumAuthLearner authLearner,
                        int socketTimeout,
                        boolean listenOnAllIPs,
                        int quorumCnxnThreadsSize,
                        boolean quorumSaslAuthEnabled) {
    // Create the queues and lookup maps
    this.recvQueue = new ArrayBlockingQueue<Message>(RECV_CAPACITY);
    this.queueSendMap = new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();
    this.senderWorkerMap = new ConcurrentHashMap<Long, SendWorker>();
    this.lastMessageSent = new ConcurrentHashMap<Long, ByteBuffer>();

    //...
}
    
// Listener thread initialization
// Navigate to QuorumCnxManager.Listener and find its run() method
/**
         * Sleeps on accept().
         */
@Override
public void run() {
    // Bind the server address
    ss.bind(addr);
    // Loop until shutdown
    while (!shutdown) {
        try {
            // Block, waiting for an incoming connection request
            client = ss.accept();
        }
    }
}

// Prepare for the election
    public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
        this.stop = false;
        this.manager = manager;
        starter(self, manager);
    }
    private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;
        // Initialize the queues and the messenger
        sendqueue = new LinkedBlockingQueue<ToSend>();
        recvqueue = new LinkedBlockingQueue<Notification>();
        this.messenger = new Messenger(manager);
    }

6.2.4.2 Election execution source code analysis

[Figure: execution flow of the election execution source code]

QuorumPeer.java

public synchronized void start() {
    ...
    // Run the election
    super.start();
}

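Calling super.start() starts the thread, which amounts to executing the run() method of QuorumPeer.java.

After ZooKeeper starts, every server is initially in the LOOKING state. The election makes one server the Leader and the remaining servers Followers.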
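
The election is decided by the majority rule from section 5.1: a candidate wins once more than half of the ensemble has voted for it. The following is a minimal sketch of that check, not ZooKeeper's own code. The class name `MajorityCheck`, the method `hasQuorum`, and the map layout (voter sid to the sid it voted for) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class MajorityCheck {
    /**
     * Returns true when strictly more than half of the ensemble voted for the candidate.
     * votesBySid maps a voter's server id to the server id it voted for (hypothetical layout).
     */
    static boolean hasQuorum(Map<Long, Long> votesBySid, long candidate, int ensembleSize) {
        int count = 0;
        for (long votedFor : votesBySid.values()) {
            if (votedFor == candidate) {
                count++;
            }
        }
        // "More than half": 3 of 5 is enough, 2 of 4 is not.
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        votes.put(3L, 3L);
        System.out.println("server 3 has quorum: " + hasQuorum(votes, 3L, 5));
    }
}
```

The strict inequality is also why an odd ensemble size is recommended: a 4-node ensemble tolerates no more failures than a 3-node one.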

@Override
public void run() {
    //...
    while (running) {
        switch (getPeerState()) {
            case LOOKING:
                LOG.info("LOOKING");
                //...
                // Run the election; when it finishes, it returns the winning vote, i.e. the vote for the elected Leader
                setCurrentVote(makeLEStrategy().lookForLeader());
                break;
                //...
            case FOLLOWING:
                //...
                setFollower(makeFollower(logFactory));
                follower.followLeader();
                //...
            case LEADING:
                //...
                setLeader(makeLeader(logFactory));
                leader.lead();
                //...
        }
    }
}

lookForLeader() is implemented in FastLeaderElection.java

public Vote lookForLeader() throws InterruptedException {
    //...
    try {
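        // During a normal startup, every other server sends me a vote
        // Holds the latest valid vote received from each server
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        // Holds votes that come from outside the current election round
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
        // Timeout for waiting on notifications; it starts at finalizeWait, which defaults to 200 ms (0.2 s)
        int notTimeout = finalizeWait;
        // Every new election round increments logicalclock
        // Until a valid epoch is available, the logical clock is used in its place
        // Leader election rule: compare epoch (term), then zxid (transaction id), then serverid (myid); the larger value wins
        synchronized(this){
            // Update the logical clock; this must happen for every election
            // logicalclock serves as the election epoch
            logicalclock.incrementAndGet();
            // Update the proposal (serverid, zxid, epoch)
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());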
        }
        LOG.info("New election. My id = " + self.getId() +
                 ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        // Broadcast the vote: send my own vote to the other servers
        sendNotifications();
        /*
         * Loop in which we exchange notifications until we find a leader
         */
        // Keep running election rounds until a leader is elected
        while ((self.getPeerState() == ServerState.LOOKING) &&
               (!stop)){
            … …
        }
        return null;
    } finally {
        … …
    }
}
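
The comparison rule stated in the comments above (epoch first, then zxid, then server id, larger value wins) can be sketched as a small predicate. This is an illustration of the rule only, not the method ZooKeeper uses internally. The class name `VoteOrder` and the method `newVoteWins` are hypothetical.

```java
final class VoteOrder {
    /**
     * Returns true if the incoming vote (newEpoch, newZxid, newId) beats the
     * current vote (curEpoch, curZxid, curId) under the rule:
     * larger epoch wins; on a tie larger zxid wins; on a tie larger server id wins.
     */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // First startup: epochs and zxids are equal, so the larger server id wins.
        System.out.println(newVoteWins(3, 0, 0, 2, 0, 0));
        // A higher epoch beats a larger zxid and a larger server id.
        System.out.println(newVoteWins(1, 5, 2, 3, 9, 1));
    }
}
```

A server that receives a winning vote adopts it as its own proposal and rebroadcasts it, which is how the votes converge on one candidate.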

sendNotifications() broadcasts the vote, sending this server's vote to the other servers

private void sendNotifications() {
    // Iterate over the voting participants and send the vote to each server
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        // Build the outgoing vote
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                                   proposedLeader,
                                   proposedZxid,
                                   logicalclock.get(),
                                   QuorumPeer.ServerState.LOOKING,
                                   sid,
                                   proposedEpoch, qv.toString().getBytes());
        // ...
        // Put the outgoing vote on the send queue
        sendqueue.offer(notmsg);
    }
}

In FastLeaderElection.java, find the WorkerSender thread and look at its run() method

public void run() {
    while (!stop) {
        try {
            // Block on the queue, always ready to pick up the next vote to send
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if(m == null) continue;
            // Process the vote to be sent
            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
    LOG.info("WorkerSender is down");
}
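
The WorkerSender loop is a plain producer/consumer pattern: producers offer messages to a blocking queue, and one worker thread polls with a timeout so that it can re-check its stop flag regularly. Below is a stand-alone sketch of that pattern, assuming `String` messages and a hypothetical class `SenderLoopSketch`; the real thread sends `ToSend` objects through `process(m)`.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class SenderLoopSketch {
    private final LinkedBlockingQueue<String> sendqueue = new LinkedBlockingQueue<>();
    private final List<String> sent = new CopyOnWriteArrayList<>();
    private volatile boolean stop = false;
    private final Thread worker = new Thread(this::run, "WorkerSenderSketch");

    private void run() {
        while (!stop) {
            try {
                // Poll with a timeout so the stop flag is re-checked periodically.
                String m = sendqueue.poll(50, TimeUnit.MILLISECONDS);
                if (m == null) {
                    continue;
                }
                sent.add(m); // stand-in for process(m)
            } catch (InterruptedException e) {
                break;
            }
        }
    }

    void start() {
        worker.start();
    }

    void offer(String message) {
        sendqueue.offer(message);
    }

    /** Waits until the queue is empty, stops the worker, and returns what was "sent". */
    List<String> drainAndStop() {
        try {
            while (!sendqueue.isEmpty()) {
                Thread.sleep(5);
            }
            stop = true;
            worker.join();
        } catch (InterruptedException e) {
            stop = true;
            Thread.currentThread().interrupt();
        }
        return sent;
    }

    public static void main(String[] args) {
        SenderLoopSketch sketch = new SenderLoopSketch();
        sketch.start();
        sketch.offer("vote for server 3");
        System.out.println(sketch.drainAndStop());
    }
}
```

The timeout on `poll` is the design choice worth noting: a plain `take()` would block forever and the thread could only be stopped by interruption.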

void process(ToSend m) {
    ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
                                        m.leader,
                                        m.zxid,
                                        m.electionEpoch,
                                        m.peerEpoch,
                                        m.configData);
    // Send the vote
    manager.toSend(m.sid, requestBuffer);
}

public void toSend(Long sid, ByteBuffer b) {
    /*
     * If sending message to myself, then simply enqueue it (loopback).
     */
    // If the message is addressed to myself, put it straight into my own RecvQueue
    if (this.mySid == sid) {
        b.position(0);
        addToRecvQueue(new Message(b.duplicate(), sid));
        /*
         * Otherwise send to the corresponding thread to send.
         */
    } else {
        /*
         * Start a new connection if doesn't have one already.
         */
        // If it is addressed to another server, create the matching send queue (or reuse the existing one) and put the message into it
        ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
        ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
        if (oldq != null) {
            addToSendQueue(oldq, b);
        } else {
            addToSendQueue(bq, b);
        }
        // Make sure a connection to sid exists so that the vote can be sent out
        connectOne(sid);
    }
}
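
The routing in toSend() has two branches: a loopback to the local receive queue, and a per-peer send queue created on first use with `putIfAbsent`. The sketch below reproduces just that routing. The class `SendRouterSketch`, its fields, the receive capacity of 100, and the choice to drop the oldest message when a send queue is full are all assumptions for illustration; the excerpt above simply calls `queue.add(buffer)`.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

final class SendRouterSketch {
    final long mySid;
    final int sendCapacity; // assumed configurable here
    final ArrayBlockingQueue<ByteBuffer> recvQueue = new ArrayBlockingQueue<>(100);
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    SendRouterSketch(long mySid, int sendCapacity) {
        this.mySid = mySid;
        this.sendCapacity = sendCapacity;
    }

    void toSend(long sid, ByteBuffer b) {
        if (sid == mySid) {
            // Loopback: a message to myself never touches the network.
            recvQueue.offer(b.duplicate());
            return;
        }
        // putIfAbsent returns the existing queue, or null if ours was installed.
        ArrayBlockingQueue<ByteBuffer> fresh = new ArrayBlockingQueue<>(sendCapacity);
        ArrayBlockingQueue<ByteBuffer> existing = queueSendMap.putIfAbsent(sid, fresh);
        ArrayBlockingQueue<ByteBuffer> queue = (existing != null) ? existing : fresh;
        // Assumption of this sketch: when the queue is full, drop the oldest message.
        if (queue.remainingCapacity() == 0) {
            queue.poll();
        }
        queue.offer(b);
    }

    public static void main(String[] args) {
        SendRouterSketch router = new SendRouterSketch(1L, 1);
        router.toSend(1L, ByteBuffer.allocate(4)); // loopback
        router.toSend(2L, ByteBuffer.allocate(4)); // queued for server 2
        System.out.println("recv=" + router.recvQueue.size()
                + " peers=" + router.queueSendMap.keySet());
    }
}
```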

// If the data is addressed to myself, add it to my own receive queue
public void addToRecvQueue(Message msg) {
    // ...
    // Add the vote addressed to myself to the recvQueue
    recvQueue.add(msg);
    // ...  
}

// Add data to a send queue
private void addToSendQueue(ArrayBlockingQueue<ByteBuffer> queue,ByteBuffer buffer) {
    // Add the outgoing message to the send queue
    queue.add(buffer);

}

Establish a connection to the server node the message is addressed to

/**
 * connectOne(long sid) -> connectOne(sid, lastProposedView.get(sid).electionAddr)
 * -> initiateConnection(final Socket sock, final Long sid) -> startConnection(sock, sid)
 */
synchronized void connectOne(long sid){
    if (senderWorkerMap.get(sid) != null) {
        LOG.debug("There is a connection already for server " + sid);
        return;
    }
    synchronized (self.QV_LOCK) {
        boolean knownId = false;
        // Resolve hostname for the remote server before attempting to
        // connect in case the underlying ip address has changed.
        self.recreateSocketAddresses(sid);
        Map<Long, QuorumPeer.QuorumServer> lastCommittedView = self.getView();
        QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();
        Map<Long, QuorumPeer.QuorumServer> lastProposedView = lastSeenQV.getAllMembers();
        if (lastCommittedView.containsKey(sid)) {
            knownId = true;
            if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
                return;
        }
        if (lastSeenQV != null && lastProposedView.containsKey(sid)
            && (!knownId || (lastProposedView.get(sid).electionAddr !=
                             lastCommittedView.get(sid).electionAddr))) {
            knownId = true;
            if (connectOne(sid, lastProposedView.get(sid).electionAddr))
                return;
        }
        if (!knownId) {
            LOG.warn("Invalid server id: " + sid);
            return;
        }
    }
}


private boolean startConnection(Socket sock, Long sid)
    throws IOException {
    DataOutputStream dout = null;
    DataInputStream din = null;
    try {
        // Use BufferedOutputStream to reduce the number of IP packets. This is
        // important for x-DC scenarios.
        // Send data to the remote server through the output stream
        BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());
        dout = new DataOutputStream(buf);

        // Sending id and challenge
        // represents protocol version (in other words - message type)
        dout.writeLong(PROTOCOL_VERSION);
        dout.writeLong(self.getId());
        String addr = formatInetAddr(self.getElectionAddress());
        byte[] addr_bytes = addr.getBytes();
        dout.writeInt(addr_bytes.length);
        dout.write(addr_bytes);
        dout.flush();

        // Read the votes sent by the peer through the input stream
        din = new DataInputStream(
            new BufferedInputStream(sock.getInputStream()));
    } catch (IOException e) {
        LOG.warn("Ignoring exception reading or writing challenge: ", e);
        closeSocket(sock);
        return false;
    }

    // authenticate learner
    QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);
    if (qps != null) {
        // TODO - investigate why reconfig makes qps null.
        authLearner.authenticate(sock, qps.hostname);
    }

    // If lost the challenge, then drop the new connection
    // If the peer's id is larger than mine, I am not entitled to initiate a connection to it, so I close my own client socket
    if (sid > self.getId()) {
        LOG.info("Have smaller server identifier, so dropping the " +
                 "connection: (" + sid + ", " + self.getId() + ")");
        closeSocket(sock);
        // Otherwise proceed with the connection
    } else {
        // Initialize the sender and the receiver
        SendWorker sw = new SendWorker(sock, sid);
        RecvWorker rw = new RecvWorker(sock, din, sid, sw);
        sw.setRecv(rw);

        SendWorker vsw = senderWorkerMap.get(sid);

        if(vsw != null)
            vsw.finish();

        senderWorkerMap.put(sid, sw);
        queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
            SEND_CAPACITY));

        // Start the sender thread and the receiver thread
        sw.start();
        rw.start();

        return true;

    }
    return false;
}
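
The id comparison in startConnection() means an outgoing connection is kept only when the remote sid is not larger than the local id, so for any pair of distinct servers at most one direction survives. A minimal sketch of that tie-break follows; the class `ConnectionTieBreak` and the method `keepOutgoing` are hypothetical names.

```java
final class ConnectionTieBreak {
    /**
     * Mirrors the check in startConnection(): the outgoing connection is dropped
     * when the remote sid is larger than my own id, and kept otherwise.
     */
    static boolean keepOutgoing(long myId, long remoteSid) {
        return remoteSid <= myId;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3};
        for (long a : ids) {
            for (long b : ids) {
                if (a != b) {
                    System.out.println(a + " -> " + b + " kept: " + keepOutgoing(a, b));
                }
            }
        }
    }
}
```

Keeping a single connection per pair avoids two servers holding duplicate sockets to each other after they dial simultaneously.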

Open SendWorker and find the run() method of that class

@Override
public void run() {
    //...
    try {
        // As long as the connection has not been closed
        while (running && !shutdown && sock != null) {

            ByteBuffer b = null;
            try {
                ArrayBlockingQueue<ByteBuffer> bq = queueSendMap
                    .get(sid);
                if (bq != null) {
                    // Keep taking outgoing messages from the send queue and sending them
                    b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
                } else {
                    LOG.error("No queue of incoming messages for " +
                              "server " + sid);
                    break;
                }

                if(b != null){
                    // Record the most recent message sent to server sid
                    lastMessageSent.put(sid, b);
                    // Perform the send
                    send(b);
                }
            } catch (InterruptedException e) {
                LOG.warn("Interrupted while waiting for message on queue",
                         e);
            }
        }
    } 
    // ...
}

synchronized void send(ByteBuffer b) throws IOException {
    byte[] msgBytes = new byte[b.capacity()];
    try {
        b.position(0);
        b.get(msgBytes);
    } catch (BufferUnderflowException be) {
        LOG.error("BufferUnderflowException ", be);
        return;
    }
    // Write the message out through the output stream
    dout.writeInt(b.capacity());
    dout.write(b.array());
    dout.flush();
}

Open RecvWorker and find the run() method of that class

@Override
public void run() {
    threadCnt.incrementAndGet();
    try {
        // As long as the connection has not been closed
        while (running && !shutdown && sock != null) {
            int length = din.readInt();
            if (length <= 0 || length > PACKETMAXSIZE) {
                throw new IOException(
                    "Received packet with invalid packet: "
                    + length);
            }
            byte[] msgArray = new byte[length];
            // Receive the message from the input stream
            din.readFully(msgArray, 0, length);
            ByteBuffer message = ByteBuffer.wrap(msgArray);
            // Queue the vote sent by the peer
            addToRecvQueue(new Message(message.duplicate(), sid));
        }
    }//...
}
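
SendWorker.send() and RecvWorker.run() together implement length-prefixed framing: a 4-byte length written with `writeInt`, followed by exactly that many payload bytes read back with `readFully`. The round trip can be tried in isolation with in-memory streams. In this sketch the class `FramingSketch`, its method names, and the value chosen for `PACKETMAXSIZE` are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class FramingSketch {
    static final int PACKETMAXSIZE = 1024 * 512; // assumed upper bound for this sketch

    /** Writes one frame: a 4-byte big-endian length followed by the payload. */
    static void writeFrame(DataOutputStream dout, byte[] payload) throws IOException {
        dout.writeInt(payload.length);
        dout.write(payload);
        dout.flush();
    }

    /** Reads one frame, rejecting lengths that are non-positive or too large. */
    static byte[] readFrame(DataInputStream din) throws IOException {
        int length = din.readInt();
        if (length <= 0 || length > PACKETMAXSIZE) {
            throw new IOException("Received packet with invalid length: " + length);
        }
        byte[] msg = new byte[length];
        din.readFully(msg, 0, length); // blocks until all bytes have arrived
        return msg;
    }

    /** Convenience helper: frames the payload in memory and reads it back. */
    static byte[] roundTrip(byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            writeFrame(new DataOutputStream(bos), payload);
            return readFrame(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip("vote".getBytes());
        System.out.println(new String(back));
    }
}
```

The length check on the receiving side matters because the length comes from the network: without it, a corrupt or hostile prefix could trigger a very large allocation.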

public void addToRecvQueue(Message msg) {
    synchronized(recvQLock) {
        // ...
        try {
            // Put the received message into the receive queue
            recvQueue.add(msg);
        } catch (IllegalStateException ie) {
            // This should never happen
            LOG.error("Unable to insert element in the recvQueue " + ie);
        }
    }
}

In FastLeaderElection.java, find the WorkerReceiver thread and look at its run() method

public void run() {

    Message response;
    while (!stop) {
        // Take an election vote message (sent by another server) from the RecvQueue
        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
    }
}

6.2.5 Follower and Leader state synchronization source code

Overview of the synchronization flow

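After the election ends, every node must update its state according to its role. The elected Leader sets its state to Leader, and the other nodes set theirs to Follower.
- Entry point for the Leader's state update: leader.lead()
- Entry point for the Follower's state update: follower.followLeader()
Notes:
- The follower must let the leader know its state: epoch, zxid, sid
  - It must find out who the leader is
  - It initiates a connection to the leader
  - It sends its own information to the leader
  - When the leader receives that information, it must return the corresponding information to the follower
- Once the leader knows the follower's state, it decides which kind of data synchronization is needed: DIFF, TRUNC, or SNAP
- The data synchronization is performed
- After the leader has received acks from more than half of the followers, it enters its normal working state and cluster startup is complete
Summary of the synchronization modes
- DIFF: the follower and the leader hold the same data, so nothing needs to be done
- TRUNC: the follower's zxid is larger than the leader's zxid, so the follower must roll back
- COMMIT: the leader's zxid is larger than the follower's zxid, so the leader sends Proposals to the follower to be committed and applied
- If the follower has no data at all, SNAP is used to synchronize (the full data set is serialized and sent to the follower)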
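
The four-case summary above can be written down as a small decision function. This sketch mirrors only the simplified summary in these notes. It is not the logic of LearnerHandler.syncFollower(), which takes more state into account. The class `SyncModeSketch`, the enum `Mode`, and the convention that a zxid of 0 means "no data" are assumptions.

```java
final class SyncModeSketch {
    enum Mode { DIFF, TRUNC, COMMIT, SNAP }

    /**
     * Picks a synchronization mode from the two zxids, following the summary:
     * equal -> DIFF, follower ahead -> TRUNC, follower empty -> SNAP,
     * follower behind -> COMMIT (leader sends the missing proposals).
     */
    static Mode choose(long followerZxid, long leaderZxid) {
        if (followerZxid == leaderZxid) {
            return Mode.DIFF;
        }
        if (followerZxid > leaderZxid) {
            return Mode.TRUNC;
        }
        // Assumption of this sketch: zxid 0 means the follower holds no data yet.
        if (followerZxid == 0) {
            return Mode.SNAP;
        }
        return Mode.COMMIT;
    }

    public static void main(String[] args) {
        System.out.println(choose(0x100000005L, 0x100000005L));
        System.out.println(choose(0x100000007L, 0x100000005L));
        System.out.println(choose(0x100000003L, 0x100000005L));
        System.out.println(choose(0L, 0x100000005L));
    }
}
```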

[Figure: synchronization flow diagram]

Source code in detail

In the run() method of QuorumPeer.java:

case FOLLOWING:
    follower.followLeader();
    break;
case LEADING:
    leader.lead();
    break;

Leader.lead() waits for state synchronization requests from followers; see the lead() method in Leader.java

void lead() throws IOException, InterruptedException {
    //...
    try {
        self.tick.set(0);
        // Restore the data into memory; it was in fact already loaded at startup
        zk.loadData();
        leaderStateSummary = new StateSummary(self.getCurrentEpoch(),
                                              zk.getLastProcessedZxid());
        // Start thread that waits for connection requests from
        // new followers.
        // Wait for the follower nodes to send their sync state to the leader
        cnxAcceptor = new LearnerCnxAcceptor();
        cnxAcceptor.start();
        long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
        … …
    } finally {
        zk.unregisterJMX(this);
    }
}

cnxAcceptor.start() starts the thread; see the run() method of LearnerCnxAcceptor

@Override
public void run() {
    try {
        while (!stop) {
            Socket s = null;
            boolean error = false;
            try {
                // Wait for a state synchronization request from a follower
                s = ss.accept();

                // start with the initLimit, once the ack is processed
                // in LearnerHandler switch to the syncLimit
                s.setSoTimeout(self.tickTime * self.initLimit);
                s.setTcpNoDelay(nodelay);

                BufferedInputStream is = new BufferedInputStream(s.getInputStream());
                // As soon as a follower's request arrives, create a LearnerHandler to process it
                LearnerHandler fh = new LearnerHandler(s, is, Leader.this);
                // Start the handler thread
                fh.start();
            } catch (SocketException e) {
                //...
            } finally {
               //...
            }
        }
    } // ...
}

The socket ss is created when the Leader object is constructed

Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
    //...
    if (self.getQuorumListenOnAllIPs()) {
        ss = new ServerSocket(self.getQuorumAddress().getPort());
    } else {
        ss = new ServerSocket();
    }
    //...
}

Follower.followLeader() finds the Leader and connects to it

void followLeader() throws InterruptedException {
    // ...
    // Find the leader
    QuorumServer leaderServer = findLeader();
    try {
        // Connect to the leader
        connectToLeader(leaderServer.addr, leaderServer.hostname);
        // Register with the leader
        long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
        //...       
        QuorumPacket qp = new QuorumPacket();
        while (this.isRunning()) {
            // Read a packet
            readPacket(qp);
            // Process the packet
            processPacket(qp);
        }
        //...
    }
}

protected QuorumServer findLeader() {
    QuorumServer leaderServer = null;
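    // Get the sid of the leader last proposed, as recorded during the election
    Vote current = self.getCurrentVote();
    // Check whether that sid belongs to one of the configured servers
    for (QuorumServer s : self.getView().values()) {
        if (s.id == current.getId()) {
            // Re-resolve the address so that we connect to the leader's correct IP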
            s.recreateSocketAddresses();
            leaderServer = s;
            break;
        }
    }
    if (leaderServer == null) {
        LOG.warn("Couldn't find the leader with id = "
                 + current.getId());
    }
    return leaderServer;
}

 protected void connectToLeader(InetSocketAddress addr, String hostname)
            throws IOException, InterruptedException, X509Exception {
      // Connect
      sockConnect(sock, addr, Math.min(self.tickTime * self.syncLimit, remainingInitLimitTime));
 }

Leader.lead() starts the acceptor thread with cnxAcceptor.start(), and that thread creates a LearnerHandler for each connection; see the run() method of LearnerHandler

public void run() {
    try {
        leader.addLearnerHandler(this);
        // Heartbeat handling: compute the tick by which the next ack is due
        tickOfNextAckDeadline = leader.self.tick.get()
            + leader.self.initLimit + leader.self.syncLimit;
        ia = BinaryInputArchive.getArchive(bufferedInput);
        bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
        oa = BinaryOutputArchive.getArchive(bufferedOutput);
        // Receive a message from the network and deserialize it into a packet
        QuorumPacket qp = new QuorumPacket();
        ia.readRecord(qp, "packet");
        // After the election, observers and followers must each send the leader an identifying packet: FOLLOWERINFO or OBSERVERINFO
        if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() !=
           Leader.OBSERVERINFO){
            LOG.error("First packet " + qp.toString()
                      + " is not FOLLOWERINFO or OBSERVERINFO!");
            return;
        }
        byte learnerInfoData[] = qp.getData();
        //....
        // Read the lastAcceptedEpoch sent by the Follower
        // The epoch used during the election is still the previous leader's epoch
        long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
        long peerLastZxid;
        StateSummary ss = null;
        // Read the zxid sent by the follower
        long zxid = qp.getZxid();
        // The Leader builds the new epoch from the sid and the old epoch obtained from the Follower
        long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
        long newLeaderZxid = ZxidUtils.makeZxid(newEpoch, 0);
        if (this.getVersion() < 0x10000) {
            // we are going to have to extrapolate the epoch information
            long epoch = ZxidUtils.getEpochFromZxid(zxid);
            ss = new StateSummary(epoch, zxid);
            // fake the message
            leader.waitForEpochAck(this.getSid(), ss);
        } else {
            byte ver[] = new byte[4];
            ByteBuffer.wrap(ver).putInt(0x10000);
            // The Leader sends a packet to the Follower (containing the zxid and newEpoch)
            QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO,
                                                           newLeaderZxid, ver, null);
            oa.writeRecord(newEpochPacket, "packet");
            bufferedOutput.flush();
            QuorumPacket ackEpochPacket = new QuorumPacket();
            // Receive the ACKEPOCH reply from the follower
            ia.readRecord(ackEpochPacket, "packet");
            if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
                LOG.error(ackEpochPacket.toString()
                          + " is not ACKEPOCH");
                return;
            }
            ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
            // Holds the state of the peer follower or observer: epoch and zxid
            ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
            // Check that the follower's acknowledgment has been received
            leader.waitForEpochAck(this.getSid(), ss);
        }
        peerLastZxid = ss.getLastZxid();

        // Work out how the Leader and Follower need to synchronize; the result says whether a snapshot is required
        boolean needSnap = syncFollower(peerLastZxid, leader.zk.getZKDatabase(),
                                        leader);

        /* if we are not truncating or sending a diff just send a snapshot */
        if (needSnap) {
            boolean exemptFromThrottle = getLearnerType() != LearnerType.OBSERVER;
            LearnerSnapshot snapshot =
                leader.getLearnerSnapshotThrottler().beginSnapshot(exemptFromThrottle);
            try {
                long zxidToSend =
                    leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
                oa.writeRecord(new QuorumPacket(Leader.SNAP, zxidToSend, null,
                                                null), "packet");
                bufferedOutput.flush();
                LOG.info("Sending snapshot last zxid of peer is 0x{}, zxid of leader is 0x{}, "
                         + "send zxid of db as 0x{}, {} concurrent snapshots, "
                         + "snapshot was {} from throttle",
                         Long.toHexString(peerLastZxid),
                         Long.toHexString(leaderLastZxid),
                         Long.toHexString(zxidToSend),
                         snapshot.getConcurrentSnapshotNumber(),
                         snapshot.isEssential() ? "exempt" : "not exempt");
                // Dump data to peer
                leader.zk.getZKDatabase().serializeSnapshot(oa);
                oa.writeString("BenWasHere", "signature");
                bufferedOutput.flush();
            } finally {
                snapshot.close();
            }
        }
        // ...
    } catch (IOException e) {
        ... ...
    } finally {
        ... ...
    }
}
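
The handler above relies on `ZxidUtils.makeZxid(newEpoch, 0)` and `ZxidUtils.getEpochFromZxid(...)`. A zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a counter within that epoch. The sketch below shows that layout with hypothetical names (`ZxidSketch`, `epochOf`, `counterOf`); it is an illustration of the bit packing, not a copy of the library class.

```java
final class ZxidSketch {
    /** Packs an epoch (high 32 bits) and a counter (low 32 bits) into one zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Extracts the epoch from the high 32 bits. */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the per-epoch counter from the low 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 7);
        System.out.println("zxid=0x" + Long.toHexString(zxid)
                + " epoch=" + epochOf(zxid) + " counter=" + counterOf(zxid));
    }
}
```

Because the epoch sits in the high bits, comparing two zxids as plain numbers already orders them by epoch first and counter second, which is why a new leader starts its epoch with counter 0.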

Follower.followLeader() calls registerWithLeader()

protected long registerWithLeader(int pktType) throws IOException{
    /*
 * Send follower info, including last zxid and sid
 */
    long lastLoggedZxid = self.getLastLoggedZxid();
    QuorumPacket qp = new QuorumPacket();
    qp.setType(pktType);
    qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));

    /*
 * Add sid to payload
 */
    LearnerInfo li = new LearnerInfo(self.getId(), 0x10000,
                                     self.getQuorumVerifier().getVersion());
    ByteArrayOutputStream bsid = new ByteArrayOutputStream();
    BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
    boa.writeRecord(li, "LearnerInfo");
    qp.setData(bsid.toByteArray());

    // Send FOLLOWERINFO to the Leader
    writePacket(qp, true);
    // Read the Leader's reply: LEADERINFO
    readPacket(qp);

    final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
    // If LEADERINFO was received
    if (qp.getType() == Leader.LEADERINFO) {
        // we are connected to a 1.0 server so accept the new epoch and read the next packet
        leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
        byte epochBytes[] = new byte[4];
        final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
        // Accept the leader's epoch
        if (newEpoch > self.getAcceptedEpoch()) {
            // Store my own previous epoch in wrappedEpochBytes
            wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
            // Save the epoch sent by the leader
            self.setAcceptedEpoch(newEpoch);
        } else if (newEpoch == self.getAcceptedEpoch()) {
            // since we have already acked an epoch equal to the leaders, we cannot ack
            // again, but we still need to send our lastZxid to the leader so that we can
            // sync with it if it does assume leadership of the epoch.
            // the -1 indicates that this reply should not count as an ack for the new epoch
            wrappedEpochBytes.putInt(-1);
        } else {
            throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
        }
        // Send ACKEPOCH to the leader (carrying my own epoch and zxid)
        QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH,
                                                    lastLoggedZxid, epochBytes, null);
        writePacket(ackNewEpoch, true);
        return ZxidUtils.makeZxid(newEpoch, 0);
    } else {
        if (newEpoch > self.getAcceptedEpoch()) {
            self.setAcceptedEpoch(newEpoch);
        }
        if (qp.getType() != Leader.NEWLEADER) {
            LOG.error("First packet should have been NEWLEADER");
            throw new IOException("First packet should have been NEWLEADER");
        }
        return qp.getZxid();
    }
}
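
The LEADERINFO branch of registerWithLeader() makes a three-way decision on the proposed epoch: acknowledge a newer epoch with the follower's current epoch, answer -1 for an epoch it has already accepted, and fail on an older one. The sketch below isolates that decision. The class `EpochAckSketch` and the method `ackEpochPayload` are hypothetical, and an `IllegalStateException` stands in for the `IOException` thrown by the original.

```java
final class EpochAckSketch {
    /**
     * Returns the int the follower would put into its ACKEPOCH payload:
     * its current epoch when the leader's epoch is newer, or -1 when the
     * epoch was already accepted (so the reply must not count as a new ack).
     */
    static int ackEpochPayload(long newEpoch, long acceptedEpoch, long currentEpoch) {
        if (newEpoch > acceptedEpoch) {
            return (int) currentEpoch;
        }
        if (newEpoch == acceptedEpoch) {
            return -1;
        }
        // Stand-in for the IOException in the original code.
        throw new IllegalStateException("Leader's epoch " + newEpoch
                + " is less than accepted epoch " + acceptedEpoch);
    }

    public static void main(String[] args) {
        System.out.println(ackEpochPayload(6, 5, 5));
        System.out.println(ackEpochPayload(5, 5, 5));
    }
}
```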