An In-Depth Analysis of ZooKeeper Leader Election

八五年的湘哥

Last modified: 2023-11-02 16:36:10

Category: Zookeeper. Tags: zookeeper, java, distributed systems

First published: 2021-02-22 16:41:00

Article link: https://blog.csdn.net/huxiang19851114/article/details/113943209

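1. ZooKeeper cluster roles

Leader: the leader is the core of a ZooKeeper ensemble. It has two main jobs:

1. It is the only server that schedules and processes transaction (write) requests, which guarantees that transactions are applied in order across the cluster.

2. It coordinates the other servers inside the cluster.

Follower: a follower's main responsibilities are:

1. Serving clients' non-transactional requests and forwarding transactional requests to the leader.

2. Voting on the transaction Proposals issued by the leader. The leader may commit a Proposal only after more than half of the voting servers have acknowledged it.

3. Voting in leader elections.

Observer: observers are a server role introduced in ZooKeeper 3.3. As the name suggests, an observer watches the latest state changes in the ensemble and replicates them locally. It works much like a follower; the only difference is that an observer takes part in no voting of any kind, neither on transaction Proposals nor in leader elections. In short, an observer serves only non-transactional requests itself (writes are still forwarded to the leader), and it is typically used to increase the cluster's read capacity without hurting its transaction throughput.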

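2. The ZAB protocol

In the previous chapter we tested how a ZooKeeper cluster behaves during elections. All of that behavior comes from the ZAB protocol.

ZAB (ZooKeeper Atomic Broadcast) is an atomic broadcast protocol with crash recovery, designed specifically for the distributed coordination service ZooKeeper. ZooKeeper relies on ZAB for distributed data consistency; on top of it, ZooKeeper implements a primary-backup architecture that keeps the replicas in the cluster consistent. ZAB has two basic modes:

Atomic broadcast: broadcasting a message is essentially a simplified two-phase commit.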

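1. When a follower receives a transactional request, it forwards the request to the leader (the leader can of course also receive such requests directly).
2. The leader assigns the request a globally unique, monotonically increasing id called the zxid. Comparing zxids gives a global order.
3. The leader keeps a FIFO queue for each follower (carried over TCP, which preserves the order on each connection) and sends the message, tagged with its zxid, to every follower as a proposal.
4. A follower that receives a proposal first writes it to disk and, once the write succeeds, replies to the leader with an ACK.
5. Once the leader has ACKs from a quorum (more than half of the voting nodes), it sends a COMMIT to the followers and applies the message locally.
6. A follower commits the message when it receives the COMMIT.

Unlike classic two-phase commit, this simplified version has no abort or rollback step. A follower that does not respond within syncLimit ticks is dropped by the leader, and if the leader can no longer reach a quorum of followers it gives up leadership and the ensemble goes back to leader election. Proposals that never committed are then resolved during crash recovery, described next.

The leader's vote does not need ACKs from Observers, so Observers take no part in it. An Observer must still synchronize the leader's data so that the requests it serves see consistent data.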
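The quorum rule in step 5 can be sketched in a few lines of Java. This is only an illustration of "commit once more than half of the voting members have ACKed"; the class ProposalAckTracker and its methods are made up for this example and are not ZooKeeper's classes. It also assumes the leader counts its own log write as one ACK.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch, not ZooKeeper code: tracks the ACKs for one proposal.
final class ProposalAckTracker {
    private final long zxid;
    private final int votingMembers;            // leader + followers; observers are excluded
    private final Set<Long> acks = new HashSet<>();
    private boolean committed;

    ProposalAckTracker(long zxid, int votingMembers, long leaderSid) {
        this.zxid = zxid;
        this.votingMembers = votingMembers;
        acks.add(leaderSid);                    // assumption: the leader's own write counts as an ACK
    }

    /** Records an ACK and returns true exactly once, when the quorum is first reached. */
    boolean onAck(long sid) {
        acks.add(sid);                          // a Set ignores duplicate ACKs from the same server
        if (!committed && acks.size() > votingMembers / 2) {
            committed = true;
            return true;                        // the caller would now broadcast COMMIT
        }
        return false;
    }

    boolean isCommitted() { return committed; }

    long zxid() { return zxid; }

    public static void main(String[] args) {
        ProposalAckTracker p = new ProposalAckTracker(0x100000001L, 5, 1L);
        System.out.println("after ACK from 2: commit=" + p.onAck(2L));
        System.out.println("after ACK from 3: commit=" + p.onAck(3L));
    }
}
```

With five voting members, the leader plus two followers already form a quorum, so the slowest two followers never delay a commit.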

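Crash recovery: when the cluster starts up, or when the leader suffers a network outage or a crash, ZAB enters recovery mode and elects a new leader. ZAB leaves recovery mode once a leader has been elected and more than half of the machines have finished synchronizing with it (synchronization here means data synchronization, which ensures that a majority of machines hold the same data state as the leader). From that point on the cluster is in broadcast mode. If a new server joins while the leader is working normally, it goes straight into recovery mode and synchronizes its data with the leader; after that it can serve non-transactional requests.

Crash recovery has to guarantee two things:

1. A message that has been processed must not be lost. Once the leader has ACKs from a quorum of followers, it broadcasts COMMIT to the followers, applies the commit locally and returns success to the connected client. If the leader crashes before every follower has received the COMMIT, some servers never apply the message. For example, the leader starts committing a transaction and follower1 applies it, but the leader dies before follower2 receives the COMMIT, while the client has already been told that the transaction succeeded. ZAB therefore has to make sure that every machine eventually applies this transaction.

2. A message that has been discarded must not reappear. Suppose the leader crashes right after turning a request into a proposal, before any follower has received it. After recovery mode elects a new leader, that message has been skipped. When the old leader restarts and registers as a follower, it still holds the skipped proposal, which is inconsistent with the state of the rest of the system and has to be deleted.

To satisfy both requirements, ZAB needs a leader election algorithm which guarantees that Proposals already committed by the leader get committed everywhere, and that Proposals which were skipped are discarded.

Solution to the first problem: if the election algorithm guarantees that the newly elected leader holds the transaction Proposal with the highest zxid of any machine in the cluster, the new leader is certain to have every committed proposal. A proposal can only be committed after more than half of the servers have ACKed it, that is, after it is in the transaction log of more than half of the servers. So as long as a quorum of nodes is working, at least one of them holds every committed proposal.

Solution to the second problem: the zxid is 64 bits wide. The high 32 bits are the epoch; every time an election produces a new leader, the new leader increments the epoch by 1. The low 32 bits are a message counter that is incremented for each message and reset to 0 after a new leader is elected. With this design, an old leader that restarts after a crash will not be elected leader, because its zxid is necessarily smaller than the new leader's. When the old leader joins the new leader as a follower, the new leader makes it discard every proposal that carries an old epoch and was never committed.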

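3. What is a ZXID

The zxid is the transaction id. To keep transactions consistently ordered, ZooKeeper tags every transaction with a monotonically increasing zxid, and every proposal gets its zxid at the moment it is issued. In the implementation a zxid is a 64-bit number. The high 32 bits are the epoch, which ZAB uses to tell leader periods apart: each newly elected leader gets a new epoch (the previous epoch + 1) that identifies its reign. The low 32 bits are an incrementing counter.

epoch: think of it as the era the cluster is currently in. Each leader is like an emperor with his own era name, so at every change of dynasty, that is, every change of leader, the era number goes up by 1. Even if an old leader recovers from a crash, nobody listens to him any more, because followers only obey the leader of the current era.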
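The 64-bit layout is easy to check with a few bit operations. The following is a minimal sketch of the idea; the class ZxidSketch and its method names are made up for this example (ZooKeeper has its own helper, ZxidUtils, which appears in the source walkthrough later).

```java
// Illustrative sketch of the zxid layout: high 32 bits = epoch, low 32 bits = counter.
final class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long lastOfOldLeader = makeZxid(1, 12);  // epoch 1, 12th transaction
        long firstOfNewLeader = makeZxid(2, 0);  // epoch 2, counter reset to 0
        System.out.println("old: 0x" + Long.toHexString(lastOfOldLeader));
        System.out.println("new: 0x" + Long.toHexString(firstOfNewLeader));
        // Any zxid from a newer epoch is larger than every zxid from an older one.
        System.out.println("new > old: " + (firstOfNewLeader > lastOfOldLeader));
    }
}
```

Because the epoch sits in the high bits, a plain numeric comparison of two zxids compares the epoch first and the counter second, which is exactly the ordering the recovery rules above rely on.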

In the first era, before the latest transaction has reached every node (each zxid is written as epoch + counter):

    •leader: 00000000000000000000000000000001 + 00000000000000000000000000000012
    •follower1: 00000000000000000000000000000001 + 00000000000000000000000000000012 // transaction synchronized and committed
    •follower2: 00000000000000000000000000000001 + 00000000000000000000000000000011 // transaction not yet synchronized

The leader crashes and a new election is held:

    •leader(dead): 00000000000000000000000000000001 + 00000000000000000000000000000012
    •leader(new): 00000000000000000000000000000002 + 00000000000000000000000000000001
    •follower2: 00000000000000000000000000000002 + 00000000000000000000000000000000

leader(dead) restarts. Because the epoch (the high 32 bits) in the cluster has already changed, it can only rejoin the cluster as a follower, shown here as follower3:

    •follower3: 00000000000000000000000000000002 + 00000000000000000000000000000000
    •leader(new): 00000000000000000000000000000002 + 00000000000000000000000000000002
    •follower2: 00000000000000000000000000000002 + 00000000000000000000000000000001

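You can watch the epoch change with a simple experiment:

1. Start a ZooKeeper cluster.
2. In the version-2 directory under the dataDir configured for ZooKeeper, find the currentEpoch file. It contains the current epoch (era).
3. Stop the leader node and look at currentEpoch again. The value changes every time a new leader is elected.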
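If you prefer to check the file from code, a small reader is enough. This sketch assumes the file holds the epoch as a decimal number on its first line, which matches how the readLongFromFile call shown later in this article is used; the class EpochFileSketch and the default path in main are placeholders for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: reads <dataDir>/version-2/<fileName> as a decimal long.
final class EpochFileSketch {
    static long readEpoch(Path dataDir, String fileName) throws IOException {
        Path file = dataDir.resolve("version-2").resolve(fileName);
        // Assumption: the first line of the file is the epoch in decimal text.
        String firstLine = Files.readAllLines(file, StandardCharsets.UTF_8).get(0).trim();
        return Long.parseLong(firstLine);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default; pass your own dataDir as the first argument.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "/tmp/zookeeper");
        System.out.println("currentEpoch = " + readEpoch(dataDir, "currentEpoch"));
    }
}
```

Running it before and after stopping the leader should show the value going up, as long as the node you inspect has taken part in the new election.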

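4. Leader election

4.1 Election at cluster startup

During cluster initialization, when the first server, Server1, starts, it cannot carry out a leader election on its own. When the second server, Server2, starts, the two machines can talk to each other; each tries to find a leader, and the election begins. It proceeds as follows:

(1) Each server casts a vote. Because this is the initial state, Server1 and Server2 each vote for themselves. A vote carries the sid and zxid of the proposed server and is written (myid, zxid). Server1's vote is (1, 0) and Server2's vote is (2, 0); each sends its vote to the other machines in the cluster.

(2) Each server receives the votes of the others. On receiving a vote, a server first checks that it is valid, for example that it belongs to the current round (epoch) and that it comes from a server in the LOOKING state.

(3) Each server processes the votes. It compares every incoming vote with its own according to these rules:

i. The zxid is checked first. The server with the larger zxid is preferred as leader.

ii. If the zxids are equal, the myid is compared. The server with the larger myid becomes leader.

For Server1, its own vote is (1, 0) and it receives (2, 0) from Server2. The zxids are both 0, so the myids are compared. Server2's myid is larger, so Server1 changes its vote to (2, 0) and votes again. Server2 does not need to change its vote; it simply sends its previous vote to all machines again.

(4) Votes are counted. After each round, a server counts the votes to see whether more than half of the machines have accepted the same vote. Server1 and Server2 both find that two machines have accepted (2, 0), so the leader is considered elected.

(5) Server states change. Once the leader is determined, each server updates its state: a follower switches to FOLLOWING and the leader switches to LEADING.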
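Steps (3) and (4) can be replayed with a small simulation. This is a simplified sketch that follows only the two rules stated above (zxid first, then myid) and deliberately ignores the epoch; SimpleVote and its methods are hypothetical names for this example, not classes from ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the vote comparison and the majority count.
final class SimpleVote {
    final long leaderId;  // myid of the proposed leader
    final long zxid;      // zxid of the proposed leader

    SimpleVote(long leaderId, long zxid) {
        this.leaderId = leaderId;
        this.zxid = zxid;
    }

    /** True if this vote beats the other one: larger zxid wins, a tie goes to the larger myid. */
    boolean beats(SimpleVote other) {
        if (zxid != other.zxid) {
            return zxid > other.zxid;
        }
        return leaderId > other.leaderId;
    }

    /** True if more than half of ensembleSize servers currently vote for leaderId. */
    static boolean hasQuorum(Map<Long, SimpleVote> votesBySender, long leaderId, int ensembleSize) {
        long count = votesBySender.values().stream()
                .filter(v -> v.leaderId == leaderId)
                .count();
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        SimpleVote s1 = new SimpleVote(1, 0);
        SimpleVote s2 = new SimpleVote(2, 0);
        Map<Long, SimpleVote> votes = new HashMap<>();
        votes.put(1L, s2.beats(s1) ? s2 : s1);  // Server1 switches to (2, 0)
        votes.put(2L, s2);                      // Server2 keeps its own vote
        // Assumption for the demo: a three-server ensemble with Server3 not yet started.
        System.out.println("Server2 elected: " + hasQuorum(votes, 2L, 3));
    }
}
```

The same comparison explains why a server with a higher zxid wins regardless of its myid: the myid only matters as a tie-breaker.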

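4.2 Election while the cluster is running

When the leader crashes or becomes unavailable, the whole cluster stops serving requests and enters a new round of leader election. The election at runtime follows basically the same process as the election at startup.

(1) State change. After the leader dies, the remaining non-Observer servers switch their state to LOOKING and enter the election.

(2) Each server casts a vote. At runtime the zxids may differ from server to server. Suppose Server1's zxid is 123 and Server3's zxid is 122. In the first round Server1 and Server3 both vote for themselves, producing the votes (1, 123) and (3, 122), and each sends its vote to all machines in the cluster.

(3) Votes are received. Same as at startup.

(4) Votes are processed. Same as at startup.

(5) Votes are counted. Same as at startup. Server1 has the larger zxid, so it becomes the leader.

(6) Server states change. Same as at startup.

Additional concepts:

acceptedEpoch: the epoch of the latest newEpoch proposal from a leader that this follower has accepted;
currentEpoch: the leader era (epoch) the server is currently in.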

5. Leader election source code analysis

5.1 Building the source

You need the Ant build tool first. Download it from http://ant.apache.org/bindownload.cgi

Set up its environment variables the same way you would for the JDK.

From the directory of the unpacked ZooKeeper project that contains build.xml, run ant eclipse to generate an Eclipse project (this is a bit slow).

When the build finishes, import the project into IDEA. It has to be imported as an Eclipse project.

Then just keep clicking Next through the import wizard.

5.2 Starting ZooKeeper from source

Because the whole cluster is started from source on one machine, the servers need different client ports: 2181, 2182 and 2183.

Rename zoo_sample.cfg to zoo1.cfg, pass its full path as the program argument in the run configuration, and set the dataDir path in it.

Then run the QuorumPeerMain.main method.

The other two servers are set up the same way; apart from the ports (and each server's own dataDir and myid) nothing differs.

Point the ZooKeeper address in our earlier test example at your local cluster. If it runs normally, the setup works.
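For reference, a configuration along these lines should be enough for the first server. It is only a sketch: the dataDir path and the quorum/election ports in the server.N lines are assumptions chosen for this example, since the original screenshots are not available. Only the client port 2181 comes from the text above.

```properties
# zoo1.cfg (sketch; paths and quorum/election ports are placeholders)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper/data1
clientPort=2181
# server.<myid>=<host>:<quorum port>:<election port>
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
```

Each server also needs a file named myid in its dataDir that contains just its id (1 for this file). zoo2.cfg and zoo3.cfg would differ only in dataDir and clientPort (2182, 2183).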

5.3 Reading the source

5.3.1 The main method of the startup class

With the theory in place, let's read the source (zookeeper-3.4.14) and see how it is implemented. The entry point is the main method of QuorumPeerMain, the class that starts a ZooKeeper server:

    /**
     * Service entry point.
     * To start the replicated server specify the configuration file name on
     * the command line.
     * @param args path to the configfile
     */
    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            main.initializeAndRun(args); // this is the call to follow
        } catch (IllegalArgumentException e) {
            LOG.error("Invalid arguments, exiting abnormally", e);
            LOG.info(USAGE);
            System.err.println(USAGE);
            System.exit(2);
        } catch (ConfigException e) {
            LOG.error("Invalid config, exiting abnormally", e);
            System.err.println("Invalid config, exiting abnormally");
            System.exit(2);
        } catch (Exception e) {
            LOG.error("Unexpected exception, exiting abnormally", e);
            System.exit(1);
        }
        LOG.info("Exiting normally");
        System.exit(0);
    }

5.3.2 Parsing the configuration file

Stepping into main.initializeAndRun(args):

protected void initializeAndRun(String[] args)
        throws ConfigException, IOException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            // parse the configuration file
            config.parse(args[0]);
        }
        // Start and schedule the purge task: a background timer that
        // asynchronously removes old snapshots and transaction logs
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
                .getDataDir(), config.getDataLogDir(), config
                .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();
        // decide between cluster (replicated) mode and standalone mode
        if (args.length == 1 && config.servers.size() > 0) {
            // cluster entry point
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            // standalone entry point
            ZooKeeperServerMain.main(args);
        }
    }

5.3.3 Initializing the quorum peer (QuorumPeer)

Stepping into runFromConfig():

public void runFromConfig(QuorumPeerConfig config) throws IOException {
      try {
          // register the log4j MBeans; safe to ignore here
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }
  
      LOG.info("Starting quorum peer");
      try {
          // create the connection factory (NIOServerCnxnFactory by default)
          ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
          cnxnFactory.configure(config.getClientPortAddress(),
                                config.getMaxClientCnxns());
          // the main logic thread that votes and elects (instantiates the quorum peer)
          quorumPeer = getQuorumPeer();
          // a series of configuration calls
          quorumPeer.setQuorumPeers(config.getServers());
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                  new File(config.getDataLogDir()),
                  new File(config.getDataDir())));
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId()); // set myid
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
          // address of the client port (2181 here) on which this server serves clients
          quorumPeer.setClientPortAddress(config.getClientPortAddress());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }

          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          // one-time initialization
          quorumPeer.initialize();
          // start the main thread; QuorumPeer overrides Thread.start
          quorumPeer.start();
          quorumPeer.join(); // Thread.join: block the calling thread until the QuorumPeer thread exits
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

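5.3.4 Starting the quorum peer thread

The start method of QuorumPeer:

@Override
public synchronized void start() {
        // load the local database; the main point here is recovering the epochs
        loadDataBase();
        // start the client connection factory
        cnxnFactory.start();
        // prepare leader election: build the initial vote and the election algorithm
        startLeaderElection();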
        super.start();
}
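The pattern used here, overriding Thread.start() to do setup work in the calling thread before the new thread begins run(), can be tried in isolation. This is a generic Java sketch; PeerThreadSketch and the event names are invented for this example and only mimic the order of the calls above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: setup runs in the caller's thread, run() in the new thread.
final class PeerThreadSketch extends Thread {
    final List<String> events = Collections.synchronizedList(new ArrayList<String>());

    @Override
    public synchronized void start() {
        events.add("loadDataBase");          // stands in for restoring local state
        events.add("startLeaderElection");   // stands in for preparing the election
        super.start();                       // only now does run() start on the new thread
    }

    @Override
    public void run() {
        events.add("run");                   // stands in for the main loop
    }

    public static void main(String[] args) throws InterruptedException {
        PeerThreadSketch peer = new PeerThreadSketch();
        peer.start();
        peer.join();                         // wait for the peer thread, as runFromConfig does
        System.out.println(peer.events);
    }
}
```

The point of the design is ordering: by the time run() executes, the database has been loaded and the election algorithm exists.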

5.3.5 Loading the local transaction log and snapshots

loadDataBase() mainly restores data from local files and obtains the latest zxid.

private void loadDataBase() {
        File updating = new File(getTxnFactory().getSnapDir(),
                                 UPDATING_EPOCH_FILENAME);
        try { // load the local data
            zkDb.loadDataBase();
            // load the epochs: start from the zxid of the last transaction applied to the data tree
            long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
            // the high 32 bits of a zxid are the epoch, the low 32 bits the counter; extract the epoch
            long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
            try { // read the current epoch from <dataDir>/version-2/currentEpoch
                currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
                // the epoch in the zxid is newer than the one in the file and the updating
                // marker file still exists, i.e. an earlier epoch update was interrupted
                if (epochOfZxid > currentEpoch && updating.exists()) {
                    setCurrentEpoch(epochOfZxid); // adopt the larger epoch
                    if (!updating.delete()) {
                        throw new IOException("Failed to delete " +
                                              updating.toString());
                    }
                }
            } 
            // ........
            // if the epoch in the zxid is still larger, the state is inconsistent: fail
            if (epochOfZxid > currentEpoch) {
                throw new IOException("The current epoch, " + ZxidUtils.zxidToString(currentEpoch) + ", is older than the last zxid, " + lastProcessedZxid);
            }
            try { // then check acceptedEpoch
                acceptedEpoch = readLongFromFile(ACCEPTED_EPOCH_FILENAME);
            }
            // ........
            if (acceptedEpoch < currentEpoch) {
                throw new IOException("The accepted epoch, " + ZxidUtils.zxidToString(acceptedEpoch) + " is less than the current epoch, " + ZxidUtils.zxidToString(currentEpoch));
            }
            // .......
}
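Stripped of file handling, the two checks in loadDataBase() come down to one ordering: the epoch inside the last zxid must not exceed currentEpoch, and currentEpoch must not exceed acceptedEpoch. A minimal sketch of that invariant follows; the class EpochCheckSketch is hypothetical, and the sketch leaves out the recovery path for an interrupted epoch update.

```java
// Illustrative sketch of the invariant: epochOf(lastZxid) <= currentEpoch <= acceptedEpoch.
final class EpochCheckSketch {
    static boolean isConsistent(long lastProcessedZxid, long currentEpoch, long acceptedEpoch) {
        long epochOfZxid = lastProcessedZxid >> 32;   // the high 32 bits hold the epoch
        return epochOfZxid <= currentEpoch && currentEpoch <= acceptedEpoch;
    }

    public static void main(String[] args) {
        long lastZxid = (2L << 32) | 5L;              // epoch 2, counter 5
        System.out.println("healthy node: " + isConsistent(lastZxid, 2, 2));
        System.out.println("stale currentEpoch: " + isConsistent(lastZxid, 1, 2));
    }
}
```

In the real method an inconsistent state raises an IOException, so a server with damaged epoch files fails at startup instead of joining an election with wrong data.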

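5.3.6 Where the zxid comes from

The last processed zxid is recovered from disk: the most recent valid snapshot.* file is deserialized, its zxid is parsed from the file name, and fastForwardFromEdits then replays newer transactions from the transaction log.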

     public long loadDataBase() throws IOException {
        long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
        initialized = true;
        return zxid;
    }

    public long restore(DataTree dt, Map<Long, Integer> sessions, 
            PlayBackListener listener) throws IOException {
        snapLog.deserialize(dt, sessions);
        return fastForwardFromEdits(dt, sessions, listener);
    }

    /**
     * deserialize a data tree from the most recent snapshot;
     * the zxid is derived from that snapshot file
     * @return the zxid of the snapshot
     */
    public long deserialize(DataTree dt, Map<Long, Integer> sessions)
            throws IOException {
        // we run through 100 snapshots (not all of them)
        // if we cannot get it running within 100 snapshots
        // we should  give up
        List<File> snapList = findNValidSnapshots(100);
        if (snapList.size() == 0) {
            return -1L;
        }
        File snap = null;
        boolean foundValid = false;
        for (int i = 0; i < snapList.size(); i++) {
            snap = snapList.get(i);
            InputStream snapIS = null;
            CheckedInputStream crcIn = null;
            try {
                LOG.info("Reading snapshot " + snap);
                snapIS = new BufferedInputStream(new FileInputStream(snap));
                crcIn = new CheckedInputStream(snapIS, new Adler32());
                InputArchive ia = BinaryInputArchive.getArchive(crcIn);
                deserialize(dt,sessions, ia);
                long checkSum = crcIn.getChecksum().getValue();
                long val = ia.readLong("val");
                if (val != checkSum) {
                    throw new IOException("CRC corruption in snapshot :  " + snap);
                }
                foundValid = true;
                break;
            } catch(IOException e) {
                LOG.warn("problem reading snap file " + snap, e);
            } finally {
                if (snapIS != null) 
                    snapIS.close();
                if (crcIn != null) 
                    crcIn.close();
            } 
        }
        if (!foundValid) {
            throw new IOException("Not able to find valid snapshots in " + snapDir);
        }
        dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX); // zxid parsed from the snapshot file name
        return dt.lastProcessedZxid;
    }
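The last step relies on the snapshot file name carrying the zxid as a hexadecimal suffix, for example snapshot.10000000c. The sketch below shows that parsing idea on its own. SnapshotNameSketch is a made-up class, and returning -1 for names that do not match is an assumption of this example, not a statement about Util.getZxidFromName.

```java
// Illustrative sketch: extract the zxid from a file name of the form <prefix>.<hex zxid>.
final class SnapshotNameSketch {
    static long zxidFromName(String name, String prefix) {
        String[] parts = name.split("\\.");
        if (parts.length == 2 && parts[0].equals(prefix)) {
            try {
                return Long.parseLong(parts[1], 16);   // the suffix is hexadecimal
            } catch (NumberFormatException e) {
                // fall through: not a valid hex suffix
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long zxid = zxidFromName("snapshot.10000000c", "snapshot");
        System.out.println("epoch=" + (zxid >> 32) + ", counter=" + (zxid & 0xffffffffL));
    }
}
```

Reading the name this way tells you, without opening the file, which epoch and transaction a snapshot was taken at.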

5.3.7 Initializing the election algorithm

synchronized public void startLeaderElection() {
        try { // build the initial Vote from the three election parameters: myid, zxid and epoch
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        // walk the view to find this server's own quorum address
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        // electionType 0 is the UDP-based LeaderElection, which needs a responder thread
        if (electionType == 0) {
            try { // create the UDP socket
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // create the election algorithm that matches electionType
        this.electionAlg = createElectionAlgorithm(electionType);
    }

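5.3.8 Configuring the election algorithm

createElectionAlgorithm() sets up the election algorithm. There are four of them, selected through electionAlg in zoo.cfg; the default is FastLeaderElection.

protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;
        // pick the election strategy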
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
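        case 3: // QuorumCnxManager handles the network I/O for leader election
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                // start the listener thread bound to the election port and wait for other servers to connect
                listener.start();
                // the TCP-based election algorithm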
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

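5.3.9 Initializing the send and receive queues

Next comes the initialization of FastLeaderElection, which mainly sets up the application-level send queue and receive queue: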

public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
        this.stop = false;
        this.manager = manager;
        starter(self, manager);
}

// ***********************************************

private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;
        // outgoing votes (blocking queue)
        sendqueue = new LinkedBlockingQueue<ToSend>();
        // incoming votes (blocking queue)
        recvqueue = new LinkedBlockingQueue<Notification>();
        this.messenger = new Messenger(manager);
}

The Messenger constructor then creates the sender and receiver threads and starts them.

Messenger(QuorumCnxManager manager) { // votes are sent and received asynchronously
            // create the vote sender thread
            this.ws = new WorkerSender(manager);
            Thread t = new Thread(this.ws,"WorkerSender[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start(); // start it
            // create the vote receiver thread
            this.wr = new WorkerReceiver(manager);
            t = new Thread(this.wr,"WorkerReceiver[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start(); // start it
        }
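The two queues and two daemon threads form a small producer/consumer pipeline that keeps the election logic away from network I/O. The sketch below reproduces only that shape. MessengerSketch is a made-up class, messages are plain strings, and the internal wire queue is a stand-in for the network that does not exist in ZooKeeper.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: sendqueue -> (sender thread) -> wire -> (receiver thread) -> recvqueue.
final class MessengerSketch {
    final BlockingQueue<String> sendqueue = new LinkedBlockingQueue<>();
    final BlockingQueue<String> recvqueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> wire = new LinkedBlockingQueue<>(); // stand-in for the network

    MessengerSketch() {
        startDaemon("WorkerSenderSketch", sendqueue, wire);
        startDaemon("WorkerReceiverSketch", wire, recvqueue);
    }

    // Starts a daemon thread that moves messages from one queue to the other forever.
    private static void startDaemon(String name, BlockingQueue<String> from, BlockingQueue<String> to) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    to.put(from.take());        // blocks while there is nothing to move
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);                      // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        MessengerSketch m = new MessengerSketch();
        m.sendqueue.put("vote(2, 0)");
        System.out.println("received: " + m.recvqueue.poll(1, TimeUnit.SECONDS));
    }
}
```

The election code only ever touches the two queues: it puts votes on sendqueue and polls recvqueue with a timeout, which is the pattern lookForLeader() uses further down.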

5.3.10 Starting the blocking election thread

Back in QuorumPeer.java: once FastLeaderElection has been initialized, start() calls super.start(), which ends up running the run method of QuorumPeer:

public void run() {
        setName("QuorumPeer" + "[myid=" + getId() + "]" +
                cnxnFactory.getLocalAddress());
        //.........................
        // JMX registration omitted; it only exposes attributes for monitoring
        try {
            // Main loop
            while (running) {
                switch (getPeerState()) {
                case LOOKING: // LOOKING state: run an election
                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        // create the ReadOnlyZooKeeperServer, but do not start it yet
                        final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                                logFactory, this,
                                new ZooKeeperServer.BasicDataTreeBuilder(),
                                this.zkDb);
                        // start it from a separate thread so that the election is not held up
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
　　　　　　　　　　　　　　　　　　　// .......　　
                            }
                        };
                        try { // start the helper thread
                            roZkMgr.start();
                            setBCVote(null);
                            // strategy pattern: the configured election algorithm does the work
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        // .........
                    } else {
                        try {
                            setBCVote(null);
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        //........
                    }
                    break;
//****************************************************************************
                case OBSERVING: // this node is an Observer
                    try {
                        LOG.info("OBSERVING");
                        setObserver(makeObserver(logFactory));
                        observer.observeLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e );                        
                    } finally {
                        observer.shutdown();
                        setObserver(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
//*****************************************************************************
                case FOLLOWING: // this node is a follower
                    try {
                        LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        follower.shutdown();
                        setFollower(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
//**********************************************************************
                case LEADING: // this node is the leader
                    LOG.info("LEADING");
                    try {
                        setLeader(makeLeader(logFactory));
                        leader.lead();
                        setLeader(null);
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        if (leader != null) {
                            leader.shutdown("Forcing shutdown");
                            setLeader(null);
                        }
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                }
            }
        }
　　 // ..........
}

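5.3.11 The blocking election process

A freshly started server is in the LOOKING state, so it takes the first branch and calls setCurrentVote(makeLEStrategy().lookForLeader());. With the strategy chosen in the previous step, this runs the election algorithm in FastLeaderElection. Here is lookForLeader():

// start looking for a leader
public Vote lookForLeader() throws InterruptedException {
        // ... some code omitted
        try {
            // votes received in the current election round
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            // votes from servers that are already FOLLOWING or LEADING (out of election)
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet(); // advance the logical clock (election round)
                // initial proposal: vote for ourselves with our own zxid and epoch
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }
            sendNotifications(); // broadcast the vote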

            // Loop in which we exchange notifications until we find a leader
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){ // 主循环  直到选举出leader
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
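                // take the next notification from the receive queue, which is filled by
                // the I/O worker thread (our own vote arrives here as well)
                Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

                // nothing arrived in time: keep sending our vote, i.e. keep the election going
                if(n == null){
                    // everything queued has been delivered: send the vote again
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                    // messages were not delivered, other servers may not be up yet: retry connecting
                        manager.connectAll();
                    }
                    // back off: double the timeout, capped at maxNotificationInterval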
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
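                // a notification arrived: check that the sender is a voting member of this cluster
                else if(self.getVotingView().containsKey(n.sid)) {
                    switch (n.state) { // state of the server that sent the notification
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        // the sender's election epoch is ahead of our logicalclock: a newer round has started
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch); // catch up the local logicalclock
                            recvset.clear(); // discard the votes collected in the old round
                            // compare epoch, zxid and myid in one go to see whether the received vote wins
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, // see the code below
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                // the received vote wins: adopt it as our proposal
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else { // otherwise keep voting for ourselves
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications(); // broadcast again so the other servers learn our current vote
                        // the notification belongs to an older round: ignore it
                        } else if (n.electionEpoch < logicalclock.get()) {
                            break;
                        // same round: compare as above; if the received vote wins, adopt and broadcast it
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        // record the sender's vote; recvset is what the termination check counts
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        // has the election finished? by default: do more than half agree with our proposal (see the code below)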
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // keep polling for finalizeWait ms in case a better vote is still on its way
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }
                            // no better vote arrived: the leader is decided
                            if (n == null) {
                                // update our own state
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                // return the final vote
                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    // the sender is not LOOKING, e.g. this server has just joined a cluster that already has a leader
                    // Observers do not take part in the election
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                   
                    case FOLLOWING:
                    case LEADING:
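                        // is the notification from the same election round?
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                            // check whether the election is over and the leader is valid;
                            // if so, update our own state and return that vote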
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        // Before joining an established cluster, verify that a majority is following the same leader.
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } 
　　　// .......
}

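5.3.12 Election rules explained

In the election loop above, each received vote is compared with the local vote on three elements: the epoch, the zxid and the server id. The comparison is done in totalOrderPredicate: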

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }
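        /*
         * Returns true if one of the following three cases holds:
         * 1 - the epoch in the new vote is higher
         * 2 - the epoch is the same as the current one, but the new zxid is higher
         * 3 - the epoch and the zxid are both the same, but the new server id is higher
         */
        // The rule is straightforward: compare the epoch first; if equal, compare the zxid; if still equal, compare the myid. The larger one becomes leader.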
        return ((newEpoch > curEpoch) || 
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
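To make the ordering concrete, here is a minimal standalone sketch of the same three-level comparison. The class VoteOrderDemo, the SimpleVote type and the helpers beats and best are hypothetical names made up for this illustration, not ZooKeeper classes, and the weight check from the real method is left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; SimpleVote is a simplified stand-in, not ZooKeeper's Vote.
class VoteOrderDemo {
    static final class SimpleVote {
        final long sid;    // server id (myid)
        final long zxid;   // last transaction id seen by that server
        final long epoch;  // peer epoch

        SimpleVote(long sid, long zxid, long epoch) {
            this.sid = sid;
            this.zxid = zxid;
            this.epoch = epoch;
        }

        @Override
        public String toString() {
            return "sid=" + sid + ", zxid=0x" + Long.toHexString(zxid) + ", epoch=" + epoch;
        }
    }

    // Same ordering as totalOrderPredicate: epoch first, then zxid, then server id.
    static boolean beats(SimpleVote n, SimpleVote cur) {
        return (n.epoch > cur.epoch)
            || ((n.epoch == cur.epoch)
                && ((n.zxid > cur.zxid)
                    || ((n.zxid == cur.zxid) && (n.sid > cur.sid))));
    }

    // Folds a list of votes down to the one that no other vote beats.
    static SimpleVote best(List<SimpleVote> votes) {
        SimpleVote winner = votes.get(0);
        for (SimpleVote v : votes) {
            if (beats(v, winner)) {
                winner = v;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<SimpleVote> votes = Arrays.asList(
            new SimpleVote(1, 0x100000005L, 1),
            new SimpleVote(2, 0x100000007L, 1),
            new SimpleVote(3, 0x100000007L, 1));
        System.out.println(best(votes));
    }
}
```

In this made-up example all three servers share the same epoch, servers 2 and 3 share the highest zxid, so the tie is broken by the server id and server 3 wins.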

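The vote-counting method iterates over the set of received votes and checks whether any proposal is backed by more than half of the servers. In other words, it decides whether a leader has been elected: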

protected boolean termPredicate(HashMap<Long, Vote> votes,Vote vote) {
        HashSet<Long> set = new HashSet<Long>();
        /*
         * First make the views consistent. Sometimes peers will have
         * different zxids for a server depending on timing.
         */
        // Iterate over the received votes and collect the ids of the servers whose vote equals the current one
        for (Map.Entry<Long,Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())){
                set.add(entry.getKey());
            }
        }
        // Count the collected votes and check whether they form a quorum (more than half)
        return self.getQuorumVerifier().containsQuorum(set);
    }
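The counting step can be sketched on its own as follows. This is a simplified illustration under two assumptions: a vote is reduced to just the proposed leader id (the real Vote.equals compares more fields), and the quorum rule is a plain majority, whereas the real code delegates to the configured QuorumVerifier. MajorityDemo and hasMajority are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class: a vote is reduced to "which leader does this server propose".
class MajorityDemo {
    // Returns true when strictly more than half of the ensemble backs proposedLeader.
    static boolean hasMajority(Map<Long, Long> votesBySid, long proposedLeader, int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, Long> entry : votesBySid.entrySet()) {
            if (entry.getValue().longValue() == proposedLeader) {
                supporters.add(entry.getKey());
            }
        }
        return supporters.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> received = new HashMap<>();
        received.put(1L, 3L); // server 1 votes for server 3
        received.put(2L, 3L); // server 2 votes for server 3
        received.put(3L, 3L); // server 3 votes for itself
        received.put(4L, 5L); // server 4 votes for server 5
        System.out.println(hasMajority(received, 3L, 5));
        System.out.println(hasMajority(received, 5L, 5));
    }
}
```

With five servers, three matching votes are enough for server 3, while the single vote for server 5 is not.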

At this point the leader election is finished.

5.3.13 Broadcasting messages

Next, let's look at how the vote is broadcast to the other servers, in sendNotifications:

private void sendNotifications() {
        for (QuorumServer server : self.getVotingView().values()) { // send to every voting member in turn
            long sid = server.id;
            // Build the message to send
            ToSend notmsg = new ToSend(ToSend.mType.notification,
                    proposedLeader,
                    proposedZxid,
                    logicalclock.get(),
                    QuorumPeer.ServerState.LOOKING,
                    sid,
                    proposedEpoch);
            if(LOG.isDebugEnabled()){
                LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                      Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                      " (n.round), " + sid + " (recipient), " + self.getId() +
                      " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
            }
            sendqueue.offer(notmsg); // offer() adds the message to the queue; it is consumed by the WorkerSender thread
        }
    }
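The pattern used here, one message per voting member placed on a queue and a separate thread draining it, can be sketched in isolation like this. Everything in the sketch is hypothetical (NotificationQueueDemo, Msg, broadcast); the worker only formats a string where the real thread would hand the message to the network layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; Msg is a simplified stand-in for the ToSend message.
class NotificationQueueDemo {
    static final class Msg {
        final long recipient;
        final long leader;
        final long zxid;

        Msg(long recipient, long leader, long zxid) {
            this.recipient = recipient;
            this.leader = leader;
            this.zxid = zxid;
        }
    }

    // Enqueues one message per voting peer, then lets a worker thread drain the queue.
    static List<String> broadcast(List<Long> votingPeers, long leader, long zxid) {
        LinkedBlockingQueue<Msg> sendQueue = new LinkedBlockingQueue<>();
        for (long sid : votingPeers) {
            sendQueue.offer(new Msg(sid, leader, zxid)); // non-blocking add
        }
        List<String> sent = Collections.synchronizedList(new ArrayList<>());
        Thread worker = new Thread(() -> {
            Msg m;
            // Stand-in for the sender thread: take messages until the queue is empty.
            while ((m = sendQueue.poll()) != null) {
                sent.add("to " + m.recipient + ": leader=" + m.leader
                        + ", zxid=0x" + Long.toHexString(m.zxid));
            }
        });
        worker.start();
        try {
            worker.join(); // wait for the worker so the result is complete
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sent;
    }

    public static void main(String[] args) {
        for (String line : broadcast(Arrays.asList(1L, 2L, 3L), 3L, 0x100000007L)) {
            System.out.println(line);
        }
    }
}
```

The point of the queue is that the election loop never blocks on network I/O: it only enqueues, and the sending happens on another thread.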

5.3.14 FastLeaderElection election process summary

How ZooKeeper gets ready to accept requests is mainly decided here:

@Override
public synchronized void start() {
        // Load the local database, mainly to recover the epoch
        loadDataBase();
        // Start the ZooKeeperThread that accepts client connections
        cnxnFactory.start();    
        // Start the leader election thread
        startLeaderElection();
        super.start();
}

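Here, cnxnFactory.start() starts the server-side thread that accepts client requests. There are two built-in implementations, NIO and Netty.

How to choose between them can be seen in the following code from ServerCnxnFactory: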

// This is the property key we need to configure
public static final String ZOOKEEPER_SERVER_CNXN_FACTORY = "zookeeper.serverCnxnFactory";
public static ServerCnxnFactory createFactory() throws IOException {
        String serverCnxnFactoryName =
            System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
        if (serverCnxnFactoryName == null) { // defaults to NIO
            serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
        }
        try { // instantiate the configured class, e.g. the Netty factory
            ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                    .getDeclaredConstructor().newInstance();
            LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
            return serverCnxnFactory;
        } catch (Exception e) {
            IOException ioe = new IOException("Couldn't instantiate "
                    + serverCnxnFactoryName);
            ioe.initCause(e);
            throw ioe;
        }
    }

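You can switch the server to the Netty implementation with a JVM system property (a -D option on the java command line, not an operating-system environment variable):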

-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
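The lookup logic in createFactory boils down to "read a system property, fall back to a default". A minimal sketch of just that part, without touching any ZooKeeper classes, might look like this. FactorySelectionDemo and resolveFactoryName are hypothetical names; the two class names are only handled as strings and nothing is instantiated.

```java
// Hypothetical demo class: shows only the property lookup, not the reflection step.
class FactorySelectionDemo {
    static final String KEY = "zookeeper.serverCnxnFactory";
    static final String DEFAULT_FACTORY = "org.apache.zookeeper.server.NIOServerCnxnFactory";

    // Returns the configured factory class name, or the NIO default when the property is unset.
    static String resolveFactoryName() {
        return System.getProperty(KEY, DEFAULT_FACTORY);
    }

    public static void main(String[] args) {
        // Prints the default unless the JVM was started with -Dzookeeper.serverCnxnFactory=...
        System.out.println(resolveFactoryName());
    }
}
```

Because the value is read with System.getProperty, the -D option has to be passed to the JVM itself, before the main class name on the command line.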


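5.4 Note

When simulating the cluster election in IDEA, open a separate IDEA window for each server node. Otherwise the breakpoints of all nodes are hit in the same window, which makes testing awkward.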
