zookeeper (5): ZooKeeper Source Code Analysis

stay hungry,stay you

Category: zookeeper

Article link: https://blog.csdn.net/weixin_41987908/article/details/105008813

Startup source code analysis
Leader election source code analysis

(I) Startup Source Code Analysis

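Leader election is the key to guaranteeing distributed data consistency. ZooKeeper may run a Leader election both at initial startup and while it is running. Leader election mainly happens in the following cases:

When the servers start up for the first time
When the Leader crashes while the servers are running
When the transaction ZXID grows too large while the servers are running, which triggers a Leader election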

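(1) Election at startup

A Leader election needs at least two machines; here we use a cluster of three servers as the example. During cluster initialization, when the first server, Server1, starts, it cannot carry out and complete a Leader election on its own. When the second server, Server2, starts, the two machines can communicate with each other, each of them tries to find the Leader, and the Leader election process begins.

The election proceeds as follows:

Each server casts a vote

Because this is the initial state, Server1 and Server2 each vote for themselves as Leader. Every vote contains the myid and ZXID of the proposed server and is written as (myid, ZXID). Server1's vote is therefore (1, 0) and Server2's vote is (2, 0), and each server sends its vote to the other machines in the cluster.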

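Receive votes from the other servers

When a server in the cluster receives a vote, it first checks that the vote is valid, for example whether it belongs to the current round and whether it comes from a server in the LOOKING state.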

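Process votes

For every vote it receives, a server compares the other server's vote with its own. The comparison rules are:
● Check the epoch first. The server with the larger epoch wins.
● If the epochs are equal, compare the ZXID. The server with the larger ZXID is preferred as Leader.
● If the ZXIDs are also equal, compare the myid. The server with the larger myid becomes the Leader.
Take Server1: its own vote is (1, 0) and it receives Server2's vote (2, 0). It first compares the ZXIDs, which are both 0, and then the myid values. Server2's myid is larger, so Server1 updates its vote to (2, 0) and votes again. Server2 does not need to update its vote; it simply resends its previous vote to all machines in the cluster.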
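
The three comparison rules can be written as a single ordering check. The following is a minimal sketch under assumptions: `VoteComparison`, `Vote`, and `beats` are hypothetical names chosen for illustration and are not ZooKeeper classes; only the epoch, ZXID, myid ordering comes from the text.

```java
// Hypothetical sketch: names are illustrative, not ZooKeeper's API.
final class VoteComparison {

    /** A vote for a proposed Leader: its epoch, its last ZXID, and its myid. */
    static final class Vote {
        final long epoch;
        final long zxid;
        final long myid;

        Vote(long epoch, long zxid, long myid) {
            this.epoch = epoch;
            this.zxid = zxid;
            this.myid = myid;
        }
    }

    /** Returns true when the incoming vote should replace the current one. */
    static boolean beats(Vote incoming, Vote current) {
        if (incoming.epoch != current.epoch) {
            return incoming.epoch > current.epoch;   // rule 1: larger epoch wins
        }
        if (incoming.zxid != current.zxid) {
            return incoming.zxid > current.zxid;     // rule 2: larger ZXID wins
        }
        return incoming.myid > current.myid;         // rule 3: larger myid wins
    }

    public static void main(String[] args) {
        Vote server1 = new Vote(0, 0, 1);   // (myid=1, ZXID=0)
        Vote server2 = new Vote(0, 0, 2);   // (myid=2, ZXID=0)
        System.out.println(beats(server2, server1)); // true: Server1 switches to (2, 0)
        System.out.println(beats(server1, server2)); // false: Server2 keeps its vote
    }
}
```

Two identical votes do not beat each other, so a server that already holds the winning vote never has to update it.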

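Count votes

After each round of voting, every server tallies the votes and checks whether more than half of the machines have received the same vote. Server1 and Server2 both find that two of the three machines in the cluster have accepted the vote (2, 0), so the Leader is considered elected.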
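
The tally step is a strict-majority check. Below is a small sketch under assumptions: `QuorumTally` and `hasQuorum` are hypothetical names, and the map from voter sid to proposed Leader is a simplification of the vote set a server keeps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not ZooKeeper's API.
final class QuorumTally {

    /**
     * Returns true if strictly more than half of the clusterSize servers
     * currently vote for proposedLeader.
     *
     * @param leaderBySid maps each voter's sid to the myid it votes for
     */
    static boolean hasQuorum(Map<Long, Long> leaderBySid, long proposedLeader, int clusterSize) {
        long supporters = leaderBySid.values().stream()
                .filter(leader -> leader == proposedLeader)
                .count();
        return supporters > clusterSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 2L);   // Server1 votes for Server2
        votes.put(2L, 2L);   // Server2 votes for itself
        System.out.println(hasQuorum(votes, 2L, 3)); // true: 2 of 3 servers agree
    }
}
```

Integer division makes the check strict for even sizes as well: with four servers, `clusterSize / 2` is 2, so three supporters are required.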

Change server state

Once the Leader has been determined, every server updates its own state: a Follower changes to FOLLOWING and the Leader changes to LEADING.

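(2) Election while running

While ZooKeeper is running, the Leader and the non-Leader servers each do their own job. A non-Leader server going down or a new one joining does not affect the Leader. Once the Leader itself goes down, however, the whole cluster suspends external service and enters a new round of Leader election, which works in basically the same way as the election at startup. Suppose three servers, Server1, Server2, and Server3, are running and the current Leader is Server2. If the Leader fails at some moment, a Leader election begins.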

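The election proceeds as follows:

Change state

After the Leader fails, the remaining non-Observer servers change their state to LOOKING and then enter the Leader election process.

Each server casts a vote

While running, the ZXID on each server may differ. Assume Server1's ZXID is 8 and Server3's ZXID is 9. In the first round, Server1 and Server3 both vote for themselves, producing the votes (1, 8) and (3, 9), and each sends its vote to all machines in the cluster.

Receive votes from the other servers (same as at startup)
Process votes (same as at startup)
Count votes (same as at startup)
Change server state (same as at startup)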
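
The runtime scenario can be traced end to end with a toy model. This is only a sketch under stated assumptions: both survivors are in the same epoch (so only ZXID and myid are compared), every vote reaches every peer in one round, and `RuntimeElectionSketch`, `Peer`, and `elect` are hypothetical names rather than ZooKeeper classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are ZooKeeper classes.
final class RuntimeElectionSketch {

    enum State { LOOKING, FOLLOWING, LEADING }

    static final class Peer {
        final long myid;
        final long zxid;
        long votedLeader;            // myid of the server this peer votes for
        long votedZxid;              // ZXID of that server
        State state = State.LOOKING;

        Peer(long myid, long zxid) {
            this.myid = myid;
            this.zxid = zxid;
            this.votedLeader = myid; // first round: vote for yourself
            this.votedZxid = zxid;
        }

        /** Adopts the received vote if it has a larger ZXID, or the same ZXID and a larger myid. */
        void receive(long leader, long leaderZxid) {
            boolean better = leaderZxid > votedZxid
                    || (leaderZxid == votedZxid && leader > votedLeader);
            if (better) {
                votedLeader = leader;
                votedZxid = leaderZxid;
            }
        }
    }

    /** Exchanges one round of votes; returns the elected myid, or -1 if there is no majority. */
    static long elect(Peer[] peers, int clusterSize) {
        // Snapshot the first-round votes so every peer sees the same set.
        long[][] round = new long[peers.length][];
        for (int i = 0; i < peers.length; i++) {
            round[i] = new long[] {peers[i].votedLeader, peers[i].votedZxid};
        }
        for (Peer p : peers) {
            for (long[] vote : round) {
                p.receive(vote[0], vote[1]);
            }
        }
        // Tally the updated votes and look for a strict majority of the cluster.
        Map<Long, Integer> tally = new HashMap<>();
        for (Peer p : peers) {
            tally.merge(p.votedLeader, 1, Integer::sum);
        }
        for (Map.Entry<Long, Integer> e : tally.entrySet()) {
            if (e.getValue() > clusterSize / 2) {
                long leader = e.getKey();
                for (Peer p : peers) {
                    p.state = (p.myid == leader) ? State.LEADING : State.FOLLOWING;
                }
                return leader;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        Peer server1 = new Peer(1, 8);
        Peer server3 = new Peer(3, 9);
        long leader = elect(new Peer[] {server1, server3}, 3);
        // prints: leader=3 server1=FOLLOWING server3=LEADING
        System.out.println("leader=" + leader + " server1=" + server1.state
                + " server3=" + server3.state);
    }
}
```

Server3 wins because its ZXID 9 is larger than Server1's 8, and two votes out of a three-server cluster form a majority even though Server2 is down.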

(3) Leader election source code analysis

ZooKeeper's election strategy interface is:

public interface Election {
    public Vote lookForLeader() throws InterruptedException;
    public void shutdown();
}

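ZooKeeper provides three Leader election algorithm implementations:

LeaderElection: the simplest implementation of Fast Paxos (after starting, each server asks the other servers whom they intend to vote for)
AuthFastLeaderElection: essentially the same as the FastLeaderElection algorithm, except that authentication information is added to the messages
FastLeaderElection: ZooKeeper's default election strategy (every server proposes itself as Leader)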

// org.apache.zookeeper.server.quorum.QuorumPeer
protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;

    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
        case 0:
            // 0 selects LeaderElection
            le = new LeaderElection(this);
            break;
        case 1:
            // 1 selects AuthFastLeaderElection with authentication disabled
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            // 2 selects AuthFastLeaderElection with authentication enabled
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            // 3 selects FastLeaderElection
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
    }
    return le;
}

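FastLeaderElection implements the Election interface, so it must implement the lookForLeader and shutdown methods defined there. It is a standard implementation of the Fast Paxos algorithm, and the servers run the election over TCP.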

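(4) Main classes

Notification: represents the vote information received by the current node (election votes sent by other nodes)

// Vote information is sent in two cases:
// 1. The sending node has joined the Leader election
// 2. The sending node has learned from another node of a larger zxid, or of the same zxid with a larger server id
static public class Notification {
    // Proposed Leader
    long leader;

    // zxid of the proposed Leader
    long zxid;

    // Election round, used to tell whether several votes belong to the same round
    long electionEpoch;

    // Current state of the sender
    QuorumPeer.ServerState state;

    // sid of the server that sent this message
    long sid;

    // Epoch of the proposed Leader
    long peerEpoch;
}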

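ToSend: the vote information that the current node wants to send to other nodes

static public class ToSend {
    // Proposed Leader
    long leader;

    // zxid of the proposed Leader
    long zxid;

    // Election round, used to tell whether several votes belong to the same round
    long electionEpoch;

    // Current state of this node
    QuorumPeer.ServerState state;

    // sid of the receiving node
    long sid;

    // Epoch of the proposed Leader
    long peerEpoch;
}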

FastLeaderElection

// org.apache.zookeeper.server.quorum.FastLeaderElection
public class FastLeaderElection implements Election {

    // Handles the network communication between servers during Leader election
    QuorumCnxManager manager;

    // Queue of votes to send
    LinkedBlockingQueue<ToSend> sendqueue;
    // Queue of received votes
    LinkedBlockingQueue<Notification> recvqueue;

    // Message handling class
    protected class Messenger {
        // Receives messages from QuorumCnxManager and processes them
        class WorkerReceiver extends ZooKeeperThread { ...... }

        // Dequeues messages to send and puts them into QuorumCnxManager's queue
        class WorkerSender extends ZooKeeperThread { ...... }

        // Vote sending thread. Converts a FastLeaderElection ToSend into a QuorumCnxManager Message.
        // Keeps taking pending votes from sendqueue and passes them to QuorumCnxManager
        WorkerSender ws;

        // Vote receiving thread. Converts a QuorumCnxManager Message into a FastLeaderElection Notification.
        // Keeps taking election messages sent by other servers from QuorumCnxManager, converts each into a vote, and stores it in recvqueue.
        // While receiving, if the election round of an external vote is lower than this server's, the external vote is ignored
        WorkerReceiver wr;
    }

    // Manages the quorum protocol and switches roles according to the voting result
    QuorumPeer self;
    Messenger messenger;
    // Logical clock: the round number of this server's Leader election
    AtomicLong logicalclock = new AtomicLong();
    // Proposed Leader
    long proposedLeader;
    // Zxid of the proposed Leader
    long proposedZxid;
    // Election epoch of the proposed Leader
    long proposedEpoch;
}
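
Before a vote can be handed to the connection manager it has to become bytes, and on the receiving side the bytes become a vote again. The sketch below illustrates that round trip with `java.nio.ByteBuffer`. The field order and the 36-byte size are assumptions made for this example and are not claimed to match ZooKeeper's actual wire format; `VoteMessageSketch` and `VoteMessage` are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the layout below is an assumption, not ZooKeeper's wire format.
final class VoteMessageSketch {

    /** One int (sender state) followed by four longs. */
    static final int SIZE = Integer.BYTES + 4 * Long.BYTES;

    static final class VoteMessage {
        final int state;           // sender state as an ordinal, e.g. 0 = LOOKING
        final long leader;         // proposed Leader
        final long zxid;           // zxid of the proposed Leader
        final long electionEpoch;  // election round
        final long peerEpoch;      // epoch of the proposed Leader

        VoteMessage(int state, long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.state = state;
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }
    }

    /** Serializes the vote into a buffer that is ready to be read. */
    static ByteBuffer encode(VoteMessage m) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);
        buf.putInt(m.state)
           .putLong(m.leader)
           .putLong(m.zxid)
           .putLong(m.electionEpoch)
           .putLong(m.peerEpoch);
        buf.flip();
        return buf;
    }

    /** Reads the fields back in the same order without moving the caller's position. */
    static VoteMessage decode(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        return new VoteMessage(view.getInt(), view.getLong(), view.getLong(),
                view.getLong(), view.getLong());
    }

    public static void main(String[] args) {
        ByteBuffer wire = encode(new VoteMessage(0, 3L, 9L, 1L, 1L));
        VoteMessage back = decode(wire);
        System.out.println(wire.remaining() + " bytes, leader=" + back.leader
                + ", zxid=" + back.zxid);
    }
}
```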

QuorumCnxManager

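Every server starts a QuorumCnxManager when it starts up. It is mainly responsible for the network communication between servers during Leader election.
QuorumCnxManager maintains a set of queues internally that hold received messages, messages waiting to be sent, and the message senders. The send-side structures are grouped by SID into collections. For example, if the cluster contains two machines besides this one, the current server creates one send queue for each of the other two.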

// org.apache.zookeeper.server.quorum.QuorumCnxManager
public class QuorumCnxManager {
    
    // Message senders
    // Grouped by sid; each SendWorker corresponds to exactly one server
    final ConcurrentHashMap<Long, SendWorker> senderWorkerMap;

    // Message send queues, holding the messages waiting to be sent
    // Grouped by sid; each server gets its own queue
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;

    // Keeps the most recently sent message for each sid
    final ConcurrentHashMap<Long, ByteBuffer> lastMessageSent;

    // Message receive queue
    public final ArrayBlockingQueue<Message> recvQueue;
}
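
The per-SID grouping can be sketched with the same collection types that appear in the fields above. This is an illustration under assumptions, not the QuorumCnxManager code: `PerSidSendQueues` is a hypothetical class, and dropping the oldest message when a queue is full is a policy chosen for this example.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: not the QuorumCnxManager implementation.
final class PerSidSendQueues {

    private final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();
    private final int capacity;

    PerSidSendQueues(int capacity) {
        this.capacity = capacity;
    }

    /** Queues a message for one peer, creating that peer's queue on first use. */
    void enqueue(long sid, ByteBuffer message) {
        ArrayBlockingQueue<ByteBuffer> queue = queueSendMap.computeIfAbsent(
                sid, k -> new ArrayBlockingQueue<ByteBuffer>(capacity));
        // Assumed policy: when the queue is full, drop the oldest entry to make room.
        while (!queue.offer(message)) {
            queue.poll();
        }
    }

    /** Returns the next message for the peer, or null if nothing is pending. */
    ByteBuffer next(long sid) {
        ArrayBlockingQueue<ByteBuffer> queue = queueSendMap.get(sid);
        return queue == null ? null : queue.poll();
    }

    /** Number of peers that currently have a send queue. */
    int peerCount() {
        return queueSendMap.size();
    }

    public static void main(String[] args) {
        PerSidSendQueues queues = new PerSidSendQueues(1);
        queues.enqueue(2L, ByteBuffer.wrap(new byte[] {1}));
        queues.enqueue(3L, ByteBuffer.wrap(new byte[] {2}));
        System.out.println(queues.peerCount()); // 2: one queue per remote server
    }
}
```

Keeping one queue per peer means a slow or unreachable server only delays its own messages, not those destined for the rest of the cluster.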

The algorithm in detail:

public Vote lookForLeader() throws InterruptedException {
    if (self.start_fle == 0) {
        self.start_fle = Time.currentElapsedTime();
    }
    try {
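        // Votes received from other nodes
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

        // Votes from servers that have already finished the election;
        // this set is used to decide whether the election can end
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

        // Timeout for polling notifications; starts at finalizeWait
        int notTimeout = finalizeWait;

        synchronized(this){
            // Increment the logical clock
            logicalclock.incrementAndGet();
            // Initialize the proposal with the local sid, zxid, and epoch
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        // Broadcast the new vote by inserting it into the send queue (sendqueue)
        sendNotifications();

        // Keep exchanging notifications until a Leader is found
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){
            // Take one received message from recvqueue
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

            // If nothing was received, send more notifications; otherwise process the new message.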
            if(n == null){
                if(manager.haveDelivered()){
                    // All messages have been delivered, so send more
                    sendNotifications();
                } else {
                    // Otherwise the network may have a problem; reconnect to the other nodes
                    manager.connectAll();
                }

                // Back off before retrying
                int tmpTimeOut = notTimeout*2;
                notTimeout = (tmpTimeOut < maxNotificationInterval?
                              tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            // A message was received
            else if(validVoter(n.sid) && validVoter(n.leader)) {
                // Continue only if the vote comes from the voting view (that is, from a server node of this cluster)
                switch (n.state) {
                    case LOOKING:
                        // Compare election rounds: the round in the vote is ahead of the local round
                        if (n.electionEpoch > logicalclock.get()) {
                            // Set the local logical clock to the vote's logical clock
                            logicalclock.set(n.electionEpoch);
                            // Clear the cache of received votes
                            recvset.clear();
                            // Compare the received vote with the local initial vote
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                                   getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                // The received vote wins, so update the local vote
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                // Otherwise keep using the local vote
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            // Broadcast the new vote
                            sendNotifications();
                        // The vote's logical clock is behind the local one: only log it and ignore the vote
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                            }
                            break;
                        // The vote's logical clock equals the local one, so both are in the same round: compare the votes
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                                       proposedLeader, proposedZxid, proposedEpoch)) {
                            // The received vote wins, so update the local vote
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            // Broadcast the new vote
                            sendNotifications();
                        }

                        // Add to the local vote set, which is used to decide when the election ends
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        // Check whether the proposal has won; the default quorum rule is a majority
                        if (termPredicate(recvset,
                                          new Vote(proposedLeader, proposedZxid,
                                                   logicalclock.get(), proposedEpoch))) {

                            // Check whether any new vote arrives (it could change the Leader).
                            // The server state is not updated immediately; instead wait a while (200 ms by default) to see whether new votes come in
                            while((n = recvqueue.poll(finalizeWait,
                                                      TimeUnit.MILLISECONDS)) != null){
                                // A newly arrived vote wins
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                                       proposedLeader, proposedZxid, proposedEpoch)){
                                    // Put it back into the receive queue so the outer loop processes it
                                    recvqueue.put(n);
                                    // Leave the loop
                                    break;
                                }
                            }

                            // No new message was read from the receive queue, so end this election round
                            if (n == null) {
                                // Decide by sid whether this server is the Leader
                                self.setPeerState((proposedLeader == self.getId()) ?
                                                  ServerState.LEADING: learningState());
                                // Build the final vote
                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                // Return the final vote
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
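                        // The sender already regards some server as Leader; first check whether the vote is from the current round
                        if(n.electionEpoch == logicalclock.get()){
                            // Add to the local vote set
                            recvset.put(n.sid, new Vote(n.leader,
                                                        n.zxid,
                                                        n.electionEpoch,
                                                        n.peerEpoch));

                            // Check whether a Leader has already been elected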
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                                  ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                                        n.zxid, 
                                                        n.electionEpoch, 
                                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

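                        // Votes from servers that have finished the election go into outofelection
                        outofelection.put(n.sid, new Vote(n.version,
                                                          n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch,
                                                          n.state));

                        // Before changing this node's state, verify that a majority follows the same Leader
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                // Adopt the election round and set the node state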
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                                  ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                 n.state, n.sid);
                        break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    .......
                }
                if (!validVoter(n.sid)) {
                    .......
                }
            }
        }
        return null;
    } finally {

    }
}

// Checks whether an elected node can really act as Leader
// If everyone else thinks I am the Leader, then I must be the Leader
protected boolean checkLeader(
    HashMap<Long, Vote> votes,
    long leader,
    long electionEpoch){

    boolean predicate = true;
    
    // The two checks below apply only when I am not the Leader. If I am not the Leader and have not received a message from the Leader, return false.
    if(leader != self.getId()){
        if(votes.get(leader) == null) predicate = false;
        else if(votes.get(leader).getState() != ServerState.LEADING) predicate = false;
    } else if(logicalclock.get() != electionEpoch) {
        predicate = false;
    } 

    return predicate;
}
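
The back-off step in lookForLeader doubles the polling timeout each time no notification arrives, up to a ceiling. The sketch below isolates that rule. It assumes a starting value of 200 ms (the default wait mentioned in the comments above) and a ceiling of 60000 ms; the ceiling value and the names `NotificationBackoff` and `next` are assumptions for illustration.

```java
// Hypothetical sketch of the capped doubling used for the notification timeout.
final class NotificationBackoff {

    static final int FINALIZE_WAIT = 200;                 // ms, initial timeout
    static final int MAX_NOTIFICATION_INTERVAL = 60_000;  // ms, assumed ceiling

    /** Doubles the timeout but never exceeds max. */
    static int next(int current, int max) {
        return Math.min(current * 2, max);
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT;
        StringBuilder sequence = new StringBuilder().append(timeout);
        while (timeout < MAX_NOTIFICATION_INTERVAL) {
            timeout = next(timeout, MAX_NOTIFICATION_INTERVAL);
            sequence.append(' ').append(timeout);
        }
        // prints: 200 400 800 1600 3200 6400 12800 25600 51200 60000
        System.out.println(sequence);
    }
}
```

Capping the interval keeps a server that is isolated for a long time from waiting arbitrarily long between retries once the network recovers.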

Source code flow:
[Figure: source code flow]

(II) Leader Election Source Code Analysis

(1) ZooKeeper startup

org.apache.zookeeper.server.quorum.QuorumPeerMain

public void runFromConfig(QuorumPeerConfig config) throws IOException {
    try {
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    LOG.info("Starting quorum peer");
    try {
        ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
        cnxnFactory.configure(config.getClientPortAddress(),
                              config.getMaxClientCnxns());

        quorumPeer = getQuorumPeer();

        quorumPeer.setQuorumPeers(config.getServers());
        quorumPeer.setTxnFactory(new FileTxnSnapLog(
            new File(config.getDataLogDir()),
            new File(config.getDataDir())));
        quorumPeer.setElectionType(config.getElectionAlg());
        quorumPeer.setMyid(config.getServerId());
        quorumPeer.setTickTime(config.getTickTime());
        quorumPeer.setInitLimit(config.getInitLimit());
        quorumPeer.setSyncLimit(config.getSyncLimit());
        quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
        quorumPeer.setCnxnFactory(cnxnFactory);
        quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
        quorumPeer.setClientPortAddress(config.getClientPortAddress());
        quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
        quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        quorumPeer.setLearnerType(config.getPeerType());
        quorumPeer.setSyncEnabled(config.getSyncEnabled());

        // sets quorum sasl authentication configurations
        quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
        if(quorumPeer.isQuorumSaslAuthEnabled()){
            quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
            quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
            quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
            quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
            quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
        }

        quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
        quorumPeer.initialize();

        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    }
}

org.apache.zookeeper.server.quorum.QuorumPeer

public QuorumPeer(Map<Long, QuorumServer> quorumPeers, File dataDir,
                  File dataLogDir, int electionType,
                  long myid, int tickTime, int initLimit, int syncLimit,
                  boolean quorumListenOnAllIPs,
                  ServerCnxnFactory cnxnFactory, 
                  QuorumVerifier quorumConfig) throws IOException {
    this();
    this.cnxnFactory = cnxnFactory;
    this.quorumPeers = quorumPeers;
    this.electionType = electionType;
    this.myid = myid;
    this.tickTime = tickTime;
    this.initLimit = initLimit;
    this.syncLimit = syncLimit;        
    this.quorumListenOnAllIPs = quorumListenOnAllIPs;
    // Snapshot and transaction log data
    this.logFactory = new FileTxnSnapLog(dataLogDir, dataDir);
    // The in-memory database is built from the snapshot and log data
    this.zkDb = new ZKDatabase(this.logFactory);
    if(quorumConfig == null)
        this.quorumConfig = new QuorumMaj(countParticipants(quorumPeers));
    else this.quorumConfig = quorumConfig;
}

(2) Loading data

org.apache.zookeeper.server.ZKDatabase

public ZKDatabase(FileTxnSnapLog snapLog) {
    dataTree = new DataTree();
    sessionsWithTimeouts = new ConcurrentHashMap<Long, Integer>();
    this.snapLog = snapLog;
}

public long loadDataBase() throws IOException {
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
    initialized = true;
    return zxid;
}

(3) Initializing the in-memory data

org.apache.zookeeper.server.quorum.QuorumPeer

private void loadDataBase() {
    File updating = new File(getTxnFactory().getSnapDir(),
                             UPDATING_EPOCH_FILENAME);
    try {
        zkDb.loadDataBase();

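        // Get the last processed zxid from the data restored into memory
        long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
        // Extract the epoch from the zxid
        long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
        try {
            // Load currentEpoch from the currentEpoch file
            currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
            // If the epoch in the zxid is larger than currentEpoch and the updating file exists,
            // the server stopped after taking a snapshot but before updating currentEpoch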
            if (epochOfZxid > currentEpoch && updating.exists()) {
                LOG.info("{} found. The server was terminated after " +
                         "taking a snapshot but before updating current " +
                         "epoch. Setting current epoch to {}.",
                         UPDATING_EPOCH_FILENAME, epochOfZxid);
                setCurrentEpoch(epochOfZxid);
                if (!updating.delete()) {
                    throw new IOException("Failed to delete " +
                                          updating.toString());
                }
            }
        } catch(FileNotFoundException e) {
            // pick a reasonable epoch number
            // this should only happen once when moving to a
            // new code version
            currentEpoch = epochOfZxid;
            // Persist currentEpoch
            writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
        }
        if (epochOfZxid > currentEpoch) {
            // Throw an exception
        }
        try {
            acceptedEpoch = readLongFromFile(ACCEPTED_EPOCH_FILENAME);
        } catch(FileNotFoundException e) {
            // pick a reasonable epoch number
            // this should only happen once when moving to a
            // new code version
            acceptedEpoch = epochOfZxid;
            // Persist acceptedEpoch
            writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
        }
        if (acceptedEpoch < currentEpoch) {
            // Throw an exception
        }
    } catch(IOException ie) {
        LOG.error("Unable to load database on disk", ie);
        throw new RuntimeException("Unable to run quorum server ", ie);
    }
}
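
loadDataBase derives an epoch from the last processed zxid. The sketch below shows how such a split can work, assuming the common layout in which the high 32 bits of the 64-bit zxid hold the epoch and the low 32 bits hold a per-epoch counter. `ZxidLayout` and its methods are hypothetical helpers written for this example, not the `ZxidUtils` source.

```java
// Hypothetical sketch: assumes high 32 bits = epoch, low 32 bits = counter.
final class ZxidLayout {

    /** Extracts the epoch from the upper half of the zxid. */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the per-epoch transaction counter from the lower half. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Combines an epoch and a counter into one zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 9);
        System.out.println(Long.toHexString(zxid));                // 500000009
        System.out.println(epochOf(zxid) + " " + counterOf(zxid)); // 5 9
    }
}
```

Under this layout the counter cannot grow past 0xffffffff within a single epoch, which is one way to read the "ZXID grows too large" case listed among the election triggers.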
