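ZooKeeper Source Code Analysis (Part 2: Leader Election Algorithm, Approach and Source Walkthrough)

Leader election algorithm: overall approach

        ZooKeeper's leader election class is FastLeaderElection. It is the engineering implementation of leader election in the ZAB protocol, so we go straight to that class. Its most important method is lookForLeader(), the core of leader election. The method can be broken down roughly into the following parts: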

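1 Preparation before the election

        Some preparation is needed before the election starts, for example creating the election object, creating the collections used during the election, and initializing the election timeout.

2 Vote for itself as the initial leader

        When the current server casts its first vote, it proposes itself as leader and then broadcasts that vote to all other servers.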
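
        Steps 1 and 2 can be sketched roughly as follows. This is a simplified model written for this article, not ZooKeeper's own code: the class SelfVoteSketch, its Ballot type and the startRound() method are hypothetical names, and sending is reduced to returning the list of servers that should be notified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of steps 1 and 2; not ZooKeeper's own classes.
final class SelfVoteSketch {
    /** A vote: the proposed leader, that leader's zxid, and the election round. */
    static final class Ballot {
        final long leader;
        final long zxid;
        final long electionEpoch;

        Ballot(long leader, long zxid, long electionEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
        }
    }

    final long mySid;
    final long myLastZxid;
    final AtomicLong logicalClock = new AtomicLong();
    // Step 1: votes received in the current round, keyed by sender id.
    final Map<Long, Ballot> received = new HashMap<>();
    Ballot proposal;

    SelfVoteSketch(long mySid, long myLastZxid) {
        this.mySid = mySid;
        this.myLastZxid = myLastZxid;
    }

    /** Step 2: vote for ourselves and return the ids of the servers to notify. */
    List<Long> startRound(List<Long> allServerIds) {
        received.clear();
        long round = logicalClock.incrementAndGet();
        proposal = new Ballot(mySid, myLastZxid, round);
        List<Long> recipients = new ArrayList<>();
        for (long sid : allServerIds) {
            if (sid != mySid) {
                recipients.add(sid);
            }
        }
        return recipients;
    }
}
```

        In a three-server ensemble with ids 1, 2 and 3, server 1 would end up with a proposal naming itself and a recipient list containing 2 and 3.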

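3 Compare its own vote with everyone else's to see who is the better leader

        After voting for itself, the current server also receives vote notifications (Notification) from the other servers. In a while loop it goes through every notification it has received and checks which vote names the more suitable leader. If it finds a leader more suitable than its current choice, it updates its own vote and broadcasts the new vote. Every vote it checks is also recorded in a collection, which is later used to count the votes.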
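
        The "who is more suitable" comparison can be sketched as a total order over votes. The sketch below is a minimal illustration, assuming the ordering is epoch first, then zxid, then server id; the class and method names (VoteOrderSketch, isBetter) are made up for this example.

```java
// Hypothetical helper illustrating the vote comparison; names are not from ZooKeeper.
final class VoteOrderSketch {
    /**
     * Returns true when the incoming vote (newId, newZxid, newEpoch) should
     * replace the current one: the higher epoch wins; on equal epochs the
     * higher zxid wins; on equal zxids the higher server id wins.
     */
    static boolean isBetter(long newId, long newZxid, long newEpoch,
                            long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }
}
```

        Because the server id breaks the final tie, two different votes can never compare as equal, so every server that sees the same set of votes settles on the same one.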

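4 Decide whether this round of the election should end

        Right after each comparison, the server checks whether the election can end, that is, whether the vote it currently proposes is backed by more than half of the servers. If so, it does the wrap-up work: for example, it clears the collections used during the election so they are ready for next time, and it produces the final vote so that other servers can synchronize against it. If not, it reads the next vote from another host off the queue and checks it in turn.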
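
        The "more than half" test can be sketched as a simple count over the recorded votes. This is only an illustration under simplifying assumptions: votes are reduced to a map from sender id to proposed leader id, every server counts equally, and the names QuorumCheckSketch and hasMajority are hypothetical.

```java
import java.util.Map;

// Hypothetical majority check over the votes recorded so far.
final class QuorumCheckSketch {
    /**
     * votes maps each sender id to the leader id that sender proposes.
     * Returns true when strictly more than half of the ensemble backs
     * proposedLeader.
     */
    static boolean hasMajority(Map<Long, Long> votes, long proposedLeader, int ensembleSize) {
        int count = 0;
        for (long leader : votes.values()) {
            if (leader == proposedLeader) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }
}
```

        With five servers, three matching votes are enough; with four servers, two are not, because the comparison is strict.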

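Leader election algorithm: source walkthrough

        Reading the source involves two things: reading the comments on the important classes, member variables and methods, and analysing the business logic of the important methods.

1 Source structure

2 Leader election source

Selected code from FastLeaderElection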

/**
 * Implementation of leader election using TCP. It uses an object of the class
 * QuorumCnxManager to manage connections. Otherwise, the algorithm is push-based
 * as with the other UDP implementations.
 *
 * There are a few parameters that can be tuned to change its behavior. First,
 * finalizeWait determines the amount of time to wait until deciding upon a leader.
 * This is part of the leader election algorithm.
 *
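 * Notes: leader election runs over TCP, and a QuorumCnxManager object manages
 * the connections to the other servers. "Otherwise" means "apart from that":
 * the algorithm itself is push-based, like the other (UDP-based) election
 * implementations. It does not mean that UDP is used when no QuorumCnxManager
 * is present.
 * finalizeWait is a constant defined in this class.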
 */


public class FastLeaderElection implements Election {
    private static final Logger LOG = LoggerFactory.getLogger(FastLeaderElection.class);

    /**
     * Determine how much time a process has to wait
     * once it believes that it has reached the end of
     * leader election.
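     * Notes: once this server believes the election is over, it keeps
     * waiting up to finalizeWait milliseconds (200 by default) for further
     * notifications from other servers before it settles on the leader.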
     */
    final static int finalizeWait = 200;


    /**
     * Upper bound on the amount of time between two consecutive
     * notification checks. This impacts the amount of time to get
     * the system up again after long partitions. Currently 60 seconds.
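     * Notes: upper bound on the interval between two consecutive notification
     * checks, 60 seconds by default. It affects how long the system takes to
     * come back up after a long network partition. In the 3.4.x
     * lookForLeader() the polling timeout starts at finalizeWait and is
     * doubled after every empty poll until it reaches this cap.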
     */

    final static int maxNotificationInterval = 60000;

    /**
     * Connection manager. Fast leader election uses TCP for
     * communication between peers, and QuorumCnxManager manages
     * such connections.
     * Notes: FastLeaderElection talks to its peer servers over TCP, and
     * QuorumCnxManager is the class that manages those connections.
     */

    QuorumCnxManager manager;


    /**
     * Notifications are messages that let other peers know that
     * a given peer has changed its vote, either because it has
     * joined leader election or because it learned of another
     * peer with higher zxid or same zxid and higher server id
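     * Notes: a Notification tells the other servers that this server has
     * changed its vote. A vote changes for one of two reasons: the server has
     * just joined leader election (in a new round it first votes for itself),
     * or it has learned of another peer with a higher zxid, or with the same
     * zxid and a higher server id (myid).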
     */

    static public class Notification {
        /*
         * Format version, introduced in 3.4.6
         */
        
        public final static int CURRENTVERSION = 0x1; 
        int version;
                
        /*
         * Proposed leader
         * (server id of the server this vote proposes as leader)
         */
        long leader;

        /*
         * zxid of the proposed leader
         * (the largest zxid of the server this vote proposes as leader)
         */
        long zxid;

        /*
         * Epoch
         * (the epoch of the current election round, i.e. the logical clock)
         */
        long electionEpoch;

        /*
         * current state of sender
         * (state of the server that sent this notification, one of four:
         * LOOKING, FOLLOWING, LEADING, OBSERVING)
         */
        QuorumPeer.ServerState state;

        /*
         * Address of sender
         * (server id of the server that sent this notification)
         */
        long sid;

        /*
         * epoch of the proposed leader
         * (the epoch of the server this vote proposes as leader)
         */
        long peerEpoch;
            
            .
            .
            .
            .
            .
            .
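    QuorumPeer self;  // the server taking part in the election (this host)
    Messenger messenger;
    // logicalclock: the logical clock, held in an AtomicLong
    AtomicLong logicalclock = new AtomicLong(); /* Election instance */
    // the current server's own proposal (whom it recommends as leader)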
    long proposedLeader;
    long proposedZxid;
    long proposedEpoch;

            .
            .
            .
            .
            .
            .
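
The relationship between finalizeWait and maxNotificationInterval can be illustrated with a small back-off sketch. It assumes the behaviour described in the comments above: the polling timeout starts at finalizeWait and doubles after each empty poll, capped at maxNotificationInterval. The class NotificationTimeoutSketch and the method nextTimeout are hypothetical names, not part of ZooKeeper.

```java
// Hypothetical illustration of a capped, doubling notification timeout.
final class NotificationTimeoutSketch {
    static final int FINALIZE_WAIT = 200;               // same value as finalizeWait
    static final int MAX_NOTIFICATION_INTERVAL = 60000; // same value as maxNotificationInterval

    /** Timeout to use after an empty poll: double it, but never exceed the cap. */
    static int nextTimeout(int current) {
        return Math.min(current * 2, MAX_NOTIFICATION_INTERVAL);
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT;
        // Print the timeouts used for ten consecutive empty polls.
        for (int i = 0; i < 10; i++) {
            System.out.println(timeout);
            timeout = nextTimeout(timeout);
        }
    }
}
```

Starting from 200 ms, the timeout reaches the 60-second cap on the tenth poll and stays there.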

QuorumPeer

/**
 * This class manages the quorum protocol. There are three states this server
 * can be in:
 * <ol>
 * <li>Leader election - each server will elect a leader (proposing itself as a
 * leader initially).</li>
 * <li>Follower - the server will synchronize with the leader and replicate any
 * transactions.</li>
 * <li>Leader - the server will process requests and forward them to followers.
 * A majority of followers must log the request before it can be accepted.
 * </ol>
 *
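 * Notes: this class manages the quorum voting protocol. The three states
 * above map onto the code as follows:
 * (1) Leader election: every server tries to elect a leader, initially
 * proposing itself. This is the LOOKING state.
 * (2) Follower: the server synchronizes with the leader and replicates all
 * transactions. "Transaction" here means a finalized proposal; the zxid is
 * the ZooKeeper transaction id.
 * (3) Leader: the server processes requests and forwards them to the
 * followers. "Request" here means a write request: on receiving one, the
 * leader sends a proposal to all followers, and the write is accepted only
 * after a majority of followers have logged it.
 *
 * This class will setup a datagram socket that will always respond with its
 * view of the current leader. The response will take the form of: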
 * <pre>
 * int xid;
 *
 * long myid;
 *
 * long leader_id;
 *
 * long leader_zxid;
 * </pre>
 *
 * The request for the current leader will consist solely of an xid: int xid;
 */
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumPeer.class);

    QuorumBean jmxQuorumBean;
    LocalPeerBean jmxLocalPeerBean;
    LeaderElectionBean jmxLeaderElectionBean;
    QuorumCnxManager qcm;
    QuorumAuthServer authServer;
    QuorumAuthLearner authLearner;
    // VisibleForTesting. This flag is used to know whether qLearner's and
    // qServer's login context has been initialized as ApacheDS has concurrency
    // issues. Refer https://issues.apache.org/jira/browse/ZOOKEEPER-2712
    private boolean authInitialized = false;

            .
            .
            .
            .
            .
            .

QuorumCnxManager

/**
 * This class implements a connection manager for leader election using TCP. It
 * maintains one connection for every pair of servers. The tricky part is to
 * guarantee that there is exactly one connection for every pair of servers that
 * are operating correctly and that can communicate over the network.
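 * Notes: this class is a TCP connection manager for leader election and keeps
 * one connection per pair of servers. The hard part is guaranteeing exactly
 * one connection for every pair of servers that are running correctly and
 * can reach each other over the network.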
 *
 * If two servers try to start a connection concurrently, then the connection
 * manager uses a very simple tie-breaking mechanism to decide which connection
 * to drop based on the IP addresses of the two parties.
 * Notes: "tie-breaking" means deciding which of the two concurrent
 * connections is kept; the other one is dropped.
 *
 *
 * For every peer, the manager maintains a queue of messages to send. If the
 * connection to any particular peer drops, then the sender thread puts the
 * message back on the list. As this implementation currently uses a queue
 * implementation to maintain messages to send to another peer, we add the
 * message to the tail of the queue, thus changing the order of messages.
 * Although this is not a problem for the leader election, it could be a problem
 * when consolidating peer communication. This is to be verified, though.
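 * Notes: the manager keeps one send queue per peer (server). If the
 * connection to a peer drops, the sender thread puts the message back on
 * that peer's queue. Because the message is appended at the tail, the order
 * of messages can change. That is not a problem for leader election, but it
 * could be one if peer communication were consolidated; this still has to
 * be verified.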
 */

public class QuorumCnxManager {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumCnxManager.class);

    /*
     * Maximum capacity of thread queues
     */
    static final int RECV_CAPACITY = 100;
    // Initialized to 1 to prevent sending
    // stale notifications to peers
    static final int SEND_CAPACITY = 1;

    static final int PACKETMAXSIZE = 1024 * 512;

    /*
     * Max buffer size to be read from the network.
     */
    static public final int maxBuffer = 2048;
    
    /*
     * Negative counter for observer server ids.
     */
    
    private AtomicLong observerCounter = new AtomicLong(-1);
    
    /*
     * Connection time out value in milliseconds 
     */
    
    private int cnxTO = 5000;
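
The comment on SEND_CAPACITY says the capacity is 1 "to prevent sending stale notifications". One way a capacity-one queue can achieve that is to drop the waiting message whenever a newer one arrives. The sketch below illustrates that idea only; LatestOnlyQueueSketch and addLatest are hypothetical names, and the check-then-add sequence is not atomic, so it assumes a single producer per queue.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical illustration of a bounded queue that keeps only the newest message.
final class LatestOnlyQueueSketch {
    static final int SEND_CAPACITY = 1; // same value as in the listing above

    /** Adds a message, discarding the oldest queued one first if the queue is full. */
    static <T> void addLatest(ArrayBlockingQueue<T> queue, T message) {
        if (queue.remainingCapacity() == 0) {
            queue.poll(); // drop the stale message that was never sent
        }
        queue.offer(message);
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(SEND_CAPACITY);
        addLatest(queue, "vote for server 1");
        addLatest(queue, "vote for server 3");
        System.out.println(queue.peek()); // the newer vote is the one left to send
    }
}
```

For leader election this trade-off is reasonable: a peer only needs the sender's latest vote, so an older unsent vote carries no useful information.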

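Message send queues: QuorumCnxManager

        The figure above shows that the QuorumCnxManager object in the server with myid 1 maintains a Map of send queues. Each key is the myid of a server that server 1 sends messages to, so server 1's own myid never appears as a key (the same holds on every other host). Each value is a queue holding the messages that server 1 has not managed to deliver to that server; a message that is sent successfully does not stay in the queue.

Whether the queues (the values of this Map) are empty tells us the state of the current server's connections to the ensemble:

  • All queues empty: the current server has no connection problems with the ensemble.
  • All queues non-empty: the current server has lost its connection to the ensemble.
  • One particular queue non-empty: the connection between the current server and the server for that queue has a problem.
  • At least one queue empty: the current server is still connected to the ensemble. (This is the condition the code actually checks.)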
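
The last condition in the list can be sketched as a scan over the map of queues. This is a minimal illustration assuming the map is keyed by server id and each value is that server's send queue; DeliveryCheckSketch and anyQueueEmpty are hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.Queue;

// Hypothetical check: is at least one peer's send queue empty?
final class DeliveryCheckSketch {
    /**
     * Returns true as soon as one send queue is found empty, meaning every
     * message for at least one peer has been delivered.
     */
    static boolean anyQueueEmpty(Map<Long, ? extends Queue<?>> sendQueues) {
        for (Queue<?> queue : sendQueues.values()) {
            if (queue.isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

Note that an empty map returns false here: with no peers known yet, nothing can be said to have been delivered.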

 

 
