Zookeeper Source Code Analysis (Startup + Election)

Introduction

  When it comes to ZooKeeper, the most common use today is as a service registry for service discovery. That is only one of its capabilities, though. The official Apache description reads: "The Apache ZooKeeper system for distributed coordination is a high-performance service for building distributed applications." In other words, ZooKeeper is a high-performance coordination service for building distributed applications. Thanks to its hierarchical node structure and watch mechanism, it can cover a large share of distributed coordination scenarios: configuration management, naming services, distributed locks, service discovery, publish/subscribe, and so on. In ZooKeeper these scenarios are essentially all built on "node change + notification". Distribution is fundamentally about communication, and the purpose of that communication is coordination.
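  To make the "node change + notification" idea concrete, here is a minimal client-side sketch of service registration and discovery using the standard ZooKeeper Java API. The connection string, paths and payloads are made-up example values, and the parent path /services/demo is assumed to already exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

public class DiscoveryDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string and session timeout, not taken from the article.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {});

        // A provider registers itself as an ephemeral sequential child node; the node
        // disappears automatically when the provider's session ends.
        // (Assumes /services/demo already exists.)
        String path = zk.create("/services/demo/instance-",
                "10.0.0.1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);

        // A consumer reads the children and leaves a watch; the watcher fires once
        // when the child list changes ("change + notification").
        List<String> instances = zk.getChildren("/services/demo", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                System.out.println("service list changed, re-fetch the children");
            }
        });
        System.out.println("registered at " + path + ", current instances: " + instances);
    }
}

  Because the registration node is ephemeral, a crashed provider drops out of the list on session expiry and every consumer watching the parent node is notified automatically.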

  ZooKeeper is written in Java (a C client API is also available). In terms of principles it can be viewed as an implementation in the spirit of Paxos: it has Leaders, Followers, Proposals and a leader election, but it also differs from Paxos in several ways (it uses the ZAB protocol). Paxos itself is a body of theory for consensus and is not trivial to digest, so this article does not go deep into the algorithm. It only walks through part of the ZooKeeper server source code, covering application startup and leader election; the client side is not covered.

Obtaining the Source Code

  The ZooKeeper source can be cloned from GitHub (https://github.com/apache/zookeeper);

  it can also be downloaded from the Apache ZooKeeper site: https://zookeeper.apache.org/releases.html.

  Before 3.5.5 ZooKeeper was built with Ant; starting with 3.5.5 it is built with Maven.


       This analysis is based on version 3.5.4.

Project Structure

  Directory structure:


    The src directory contains both the C and Java sources; the C API is ignored here. conf holds the configuration files used to start ZooKeeper, and bin holds the startup scripts (server/client).

  org.apache.jute contains ZooKeeper's wire-protocol and serialization components. Communication is TCP based, and jute provides the Record interface for serialization/deserialization together with the OutputArchive/InputArchive interfaces.
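  As a rough illustration of how jute is used, the sketch below hand-writes a Record in the style of the jute-generated classes. DemoRecord and its fields are invented for the example; only the Record, OutputArchive and InputArchive interfaces are real.

import org.apache.jute.InputArchive;
import org.apache.jute.OutputArchive;
import org.apache.jute.Record;
import java.io.IOException;

// Hypothetical record, written the way jute-generated classes look.
public class DemoRecord implements Record {
    private long sessionId;
    private String path;

    public void serialize(OutputArchive archive, String tag) throws IOException {
        archive.startRecord(this, tag);
        archive.writeLong(sessionId, "sessionId");
        archive.writeString(path, "path");
        archive.endRecord(this, tag);
    }

    public void deserialize(InputArchive archive, String tag) throws IOException {
        archive.startRecord(tag);
        sessionId = archive.readLong("sessionId");
        path = archive.readString("path");
        archive.endRecord(tag);
    }
}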

  org.apache.zookeeper contains the core ZooKeeper code, i.e. the core server logic.

Startup Flow

  When running ZooKeeper we start it with "./zkServer.sh start". Looking at the zkServer.sh script, the program entry point is "org.apache.zookeeper.server.quorum.QuorumPeerMain"; the script also passes the logging-related configuration to the JVM.

ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
#.......
nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
    "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
if [ $? -eq 0 ]
#.......

  The overall ZooKeeper startup flow is walked through below.


  QuorumPeerMain.main() takes at least one argument; usually exactly one, the path of the zoo.cfg file. The main method itself contains very little logic: it instantiates a QuorumPeerMain object and then calls main.initializeAndRun(args) to initialize and start the server.

public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    try {
        main.initializeAndRun(args);
    } catch (IllegalArgumentException e) {
        LOG.error("Invalid arguments, exiting abnormally", e);
        LOG.info(USAGE);
        System.err.println(USAGE);
        System.exit(2);
    } catch (ConfigException e) {
        LOG.error("Invalid config, exiting abnormally", e);
        System.err.println("Invalid config, exiting abnormally");
        System.exit(2);
    } catch (DatadirException e) {
        LOG.error("Unable to access datadir, exiting abnormally", e);
        System.err.println("Unable to access datadir, exiting abnormally");
        System.exit(3);
    } catch (AdminServerException e) {
        LOG.error("Unable to start AdminServer, exiting abnormally", e);
        System.err.println("Unable to start AdminServer, exiting abnormally");
        System.exit(4);
    } catch (Exception e) {
        LOG.error("Unexpected exception, exiting abnormally", e);
        System.exit(1);
    }
    LOG.info("Exiting normally");
    System.exit(0);
}

  initializeAndRun() instantiates a QuorumPeerConfig object and parses zoo.cfg via config.parse() (which internally uses parseProperties()); QuorumPeerConfig holds the configuration for the whole application. It then starts a DatadirCleanupManager, which schedules a Timer that purges and manages the data under the dataDir/dataLogDir directories.

  Finally the server is started. ZooKeeper can run in standalone or cluster mode, so there are two start paths: with standaloneEnabled=true in zoo.cfg (and no quorum configured) it starts standalone; if cluster nodes are configured via server.0, server.1, ... it starts in cluster mode.
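  For reference, a minimal cluster-mode zoo.cfg might look like the following (hostnames, ports and paths are illustrative values):

# Illustrative values only
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# standaloneEnabled=false forces quorum mode even with a single voting member
standaloneEnabled=false
# server.<myid>=<host>:<peer-port>:<election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

  With three voting members, config.isDistributed() is true and the cluster path in the code below is taken.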

protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException {
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    }

    // Start and schedule the purge task
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
            .getDataDir(), config.getDataLogDir(), config
            .getSnapRetainCount(), config.getPurgeInterval());
    purgeMgr.start();

    // With multiple nodes configured, isDistributed() returns
    // quorumVerifier != null && (!standaloneEnabled || quorumVerifier.getVotingMembers().size() > 1)
    if (args.length == 1 && config.isDistributed()) {
        // cluster mode
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        // standalone mode
        ZooKeeperServerMain.main(args);
    }
}

 Standalone Mode Startup

  When standaloneEnabled=true is set, or no cluster nodes (server.*) are configured, ZooKeeper starts in standalone mode. The entry point is the ZooKeeperServerMain class, which holds ServerCnxnFactory, ContainerManager and AdminServer objects:

public class ZooKeeperServerMain {
    /*.............*/
    // ZooKeeper server supports two kinds of connection: unencrypted and encrypted.
    private ServerCnxnFactory cnxnFactory;
    private ServerCnxnFactory secureCnxnFactory;
    private ContainerManager containerManager;
    private AdminServer adminServer;
    /*.............*/
}

  ServerCnxnFactory is a core ZooKeeper component: it implements the network IO and manages client connections. ZooKeeper ships two implementations, one based on plain JDK NIO and one based on Netty.

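  To switch from the default NIO implementation to the Netty one, the factory class can be selected through the zookeeper.serverCnxnFactory system property, for example as a JVM flag (the class name is the Netty factory shipped with ZooKeeper):

# Select the Netty-based connection factory instead of the default NIO one
-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory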

   The ContainerManager class maintains znode information against the ZKDatabase; it is the component that periodically checks container nodes and removes the ones that have lost all their children (more on this in the startup code below).

   AdminServer is an embedded Jetty server, listening on port 8080 by default, that exposes HTTP endpoints for querying ZooKeeper runtime information. It was introduced in 3.5.
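   The AdminServer can be tuned or disabled from the configuration; the two properties below are the documented ones (they can also be passed as zookeeper.admin.* system properties), and the values shown are just examples:

# AdminServer (Jetty) settings – example values
admin.enableServer=true
admin.serverPort=8080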

   ZooKeeperServerMain.main() follows the same pattern as QuorumPeerMain's: it instantiates the class, runs init, loads the configuration and then starts the server.

  Loading the configuration:

// Parse the standalone-mode configuration and start in standalone mode
protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException {
    try {
        // Register JMX beans.
        // JMX (Java Management Extensions) makes it easy to manage and monitor a running
        // Java program: threads, memory, log levels, restarts, system environment, etc.
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    // Create the server configuration object
    ServerConfig config = new ServerConfig();
    // A single argument is treated as the path of the configuration file
    if (args.length == 1) {
        // Parse the configuration file
        config.parse(args[0]);
    } else {
        // Multiple arguments: parse them directly
        config.parse(args);
    }
    // Run the server with this configuration
    runFromConfig(config);
}

  Server startup: runFromConfig() initializes a number of objects before the application starts:

  1.  Create the FileTxnSnapLog object, which manages the data under dataDir and dataLogDir.

  2.  Create the ZooKeeperServer object.

  3.  Create a CountDownLatch; after startup the main thread blocks on shutdownLatch.await(), which keeps the process alive and tracks the server's running state.

  4.  Create and start the AdminServer (Jetty).

  5.  Create the ServerCnxnFactory via cnxnFactory = ServerCnxnFactory.createFactory(); by default ZooKeeper uses NIOServerCnxnFactory for network IO.

  6.  Start the ServerCnxnFactory.

  7.  Create and start the ContainerManager.

  8.  The ZooKeeper application is now running.

public void runFromConfig(ServerConfig config)
        throws IOException, AdminServerException {
    LOG.info("Starting server");
    FileTxnSnapLog txnLog = null;
    try {
        // Note that this thread isn't going to be doing anything else,
        // so rather than spawning another thread, we will just call
        // run() in this thread.
        // create a file logger url from the command line args
        // Initialize the transaction log / snapshot manager
        txnLog = new FileTxnSnapLog(config.dataLogDir, config.dataDir);
        // Initialize the ZooKeeperServer object
        final ZooKeeperServer zkServer = new ZooKeeperServer(txnLog,
                config.tickTime, config.minSessionTimeout, config.maxSessionTimeout, null);

        // Shutdown hook used to learn about server errors or state changes
        final CountDownLatch shutdownLatch = new CountDownLatch(1);
        zkServer.registerServerShutdownHandler(
                new ZooKeeperServerShutdownHandler(shutdownLatch));

        // Start Admin server
        // Create the admin service that answers requests (a Jetty server)
        adminServer = AdminServerFactory.createAdminServer();
        // Attach the ZooKeeper server to it
        adminServer.setZooKeeperServer(zkServer);
        // AdminServer is a 3.5.0+ feature: a Jetty server, port 8080 by default,
        // exposing runtime information about ZooKeeper
        adminServer.start();

        boolean needStartZKServer = true;
        // --- start the ZooKeeperServer
        // Check whether clientPortAddress is configured
        if (config.getClientPortAddress() != null) {
            // ServerCnxnFactory is the component handling client/server connections.
            // Initialize the server-side IO object; the default is NIOServerCnxnFactory,
            // which handles network IO with plain Java NIO
            cnxnFactory = ServerCnxnFactory.createFactory();
            // Apply the configuration
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);
            // startup() starts the ServerCnxnFactory and the ZooKeeper server itself
            cnxnFactory.startup(zkServer);
            // zkServer has been started. So we don't need to start it again in secureCnxnFactory.
            needStartZKServer = false;
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), true);
            secureCnxnFactory.startup(zkServer, needStartZKServer);
        }

        // Periodic cleanup of container znodes.
        // Container znodes are a node type added in the 3.5 line: a container node is
        // deleted once it has no children (newly created containers excepted).
        // ContainerManager performs this check periodically.
        containerManager = new ContainerManager(zkServer.getZKDatabase(), zkServer.firstProcessor,
                Integer.getInteger("znode.container.checkIntervalMs", (int) TimeUnit.MINUTES.toMillis(1)),
                Integer.getInteger("znode.container.maxPerMinute", 10000)
        );
        containerManager.start();

        // Watch status of ZooKeeper server. It will do a graceful shutdown
        // if the server is not running or hits an internal error.
        // ZooKeeperServerShutdownHandler only counts the latch down when the server is
        // no longer running correctly, so await() blocks until then
        shutdownLatch.await();

        // Shut everything down
        shutdown();

        if (cnxnFactory != null) {
            cnxnFactory.join();
        }
        if (secureCnxnFactory != null) {
            secureCnxnFactory.join();
        }
        if (zkServer.canShutdown()) {
            zkServer.shutdown(true);
        }
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Server interrupted", e);
    } finally {
        if (txnLog != null) {
            txnLog.close();
        }
    }
}

  ServerCnxnFactory defaults to NIOServerCnxnFactory; the Netty implementation can be selected by setting the zookeeper.serverCnxnFactory system property:

static public ServerCnxnFactory createFactory() throws IOException {
    String serverCnxnFactoryName =
        System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (serverCnxnFactoryName == null) {
        // If no implementation class is specified, default to NIOServerCnxnFactory
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                .getDeclaredConstructor().newInstance();
        LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
        return serverCnxnFactory;
    } catch (Exception e) {
        IOException ioe = new IOException("Couldn't instantiate "
                + serverCnxnFactoryName);
        ioe.initCause(e);
        throw ioe;
    }
}

  cnxnFactory.startup(zkServer) starts the ServerCnxnFactory and, in the same call, the ZooKeeper server:

public void startup(ZooKeeperServer zks, boolean startServer)
        throws IOException, InterruptedException {
    // Start the related threads:
    //  - the NIO worker thread pool
    //  - the NIO selector threads
    //  - the acceptThread that accepts client connections
    start();
    setZooKeeperServer(zks);
    // Start the ZooKeeper server itself
    if (startServer) {
        // Load data into the ZKDatabase
        zks.startdata();
        // Start the session tracker, register JMX beans, set up the request processor chain
        zks.startup();
    }
}

  zks.startdata();

public void startdata() throws IOException, InterruptedException {
    // Initialize the ZKDatabase, the in-memory structure holding all data stored in ZK
    // check to see if zkDb is not null
    if (zkDb == null) {
        // Initialization also creates some built-in nodes, e.g. /zookeeper
        zkDb = new ZKDatabase(this.txnLogFactory);
    }
    // Load any data already persisted on disk
    if (!zkDb.isInitialized()) {
        loadData();
    }
}

  zks.startup();

public synchronized void startup() {
    // Create the session tracker
    if (sessionTracker == null) {
        createSessionTracker();
    }
    // Start the session tracker
    startSessionTracker();
    // Build the request processor chain
    setupRequestProcessors();
    // Register JMX beans
    registerJMX();
    setState(State.RUNNING);
    notifyAll();
}

  At this point the standalone ZooKeeper server is up and listening for clients.

 Cluster Mode Startup

  After the main program QuorumPeerMain loads the configuration, the QuorumPeerConfig object holds a QuorumVerifier that stores the other ZooKeeper server nodes: if zoo.cfg contains server.* entries, a QuorumVerifier instance is created. Note that AllMembers = VotingMembers + ObservingMembers.

public interface QuorumVerifier {
    long getWeight(long id);
    boolean containsQuorum(Set<Long> set);
    long getVersion();
    void setVersion(long ver);
    Map<Long, QuorumServer> getAllMembers();
    Map<Long, QuorumServer> getVotingMembers();
    Map<Long, QuorumServer> getObservingMembers();
    boolean equals(Object o);
    String toString();
}

  If quorumVerifier.getVotingMembers().size() > 1, the server starts in cluster mode: runFromConfig(QuorumPeerConfig config) is called, a ServerCnxnFactory is created, and a QuorumPeer object is initialized.

  A QuorumPeer represents one ZooKeeper node. QuorumPeer is a thread class, i.e. each ZooKeeper server runs as a QuorumPeer thread, and that thread is eventually started.

  runFromConfig() sets a series of properties on the QuorumPeer, including the election type, the server id and the node database, and finally starts the node with quorumPeer.start():

public void runFromConfig(QuorumPeerConfig config)
        throws IOException, AdminServerException {
    try {
        // Register JMX beans
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    LOG.info("Starting quorum peer");
    try {
        ServerCnxnFactory cnxnFactory = null;
        ServerCnxnFactory secureCnxnFactory = null;

        if (config.getClientPortAddress() != null) {
            cnxnFactory = ServerCnxnFactory.createFactory();
            // Configure the client port
            cnxnFactory.configure(config.getClientPortAddress(),
                    config.getMaxClientCnxns(),
                    false);
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            // Configure the secure client port
            secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                    config.getMaxClientCnxns(),
                    true);
        }

        // ------------ initialize the configuration of this zk node ----------------
        // Transaction log and snapshot handling
        quorumPeer = getQuorumPeer();
        quorumPeer.setTxnFactory(new FileTxnSnapLog(
                    config.getDataLogDir(),
                    config.getDataDir()));
        quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
        quorumPeer.enableLocalSessionsUpgrading(
            config.isLocalSessionsUpgradingEnabled());
        //quorumPeer.setQuorumPeers(config.getAllMembers());
        // Election type
        quorumPeer.setElectionType(config.getElectionAlg());
        // Server id
        quorumPeer.setMyid(config.getServerId());
        quorumPeer.setTickTime(config.getTickTime());
        quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
        quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
        quorumPeer.setInitLimit(config.getInitLimit());
        quorumPeer.setSyncLimit(config.getSyncLimit());
        quorumPeer.setConfigFileName(config.getConfigFilename());
        // Set the node database
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
        if (config.getLastSeenQuorumVerifier() != null) {
            quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
        }
        // Initialize the zk database
        quorumPeer.initConfigInZKDatabase();
        quorumPeer.setCnxnFactory(cnxnFactory);
        quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
        quorumPeer.setLearnerType(config.getPeerType());
        quorumPeer.setSyncEnabled(config.getSyncEnabled());
        quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

        // sets quorum sasl authentication configurations
        quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
        if (quorumPeer.isQuorumSaslAuthEnabled()) {
            quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
            quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
            quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
            quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
            quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
        }
        quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
        // ------------- end of node configuration ---------------

        quorumPeer.initialize();
        // Start the node
        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    }
}

  quorumPeer.start(): ZooKeeper first loads the data already on disk; if earlier state exists it is loaded into the in-memory database via loadDataBase(), which restores the ZKDatabase from the FileTxnSnapLog snapshots and transaction logs.

public synchronized void start() {
    // Sanity check: if our server id is not in the peer list, fail fast
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    // Load the zk database: restore previously persisted state
    loadDataBase();
    // Start the server connection factory
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // Prepare the election right after startup: set up the environment it needs,
    // e.g. start the related threads
    startLeaderElection();
    // Run the election logic (QuorumPeer.run())
    super.start();
}

  After the data is loaded, the flow is the same as in standalone startup: ServerCnxnFactory.start() starts the NIOServerCnxnFactory and the ZooKeeper service, and the AdminServer is started as well.

  Unlike standalone mode, a cluster runs an election immediately after startup, choosing one leader among all configured ZooKeeper servers: startLeaderElection().

Election

  A ZooKeeper ensemble has three roles, Leader, Follower and Observer, each with its own responsibilities. When the Leader fails, the Followers elect a new one.

  The Leader is the primary node of the cluster; a cluster has exactly one Leader. The Leader handles ZooKeeper's transactional operations, i.e. the operations that change ZooKeeper's data or state.

  Followers handle client read requests and take part in elections. They also vote on the transaction proposals issued by the Leader.

  Observers exist to raise the read throughput of the cluster: they serve read requests but, unlike Followers, they neither vote in Leader elections nor vote on the Leader's proposals.

  Where there are roles there are elections, and where there are elections there are strategies. ZooKeeper ships three election implementations: LeaderElection, AuthFastLeaderElection and FastLeaderElection. FastLeaderElection is the current default; the first two are marked @Deprecated.

  ZooKeeper node information

  serverId: the id of the server node, i.e. the myid stored under dataDir and the id used in the server.* entries (0, 1, 2, 3, 4, ...). It does not change after startup.

  zxid: the data state id, incremented every time ZooKeeper's state changes; it can be read as a globally ordered id, and a larger zxid means newer data. A zxid is a 64-bit number: the high 32 bits are the epoch and the low 32 bits are an incrementing counter (see the small sketch after this list).

  epoch: the election clock, i.e. the election round; it is incremented by one for each election.

  ServerState: the role state of the node, one of LOOKING, FOLLOWING, LEADING and OBSERVING, corresponding to the roles above; a node taking part in an election is in LOOKING state.

  Every ballot (Vote) carries this node information.
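  As a small aside, the two halves of a zxid can be pulled apart with plain bit operations, as in the sketch below (the sample value is made up; ZooKeeper itself also has a ZxidUtils helper for this):

public class ZxidDemo {
    public static void main(String[] args) {
        long zxid = 0x0000000200000005L;    // made-up example: epoch 2, counter 5
        long epoch = zxid >>> 32;           // high 32 bits: election epoch
        long counter = zxid & 0xffffffffL;  // low 32 bits: incrementing counter
        System.out.println("epoch=" + epoch + ", counter=" + counter);
    }
}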

  Right after startup ZooKeeper begins the election: each node keeps sending its ballot to the other nodes and receiving ballots from them. Eventually every node holds the same pool of ballots and applies the same comparison algorithm to determine the Leader. If the current node wins, it notifies the other nodes and leads them as Followers; otherwise it connects to the elected Leader.

  ZooKeeper defines the Election interface; lookForLeader() is the election operation.

public interface Election {
    public Vote lookForLeader() throws InterruptedException;
    public void shutdown();
}

  In the cluster startup flow above, startLeaderElection() is finally called to kick off the election. It selects the election algorithm and casts a first ballot for itself (vote for yourself, young one!); a Vote contains the voting node's id, its current zxid and its epoch. ZooKeeper uses the FastLeaderElection algorithm by default. Finally the QuorumPeer thread is started and voting begins.

synchronized public void startLeaderElection() {
    try {
        // Every node starts in LOOKING state, so each creates a ballot voting for itself as Leader
        if (getPeerState() == ServerState.LOOKING) {
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }

    // if (!getView().containsKey(myid)) {
    //     throw new RuntimeException("My id " + myid + " not in the peer list");
    // }
    if (electionType == 0) {
        try {
            udpSocket = new DatagramSocket(myQuorumAddr.getPort());
            responder = new ResponderThread();
            responder.start();
        } catch (SocketException e) {
            throw new RuntimeException(e);
        }
    }
    // Create the election algorithm; electionType defaults to 3 (FastLeaderElection)
    this.electionAlg = createElectionAlgorithm(electionType);
}

  FastLeaderElection defines three inner classes, Notification, ToSend and Messenger; Messenger in turn defines WorkerReceiver and WorkerSender.


  Notification represents a received ballot (the election information sent by other servers); it contains the proposed leader's id, zxid, election epoch and so on.

  ToSend represents a ballot to be sent to other servers; it likewise carries the proposed leader's id, zxid, election epoch, etc.

  Messenger is the message-handling class used to send and receive ballots; it contains the WorkerReceiver and WorkerSender thread classes.

  The FastLeaderElection class:

public class FastLeaderElection implements Election {
    //..........
    /**
     * Connection manager. Fast leader election uses TCP for
     * communication between peers, and QuorumCnxManager manages
     * such connections.
     */
    QuorumCnxManager manager;

    /*
        Notification represents a received ballot (sent by another server);
        it contains the proposed leader's id, zxid, election epoch, etc.
        Its buildMsg method packs the election information into a ByteBuffer for sending.
     */
    static public class Notification {
        //..........
    }

    /**
     * Messages that a peer wants to send to other peers.
     * These messages can be both Notifications and Acks
     * of reception of notification.
     */
    /*
        ToSend represents a ballot sent to other servers; it also carries
        the proposed leader's id, zxid, election epoch, etc.
     */
    static public class ToSend {
        //..........
    }

    LinkedBlockingQueue<ToSend> sendqueue;
    LinkedBlockingQueue<Notification> recvqueue;

    /**
     * Multi-threaded implementation of message handler. Messenger
     * implements two sub-classes: WorkReceiver and  WorkSender. The
     * functionality of each is obvious from the name. Each of these
     * spawns a new thread.
     */
    protected class Messenger {
        /**
         * Receives messages from instance of QuorumCnxManager on
         * method run(), and processes such messages.
         */
        class WorkerReceiver extends ZooKeeperThread {
            //..........
        }

        /**
         * This worker simply dequeues a message to send and
         * and queues it on the manager's queue.
         */
        class WorkerSender extends ZooKeeperThread {
            //..........
        }

        WorkerSender ws;
        WorkerReceiver wr;
        Thread wsThread = null;
        Thread wrThread = null;
    }
    //..........
    QuorumPeer self;
    Messenger messenger;
    AtomicLong logicalclock = new AtomicLong(); /* Election instance */
    long proposedLeader;
    long proposedZxid;
    long proposedEpoch;
    //..........
}

  Once the QuorumPeer thread starts, it monitors the ServerState: whenever the node is in LOOKING state it runs the election. Servers start in LOOKING state, so the election runs right after startup. Voting is performed by calling makeLEStrategy().lookForLeader(), i.e. FastLeaderElection.lookForLeader().

  QuorumPeer.run():

public void run() {
    updateThreadName();
    //..........
    try {
        /*
         * Main loop
         */
        while (running) {
            switch (getPeerState()) {
            case LOOKING:
                LOG.info("LOOKING");
                if (Boolean.getBoolean("readonlymode.enabled")) {
                    final ReadOnlyZooKeeperServer roZk =
                        new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
                    Thread roZkMgr = new Thread() {
                        public void run() {
                            try {
                                // lower-bound grace period to 2 secs
                                sleep(Math.max(2000, tickTime));
                                if (ServerState.LOOKING.equals(getPeerState())) {
                                    roZk.startup();
                                }
                            } catch (InterruptedException e) {
                                LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                            } catch (Exception e) {
                                LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                            }
                        }
                    };
                    try {
                        roZkMgr.start();
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    } finally {
                        roZkMgr.interrupt();
                        roZk.shutdown();
                    }
                } else {
                    try {
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            case OBSERVING:
                try {
                    LOG.info("OBSERVING");
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    observer.shutdown();
                    setObserver(null);
                    updateServerState();
                }
                break;
            case FOLLOWING:
                try {
                    LOG.info("FOLLOWING");
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    follower.shutdown();
                    setFollower(null);
                    updateServerState();
                }
                break;
            case LEADING:
                LOG.info("LEADING");
                try {
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    if (leader != null) {
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    updateServerState();
                }
                break;
            }
            start_fle = Time.currentElapsedTime();
        }
    } finally {
        LOG.warn("QuorumPeer main thread exited");
        MBeanRegistry instance = MBeanRegistry.getInstance();
        instance.unregister(jmxQuorumBean);
        instance.unregister(jmxLocalPeerBean);
        for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
            instance.unregister(remotePeerBean);
        }
        jmxQuorumBean = null;
        jmxLocalPeerBean = null;
        jmxRemotePeerBean = null;
    }
}

  FastLeaderElection.lookForLeader():

public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
        self.start_fle = Time.currentElapsedTime();
    }
    try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

        // finalizeWait is a 200 ms wait
        int notTimeout = finalizeWait;

        synchronized (this) {
            // Increment the logical clock (election round)
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id =  " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        // Send our ballot
        sendNotifications();

        /*
         * Loop in which we exchange notifications until we find a leader
         */
        // Keep looping while we are in LOOKING state
        while ((self.getPeerState() == ServerState.LOOKING) &&
                (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            // Fetch a ballot sent by another node
            Notification n = recvqueue.poll(notTimeout,
                    TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            // No ballot received
            if (n == null) {
                // Are we still connected to the ensemble?
                if (manager.haveDelivered()) {
                    // Still connected: resend our ballot
                    sendNotifications();
                } else {
                    // Disconnected: reconnect
                    manager.connectAll();
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            // A ballot was received: process it
            else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                // Switch on the sender's ServerState
                switch (n.state) {
                case LOOKING:
                    // The sender is also LOOKING, so the election is in progress: process the ballot.
                    // If notification > current, replace and send messages out
                    // If our election epoch is behind the sender's, our round is stale: update it
                    if (n.electionEpoch > logicalclock.get()) {
                        // Update the election round
                        logicalclock.set(n.electionEpoch);
                        // Discard the ballots received so far
                        recvset.clear();
                        // Compare ballots: if ours loses against the received one, adopt the
                        // received proposal, otherwise keep proposing ourselves
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            // Adopt the received proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            // Keep our own proposal
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        // Send the (possibly updated) ballot
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        // The sender's epoch is behind ours: drop the ballot
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        // Same epoch (the normal case: all nodes are in the same round).
                        // If our proposal loses against the received one, adopt it and resend
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

                    // Archive the ballot in the pool of valid votes received
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // Tally the votes and check whether the election can end
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {
                        // A leader has been determined
                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        // If the winner is this node, become LEADING; otherwise FOLLOWING/OBSERVING
                        if (n == null) {
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid, proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    // The sender already knows the Leader.
                    // If the ballot is from our own election round, add it to the pool,
                    // then check whether a quorum agrees on that leader and run checkLeader:
                    //  - the election can be concluded, AND
                    //  - (the proposed leader is this node with a matching round) OR
                    //    (the proposed leader is another node present in outofelection whose
                    //     ServerState is already LEADING)
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        if (termPredicate(recvset, new Vote(n.leader,
                                        n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                        && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     * Only peer epoch is used to check that the votes come
                     * from the same ensemble. This is because there is at
                     * least one corner case in which the ensemble can be
                     * created with inconsistent zxid and election epoch
                     * info. However, given that only one ensemble can be
                     * running at a single point in time and that each
                     * epoch is used only once, using only the epoch to
                     * compare the votes is sufficient.
                     *
                     * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                     */
                    outofelection.put(n.sid, new Vote(n.leader,
                            IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                    // At this point another election round in the cluster already has a result.
                    // Tally against the outofelection pool; if the election can be concluded and
                    // the proposed leader passes checkLeader, adopt it and update our state
                    if (termPredicate(outofelection, new Vote(n.leader,
                            IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                            && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state
                            + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(
                        self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}",
                manager.getConnectionThreadCount());
    }
}

  lookForLeader() keeps comparing the current proposal with the ballots it receives and updating it until a leader is determined. The ballot comparison happens in totalOrderPredicate(), and the rules are:

  1.  Compare the epoch (election round) first and prefer the node with the larger epoch. If the received epoch is larger, our own round is stale and we adopt the newer epoch; if it is smaller, the received ballot is discarded.

  2.  Within the same round, prefer the node with the larger zxid, since a larger zxid means newer data.

  3.  With the same round and the same zxid, prefer the node with the larger serverId.

  In short: bigger wins!
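  Here is a tiny worked example of these three rules with made-up ballots, before looking at the real implementation:

public class BallotCompareDemo {
    // Same ordering as totalOrderPredicate: epoch first, then zxid, then server id.
    static boolean wins(long newEpoch, long newZxid, long newId,
                        long curEpoch, long curZxid, long curId) {
        return (newEpoch > curEpoch)
                || (newEpoch == curEpoch
                    && (newZxid > curZxid || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // Three made-up ballots, all in epoch 1:
        // server 1: zxid 0x100000005, server 2: zxid 0x100000007, server 3: zxid 0x100000005
        System.out.println(wins(1, 0x100000007L, 2, 1, 0x100000005L, 1)); // true: server 2 beats server 1 (newer zxid)
        System.out.println(wins(1, 0x100000005L, 3, 1, 0x100000007L, 2)); // false: server 3 loses, its zxid is older
        System.out.println(wins(1, 0x100000005L, 3, 1, 0x100000005L, 1)); // true: same epoch and zxid, larger id wins
    }
}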

  totalOrderPredicate():

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}

 Election Flow


  The whole election can be roughly understood as continuously receiving and comparing ballots until a leader is chosen. Every ZooKeeper node keeps its own ballot pool and applies the same comparison algorithm, so under normal circumstances all nodes arrive at the same leader.

  end;

  This article covers only part of the ZooKeeper source: startup and leader election. The core transaction processing and the ZAB consistency protocol will be covered in a follow-up. If anything here is wrong or imprecise, feel free to point it out in the comments.

  Zookeeper github:https://github.com/apache/zookeeper/

  Apache zk:https://zookeeper.apache.org/releases.html

  Some of the source-code comments are adapted from: 拉钩-子幕

