Introduction
ZooKeeper is widely deployed today as a service registry for service discovery, but that is only one of its capabilities. Apache's own overview states: "The Apache ZooKeeper system for distributed coordination is a high-performance service for building distributed applications." In other words, ZooKeeper is a high-performance coordination service for building distributed applications. Thanks to its internal node (znode) structure and watch mechanism, it covers a large share of distributed coordination scenarios: configuration management, naming services, distributed locks, service discovery, publish/subscribe, and so on. In ZooKeeper these are all essentially built on the "change + notification" pattern over znodes. After all, the crux of a distributed system is communication, and the purpose of communication is coordination.
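To make the "change + notification" pattern concrete, here is a minimal sketch using the standard ZooKeeper Java client; the connection string and znode path are placeholders, not taken from this article:

import org.apache.zookeeper.*;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        // Connection string is a placeholder; point it at your ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // Register a watch on a znode. Watches are one-shot: a data change
        // fires a single NodeDataChanged notification, which is the building
        // block for config management, service discovery, etc.
        zk.exists("/config/app", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                System.out.println("znode changed: " + event.getPath());
            }
        });

        // Keep the demo process alive to receive the notification.
        Thread.sleep(Long.MAX_VALUE);
    }
}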
ZooKeeper is written in Java (a C client API also exists). Its protocol can be regarded as an implementation in the spirit of Paxos, with concepts such as Leader, Follower, Proposal, and elections, but it differs from Paxos in several respects (the ZAB protocol). Paxos itself is, in my view, a body of theory for a whole class of solutions, and it takes some effort to digest. This article therefore does not go deep into the algorithm; it only walks through part of the server-side source code, covering application startup and leader election, and leaves the client out.
Getting the Source
The ZooKeeper source can be cloned from GitHub (https://github.com/apache/zookeeper);
it can also be downloaded from the Apache ZooKeeper site: https://zookeeper.apache.org/releases.html.
Before 3.5.5 ZooKeeper was built with Ant; from 3.5.5 on it is built with Maven.
This walkthrough is based on version 3.5.4.
Project Structure
Directory layout:
src contains both the C and Java sources; the C API is ignored here. conf holds the configuration files ZooKeeper is started with. bin contains the startup scripts (server/client).
org.apache.jute is ZooKeeper's communication-protocol and serialization component. The protocol runs over TCP, and jute provides the Record interface for serialization and deserialization, together with the OutputArchive/InputArchive interfaces.
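As a quick illustration of the jute API, the sketch below round-trips a Stat record (which implements Record) through a BinaryOutputArchive/BinaryInputArchive pair; the choice of Stat and byte-array streams is just for demonstration:

import java.io.*;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;
import org.apache.zookeeper.data.Stat;

public class JuteDemo {
    public static void main(String[] args) throws IOException {
        // Stat implements org.apache.jute.Record.
        Stat stat = new Stat();
        stat.setVersion(1);

        // Serialize the record into a byte array.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        BinaryOutputArchive oa = BinaryOutputArchive.getArchive(bos);
        stat.serialize(oa, "stat");

        // Deserialize it back.
        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
        BinaryInputArchive ia = BinaryInputArchive.getArchive(bis);
        Stat copy = new Stat();
        copy.deserialize(ia, "stat");
        System.out.println("version = " + copy.getVersion());
    }
}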
org.apache.zookeeper holds ZooKeeper's core code, i.e., the core business logic.
Startup Flow
When running ZooKeeper we start the program with "./zkServer.sh start". Following the zkServer.sh script, the program's entry point turns out to be org.apache.zookeeper.server.quorum.QuorumPeerMain, and the script also passes the logging configuration to the JVM.
ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
# .......
nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
    "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
if [ $? -eq 0 ]
# .......
ZooKeeper startup flow:
QuorumPeerMain.main() accepts at least one argument; usually there is exactly one, the path to zoo.cfg. The main method contains very little business logic: it instantiates a QuorumPeerMain object and then calls main.initializeAndRun(args) to do the initialization.
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    try {
        main.initializeAndRun(args);
    } catch (IllegalArgumentException e) {
        LOG.error("Invalid arguments, exiting abnormally", e);
        LOG.info(USAGE);
        System.err.println(USAGE);
        System.exit(2);
    } catch (ConfigException e) {
        LOG.error("Invalid config, exiting abnormally", e);
        System.err.println("Invalid config, exiting abnormally");
        System.exit(2);
    } catch (DatadirException e) {
        LOG.error("Unable to access datadir, exiting abnormally", e);
        System.err.println("Unable to access datadir, exiting abnormally");
        System.exit(3);
    } catch (AdminServerException e) {
        LOG.error("Unable to start AdminServer, exiting abnormally", e);
        System.err.println("Unable to start AdminServer, exiting abnormally");
        System.exit(4);
    } catch (Exception e) {
        LOG.error("Unexpected exception, exiting abnormally", e);
        System.exit(1);
    }
    LOG.info("Exiting normally");
    System.exit(0);
}
initializeAndRun instantiates a QuorumPeerConfig object and parses the settings in zoo.cfg via parseProperties(); QuorumPeerConfig holds the configuration for the entire application. It then starts a DatadirCleanupManager, which schedules a Timer that purges and manages old data under the dataDir directories.
Finally the program starts. ZooKeeper runs either standalone or as a cluster, so there are two startup paths: with standaloneEnabled=true in zoo.cfg it runs in standalone mode; if cluster nodes server.0, server.1, ... are configured, it runs in cluster mode.
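For reference, a zoo.cfg along the following lines takes the cluster path (all values are illustrative, not from this article):

# zoo.cfg (illustrative values)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181

# autopurge settings consumed by DatadirCleanupManager
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

# with more than one voting member, config.isDistributed() is true
server.0=zk0:2888:3888
server.1=zk1:2888:3888
server.2=zk2:2888:3888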
protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    }

    // Start and schedule the purge task
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
            .getDataDir(), config.getDataLogDir(), config
            .getSnapRetainCount(), config.getPurgeInterval());
    purgeMgr.start();

    // When multiple nodes are configured, isDistributed() returns:
    // quorumVerifier != null && (!standaloneEnabled || quorumVerifier.getVotingMembers().size() > 1)
    if (args.length == 1 && config.isDistributed()) {
        // cluster mode
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        // standalone mode
        ZooKeeperServerMain.main(args);
    }
}
Standalone Startup
When standaloneEnabled=true is set, or no cluster nodes (server.*) are configured, ZooKeeper starts in standalone mode. The standalone entry point is the ZooKeeperServerMain class, which holds the ServerCnxnFactory, ContainerManager, and AdminServer objects;
public class ZooKeeperServerMain {
    /*.............*/
    // ZooKeeper server supports two kinds of connection: unencrypted and encrypted.
    private ServerCnxnFactory cnxnFactory;
    private ServerCnxnFactory secureCnxnFactory;
    private ContainerManager containerManager;
    private AdminServer adminServer;
    /*.............*/
}
ServerCnxnFactory is a core ZooKeeper component: it implements the network IO and manages client connections. ZooKeeper ships two implementations, one based on JDK NIO and one based on Netty.
ContainerManager maintains znode information in ZooKeeper and manages the ZKDatabase;
AdminServer is an embedded Jetty server, on port 8080 by default, that exposes HTTP endpoints for querying ZooKeeper's runtime information. The feature exists since version 3.5.
ZooKeeperServerMain's main method mirrors QuorumPeerMain's: it first instantiates its own object, then initializes, loads the configuration, and starts the server.
Loading the configuration:
// Parse the standalone configuration and start in standalone mode
protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    try {
        // Register JMX beans.
        // JMX (Java Management Extensions) is a mechanism for managing and
        // monitoring running Java programs: threads, memory, log levels,
        // service restarts, system environment, etc.
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    // create the server configuration object
    ServerConfig config = new ServerConfig();
    // a single argument is treated as the path to the config file
    if (args.length == 1) {
        // parse the config file
        config.parse(args[0]);
    } else {
        // multiple arguments: parse them directly
        config.parse(args);
    }

    // run the server from the parsed config
    runFromConfig(config);
}
Starting the service: runFromConfig() initializes a number of objects before the server comes up:
1. Initialize the FileTxnSnapLog object, which manages the data under dataDir and dataLogDir.
2. Initialize the ZooKeeperServer object;
3. Instantiate a CountDownLatch; after startup the main thread blocks on shutdownLatch.await(), which parks the main program while watching ZooKeeper's running state.
4. Create and start the AdminServer (Jetty).
5. Create the ServerCnxnFactory object: cnxnFactory = ServerCnxnFactory.createFactory(); by default ZooKeeper uses NIOServerCnxnFactory for its network IO.
6. Start the ServerCnxnFactory service.
7. Create and start the ContainerManager;
8. The ZooKeeper application is up.
public void runFromConfig(ServerConfig config) throws IOException, AdminServerException {
    LOG.info("Starting server");
    FileTxnSnapLog txnLog = null;
    try {
        // Note that this thread isn't going to be doing anything else,
        // so rather than spawning another thread, we will just call
        // run() in this thread.
        // create a file logger url from the command line args
        // initialize the transaction log / snapshot files
        txnLog = new FileTxnSnapLog(config.dataLogDir, config.dataDir);
        // initialize the ZooKeeperServer object
        final ZooKeeperServer zkServer = new ZooKeeperServer(txnLog,
                config.tickTime, config.minSessionTimeout, config.maxSessionTimeout, null);

        // Shutdown hook: notified of server errors or state changes.
        final CountDownLatch shutdownLatch = new CountDownLatch(1);
        zkServer.registerServerShutdownHandler(
                new ZooKeeperServerShutdownHandler(shutdownLatch));

        // Start Admin server
        // create the admin server, which receives requests (an embedded Jetty server)
        adminServer = AdminServerFactory.createAdminServer();
        // attach the ZooKeeper server to it
        adminServer.setZooKeeperServer(zkServer);
        // AdminServer is a 3.5.0+ feature: a Jetty server on port 8080 by default,
        // exposing runtime information about ZooKeeper
        adminServer.start();

        boolean needStartZKServer = true;
        // --- start the ZooKeeperServer
        // check whether clientPortAddress is configured
        if (config.getClientPortAddress() != null) {
            // ServerCnxnFactory is the component that handles client/server connections.
            // The default implementation is NIOServerCnxnFactory (plain Java NIO).
            cnxnFactory = ServerCnxnFactory.createFactory();
            // apply the connection configuration
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);
            // startup() starts both the ServerCnxnFactory and the ZooKeeperServer
            cnxnFactory.startup(zkServer);
            // zkServer has been started. So we don't need to start it again in secureCnxnFactory.
            needStartZKServer = false;
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), true);
            secureCnxnFactory.startup(zkServer, needStartZKServer);
        }

        // Periodically clean up container znodes.
        // Container znodes are a node type added after 3.6: a container node is
        // deleted once it has no children (newly created containers excepted).
        // ContainerManager performs this periodic check-and-clean work.
        containerManager = new ContainerManager(zkServer.getZKDatabase(), zkServer.firstProcessor,
                Integer.getInteger("znode.container.checkIntervalMs", (int) TimeUnit.MINUTES.toMillis(1)),
                Integer.getInteger("znode.container.maxPerMinute", 10000)
        );
        containerManager.start();

        // Watch status of ZooKeeper server. It will do a graceful shutdown
        // if the server is not running or hits an internal error.
        // ZooKeeperServerShutdownHandler only releases this latch when the
        // server is no longer running normally.
        shutdownLatch.await();

        // shut everything down
        shutdown();

        if (cnxnFactory != null) {
            cnxnFactory.join();
        }
        if (secureCnxnFactory != null) {
            secureCnxnFactory.join();
        }
        if (zkServer.canShutdown()) {
            zkServer.shutdown(true);
        }
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Server interrupted", e);
    } finally {
        if (txnLog != null) {
            txnLog.close();
        }
    }
}
ZooKeeper's ServerCnxnFactory defaults to NIOServerCnxnFactory; the Netty implementation can be selected instead by setting the system property zookeeper.serverCnxnFactory.
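For example, assuming the stock Netty implementation class, the switch is made with a JVM flag (typically appended to the server's JVMFLAGS):

-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory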
static public ServerCnxnFactory createFactory() throws IOException {
    String serverCnxnFactoryName = System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (serverCnxnFactoryName == null) {
        // fall back to NIOServerCnxnFactory when no implementation is specified
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                .getDeclaredConstructor().newInstance();
        LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
        return serverCnxnFactory;
    } catch (Exception e) {
        IOException ioe = new IOException("Couldn't instantiate " + serverCnxnFactoryName);
        ioe.initCause(e);
        throw ioe;
    }
}
cnxnFactory.startup(zkServer); starts the ServerCnxnFactory and, along with it, the ZooKeeper service:
public void startup(ZooKeeperServer zks, boolean startServer) throws IOException, InterruptedException {
    // start the related threads:
    // - the NIOWorker thread pool
    // - the NIO selector thread(s)
    // - the acceptThread that accepts client connections
    start();
    setZooKeeperServer(zks);
    // start the ZooKeeper server itself
    if (startServer) {
        // load data into the ZKDatabase
        zks.startdata();
        // start the session tracker, register JMX beans, set up the request processor chain
        zks.startup();
    }
}
zks.startdata();
public void startdata() throws IOException, InterruptedException {
    // Initialize the ZKDatabase, the in-memory structure holding all data stored in ZK
    // check to see if zkDb is not null
    if (zkDb == null) {
        // seed the initial data, e.g. the /zookeeper node
        zkDb = new ZKDatabase(this.txnLogFactory);
    }
    // load any data already persisted on disk
    if (!zkDb.isInitialized()) {
        loadData();
    }
}
zks.startup();
public synchronized void startup() {
    // create the session tracker
    if (sessionTracker == null) {
        createSessionTracker();
    }
    // start the session tracker
    startSessionTracker();
    // set up the request processor chain
    setupRequestProcessors();
    // register JMX beans
    registerJMX();
    setState(State.RUNNING);
    notifyAll();
}
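For reference, setupRequestProcessors() in standalone mode wires a three-stage pipeline. The sketch below paraphrases the 3.5.x code from memory, so treat it as approximate:

protected void setupRequestProcessors() {
    // requests flow Prep -> Sync -> Final
    RequestProcessor finalProcessor = new FinalRequestProcessor(this);
    RequestProcessor syncProcessor = new SyncRequestProcessor(this, finalProcessor);
    ((SyncRequestProcessor) syncProcessor).start();
    firstProcessor = new PrepRequestProcessor(this, syncProcessor);
    ((PrepRequestProcessor) firstProcessor).start();
}

PrepRequestProcessor validates and prepares each request, SyncRequestProcessor persists it to the transaction log, and FinalRequestProcessor applies it to the ZKDatabase and builds the response.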
The ZooKeeper service is now up and listening.
Cluster Startup
After the main program QuorumPeerMain loads the configuration file, the QuorumPeerConfig container holds a QuorumVerifier object storing the other ZooKeeper server nodes; whenever server.* entries appear in zoo.cfg, a QuorumVerifier instance is created. Note that AllMembers = VotingMembers + ObservingMembers.
public interface QuorumVerifier {
    long getWeight(long id);
    boolean containsQuorum(Set<Long> set);
    long getVersion();
    void setVersion(long ver);
    Map<Long, QuorumServer> getAllMembers();
    Map<Long, QuorumServer> getVotingMembers();
    Map<Long, QuorumServer> getObservingMembers();
    boolean equals(Object o);
    String toString();
}
If quorumVerifier.getVotingMembers().size() > 1, cluster mode is used: runFromConfig(QuorumPeerConfig config) is invoked, a ServerCnxnFactory object is created, and a QuorumPeer object is initialized.
A QuorumPeer represents one ZooKeeper node. QuorumPeer is a thread class standing for a ZooKeeper server thread, and this thread is eventually started.
runFromConfig sets a series of properties, including the election type, server id, and the node database, and finally starts the node via quorumPeer.start();
public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
    try {
        // register JMX beans
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    LOG.info("Starting quorum peer");
    try {
        ServerCnxnFactory cnxnFactory = null;
        ServerCnxnFactory secureCnxnFactory = null;

        if (config.getClientPortAddress() != null) {
            cnxnFactory = ServerCnxnFactory.createFactory();
            // configure the client port
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            // configure the secure client port
            secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), true);
        }

        // ---------- initialize this zk node's configuration ----------
        // transaction log and snapshot handling
        quorumPeer = getQuorumPeer();
        quorumPeer.setTxnFactory(new FileTxnSnapLog(
                config.getDataLogDir(), config.getDataDir()));
        quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
        quorumPeer.enableLocalSessionsUpgrading(
                config.isLocalSessionsUpgradingEnabled());
        //quorumPeer.setQuorumPeers(config.getAllMembers());
        // election type
        quorumPeer.setElectionType(config.getElectionAlg());
        // server id
        quorumPeer.setMyid(config.getServerId());
        quorumPeer.setTickTime(config.getTickTime());
        quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
        quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
        quorumPeer.setInitLimit(config.getInitLimit());
        quorumPeer.setSyncLimit(config.getSyncLimit());
        quorumPeer.setConfigFileName(config.getConfigFilename());
        // the node database
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
        if (config.getLastSeenQuorumVerifier() != null) {
            quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
        }
        // initialize the zk database
        quorumPeer.initConfigInZKDatabase();
        quorumPeer.setCnxnFactory(cnxnFactory);
        quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
        quorumPeer.setLearnerType(config.getPeerType());
        quorumPeer.setSyncEnabled(config.getSyncEnabled());
        quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

        // sets quorum sasl authentication configurations
        quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
        if (quorumPeer.isQuorumSaslAuthEnabled()) {
            quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
            quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
            quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
            quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
            quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
        }
        quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
        // ---------- end of this zk node's configuration ----------
        quorumPeer.initialize();

        // start the node
        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    }
}
quorumPeer.start(); first loads the local on-disk data: any previously persisted ZooKeeper state is loaded into the in-memory database via loadDataBase(), which replays the snapshots and logs managed by FileTxnSnapLog.
public synchronized void start() {
    // sanity check: if our server id is not in the peer list, abort
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    // load the zk database: restore previously persisted state
    loadDataBase();
    // start the server connection factory
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // prepare the election right after startup: set up the environment it
    // needs, e.g. start the related threads
    startLeaderElection();
    // run the election logic (the QuorumPeer thread's run())
    super.start();
}
After the data is loaded, startup proceeds as in standalone mode: ServerCnxnFactory.start() brings up the NIOServerCnxnFactory service and the ZooKeeper service, and finally the AdminServer is started.
Unlike standalone mode, a cluster starts an election immediately afterwards, choosing a leader among all of the configured ZooKeeper servers: startLeaderElection();
Leader Election
ZooKeeper defines three roles, Leader, Follower, and Observer, each with its own responsibilities. When the Leader fails, the Followers elect a new one.
The Leader is the cluster's primary node; a cluster has exactly one. It handles ZooKeeper's transactional operations, i.e., those that change ZooKeeper's data or state.
Followers serve client read requests and take part in elections. They also process the transaction-commit requests, i.e., proposals, issued by the Leader.
Observers exist to raise the cluster's read throughput by serving read requests. Unlike Followers, Observers neither vote in leader elections nor vote on the Leader's proposals.
Where there are roles there are elections, and elections need a strategy. ZooKeeper ships three election implementations: LeaderElection, AuthFastLeaderElection, and FastLeaderElection. FastLeaderElection is the current default; the first two are marked @Deprecated.
ZooKeeper node state
serverId: the node id, i.e., the myid configured under dataDir and the id in the server.* entries (0, 1, 2, 3, 4, ...). It does not change after startup.
zxid: the data state id. ZooKeeper increments it on every state change, so it acts as a globally ordered id: the larger the zxid, the newer the data. A zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are an incrementing counter (see the sketch after this list).
epoch: the election clock, i.e., the election round; it is incremented by one for every election.
ServerState: the node's role state, one of LOOKING, FOLLOWING, LEADING, and OBSERVING, each corresponding to a role. A node that is in the middle of an election is in the LOOKING state.
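As a quick illustration of the zxid layout, here is a minimal standalone sketch (ZooKeeper itself has a ZxidUtils helper with equivalent shift logic):

public class ZxidDemo {
    // high 32 bits: epoch; low 32 bits: counter
    static long epochOf(long zxid)   { return zxid >> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long zxid = (5L << 32) | 42L;  // epoch 5, counter 42
        System.out.println("epoch   = " + epochOf(zxid));   // 5
        System.out.println("counter = " + counterOf(zxid)); // 42
    }
}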
Every ballot (a Vote) carries this node information.
Right after startup, ZooKeeper begins electing: it keeps sending its ballot to the other nodes and receiving theirs in return. Eventually every participant holds the same ballot pool and, by applying the same algorithm, selects the same Leader. If the current node is elected Leader it notifies every Follower; otherwise it reports to the Leader.
ZooKeeper defines the Election interface, whose lookForLeader() performs the election.
public interface Election {
    public Vote lookForLeader() throws InterruptedException;
    public void shutdown();
}
In the cluster startup flow above, startLeaderElection() is finally called to run the election. It selects the election algorithm and casts a first ballot for the node itself (back yourself, young one!); a Vote contains the voting node's id, its current zxid, and the current epoch. ZooKeeper defaults to the FastLeaderElection algorithm. Finally the QuorumPeer thread is started and voting begins.
synchronized public void startLeaderElection() {
    try {
        // every node starts in LOOKING state, so each begins by creating
        // a ballot that votes itself for Leader
        if (getPeerState() == ServerState.LOOKING) {
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }

    // if (!getView().containsKey(myid)) {
    //     throw new RuntimeException("My id " + myid + " not in the peer list");
    // }
    if (electionType == 0) {
        try {
            udpSocket = new DatagramSocket(myQuorumAddr.getPort());
            responder = new ResponderThread();
            responder.start();
        } catch (SocketException e) {
            throw new RuntimeException(e);
        }
    }
    // initialize the election algorithm; electionType defaults to 3
    this.electionAlg = createElectionAlgorithm(electionType);
}
FastLeaderElection defines three inner classes, Notification, ToSend, and Messenger; Messenger in turn defines WorkerReceiver and WorkerSender.
Notification represents a ballot received from another server; it contains the proposed leader's id, zxid, election epoch, and so on.
ToSend represents a ballot to be sent to other servers; it carries the same kind of information: the proposed leader's id, zxid, election epoch, etc.
Messenger is the message-handling class for sending and receiving ballots; it contains the two thread classes WorkerReceiver and WorkerSender.
The FastLeaderElection class:
public class FastLeaderElection implements Election {
    //..........

    /**
     * Connection manager. Fast leader election uses TCP for
     * communication between peers, and QuorumCnxManager manages
     * such connections.
     */
    QuorumCnxManager manager;

    /*
     * Notification represents a ballot received from another server,
     * carrying the proposed leader's id, zxid, election epoch, etc.
     * Its buildMsg method packs the ballot into a ByteBuffer for sending.
     */
    static public class Notification {
        //..........
    }

    /**
     * Messages that a peer wants to send to other peers.
     * These messages can be both Notifications and Acks
     * of reception of notification.
     */
    /*
     * ToSend represents a ballot to send to other servers,
     * also carrying the proposed leader's id, zxid, election epoch, etc.
     */
    static public class ToSend {
        //..........
    }

    LinkedBlockingQueue<ToSend> sendqueue;
    LinkedBlockingQueue<Notification> recvqueue;

    /**
     * Multi-threaded implementation of message handler. Messenger
     * implements two sub-classes: WorkReceiver and WorkSender. The
     * functionality of each is obvious from the name. Each of these
     * spawns a new thread.
     */
    protected class Messenger {
        /**
         * Receives messages from instance of QuorumCnxManager on
         * method run(), and processes such messages.
         */
        class WorkerReceiver extends ZooKeeperThread {
            //..........
        }

        /**
         * This worker simply dequeues a message to send and
         * and queues it on the manager's queue.
         */
        class WorkerSender extends ZooKeeperThread {
            //..........
        }

        WorkerSender ws;
        WorkerReceiver wr;
        Thread wsThread = null;
        Thread wrThread = null;
    }

    //..........
    QuorumPeer self;
    Messenger messenger;
    AtomicLong logicalclock = new AtomicLong(); /* Election instance */
    long proposedLeader;
    long proposedZxid;
    long proposedEpoch;
    //..........
}
Once started, the QuorumPeer thread keeps watching the ServerState; whenever this node is in the LOOKING state it runs an election. A ZooKeeper server starts out LOOKING, so it begins electing right after startup. Voting happens through makeLEStrategy().lookForLeader(), i.e., FastLeaderElection.lookForLeader();
QuorumPeer.run():
public void run() {
    updateThreadName();
    //..........
    try {
        /*
         * Main loop
         */
        while (running) {
            switch (getPeerState()) {
            case LOOKING:
                LOG.info("LOOKING");
                if (Boolean.getBoolean("readonlymode.enabled")) {
                    final ReadOnlyZooKeeperServer roZk =
                            new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
                    Thread roZkMgr = new Thread() {
                        public void run() {
                            try {
                                // lower-bound grace period to 2 secs
                                sleep(Math.max(2000, tickTime));
                                if (ServerState.LOOKING.equals(getPeerState())) {
                                    roZk.startup();
                                }
                            } catch (InterruptedException e) {
                                LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                            } catch (Exception e) {
                                LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                            }
                        }
                    };
                    try {
                        roZkMgr.start();
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    } finally {
                        roZkMgr.interrupt();
                        roZk.shutdown();
                    }
                } else {
                    try {
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            case OBSERVING:
                try {
                    LOG.info("OBSERVING");
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    observer.shutdown();
                    setObserver(null);
                    updateServerState();
                }
                break;
            case FOLLOWING:
                try {
                    LOG.info("FOLLOWING");
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    follower.shutdown();
                    setFollower(null);
                    updateServerState();
                }
                break;
            case LEADING:
                LOG.info("LEADING");
                try {
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    if (leader != null) {
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    updateServerState();
                }
                break;
            }
            start_fle = Time.currentElapsedTime();
        }
    } finally {
        LOG.warn("QuorumPeer main thread exited");
        MBeanRegistry instance = MBeanRegistry.getInstance();
        instance.unregister(jmxQuorumBean);
        instance.unregister(jmxLocalPeerBean);
        for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
            instance.unregister(remotePeerBean);
        }
        jmxQuorumBean = null;
        jmxLocalPeerBean = null;
        jmxRemotePeerBean = null;
    }
}
FastLeaderElection.lookForLeader():
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
        self.start_fle = Time.currentElapsedTime();
    }
    try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

        // wait 200 milliseconds
        int notTimeout = finalizeWait;

        synchronized (this) {
            // increment the logical clock (election round)
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id = " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        // send our ballot
        sendNotifications();

        /*
         * Loop in which we exchange notifications until we find a leader
         */
        // keep looping while we are still in the LOOKING state
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            // take the next ballot received from the other nodes
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            // no ballot received
            if (n == null) {
                // check whether we are cut off from the cluster
                if (manager.haveDelivered()) {
                    // still connected: resend our ballot
                    sendNotifications();
                } else {
                    // disconnected: reconnect
                    manager.connectAll();
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            // a ballot arrived: process it
            else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                // the sender's ServerState
                switch (n.state) {
                case LOOKING:
                    // the sender is also LOOKING, so an election is in
                    // progress: process the ballot.
                    // If notification > current, replace and send messages out
                    // If our election epoch (round) is behind the sender's,
                    // ours is stale and must be updated.
                    if (n.electionEpoch > logicalclock.get()) {
                        // adopt the newer round
                        logicalclock.set(n.electionEpoch);
                        // discard the ballots collected so far
                        recvset.clear();
                        // compare ballots: if ours loses against the received
                        // one, adopt the received proposal; otherwise keep
                        // proposing ourself
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            // adopt the received proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            // keep our own proposal
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        // send the (possibly updated) ballot
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        // the sender's epoch is older than ours: drop the ballot
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        // same epoch: the normal case, all nodes in the same round.
                        // If our proposal loses against the received one,
                        // adopt it and resend.
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

                    // archive the ballot into the pool of valid received votes
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // tally the ballots: can the election finish?
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {
                        // a leader has been chosen
                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        // if the winner is this node, become LEADING;
                        // otherwise become a Follower (or Observer)
                        if (n == null) {
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(proposedLeader, proposedZxid, proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    // another node has already settled on a leader.
                    // If the ballot is from our round, add it to the pool and
                    // check whether a quorum has elected that leader; if so,
                    // run checkLeader:
                    /* checkLeader:
                     *   [a leader can be elected] and
                     *   [(the voted leader is ourself and the rounds match) or
                     *    (the voted leader is someone else, it is present in
                     *     outofelection, and its ServerState is already LEADING)]
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        if (termPredicate(recvset,
                                new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     * Only peer epoch is used to check that the votes come
                     * from the same ensemble. This is because there is at
                     * least one corner case in which the ensemble can be
                     * created with inconsistent zxid and election epoch
                     * info. However, given that only one ensemble can be
                     * running at a single point in time and that each
                     * epoch is used only once, using only the epoch to
                     * compare the votes is sufficient.
                     *
                     * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                     */
                    outofelection.put(n.sid,
                            new Vote(n.leader, IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));

                    // a different election round has already produced a result:
                    // tally outofelection and verify the leader; if the election
                    // can finish and checkLeader passes, adopt that leader
                    if (termPredicate(outofelection,
                            new Vote(n.leader, IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                            && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state
                            + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
    }
}
lookForLeader() keeps comparing the current proposal with incoming ballots and updating it until a leader is chosen. The comparison itself is done by totalOrderPredicate(), which works as follows:
1. Compare the epoch (election round) first; the larger epoch wins. If the received epoch is larger, this node's round is stale and is updated; otherwise the ballot is dropped.
2. Within the same round, the larger zxid wins, since a larger zxid means newer data.
3. Within the same round and with equal zxids, the larger serverId wins.
All three rules boil down to: bigger is better!
totalOrderPredicate():
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *    as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
                    ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
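To see the ordering in action, here is a tiny standalone sketch, not ZooKeeper code, that applies the same rule to three made-up ballots:

public class VoteOrderDemo {
    // same ordering rule as totalOrderPredicate, minus the weight check
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // made-up ballots: {serverId, zxid, epoch}
        long[][] votes = { {1, 0x500000007L, 1}, {2, 0x500000009L, 1}, {3, 0x500000009L, 1} };
        long[] best = votes[0];
        for (long[] v : votes) {
            if (wins(v[0], v[1], v[2], best[0], best[1], best[2])) {
                best = v;
            }
        }
        // servers 2 and 3 tie on epoch and zxid, so the larger id (3) wins
        System.out.println("elected leader: server " + best[0]);
    }
}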
Election Flow
The whole election can be summed up as: keep receiving and comparing ballots until a leader emerges. Every ZooKeeper node maintains its own ballot pool and applies the same comparison algorithm, so under normal conditions all nodes converge on the same leader.
end;
This article covers only part of the ZooKeeper source, namely startup and leader election. The core transaction processing and the ZAB consistency protocol will be covered in a follow-up. If anything here is wrong or imprecise, feel free to point it out in the comments.
ZooKeeper on GitHub: https://github.com/apache/zookeeper/
Apache ZooKeeper releases: https://zookeeper.apache.org/releases.html
Some of the source-code comments are adapted from: Lagou - Zimu