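ZooKeeper Election Source Code Analysis

                                ZooKeeper Fast Leader Election Flow Explained

Before walking through the flow, here are the roles involved in the election, along with the key classes and variables (source version referenced: 3.4.9):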

Roles:

           1. LOOKING: campaigning, still looking for a leader

           2. OBSERVING: observer

           3. FOLLOWING: follower

           4. LEADING: leader

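Vote information:

           1. logicalclock (electionEpoch): the local election epoch, incremented each time a new election round starts

           2. epoch (peerEpoch): the leader epoch, incremented each time an election finishes and a leader is finally settled (it forms the high 32 bits of the full zxid)

           3. zxid: the data ID, whose counter is incremented on every data change (the counter forms the low 32 bits of the full zxid, which is 64 bits in total)

           4. sid: the serverId of the server that sent this vote

           5. leader: the proposed leader (the serverId, i.e. sid, of the server being proposed)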
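The relationship between epoch and zxid is easier to see as bit arithmetic. The following is a minimal sketch, assuming the layout described above (epoch in the high 32 bits, counter in the low 32 bits). The class `ZxidSketch` and its method names are hypothetical illustrations and are not taken from the ZooKeeper source.

```java
// Hypothetical helper, not part of ZooKeeper: illustrates the 64-bit zxid layout.
final class ZxidSketch {
    private ZxidSketch() {}

    // Combine an epoch (high 32 bits) and a counter (low 32 bits) into one zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    // Extract the epoch from the high 32 bits.
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    // Extract the counter from the low 32 bits.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 7);
        // Prints: zxid=0x300000007 epoch=3 counter=7
        System.out.println("zxid=0x" + Long.toHexString(zxid)
                + " epoch=" + epochOf(zxid)
                + " counter=" + counterOf(zxid));
    }
}
```

Because the epoch occupies the high bits, a zxid from a newer epoch always compares greater than any zxid from an older epoch, whatever the counters are.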

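Vote comparison rules:

          1. The vote with the larger epoch wins; if they are equal, go to step 2

          2. The vote with the larger zxid wins; if they are equal, go to step 3

          3. The vote with the larger sid wins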

The source code of the comparison rules is as follows:

    /**
     * Check if a pair (server id, zxid) succeeds our
     * current vote.
     *
     * @param id    Server identifier
     * @param zxid  Last zxid observed by the issuer of this vote
     */
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
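
To see the three rules in action, here is a small standalone sketch that reproduces only the ordering expression. It deliberately leaves out the quorum-weight check, which needs a running QuorumPeer. The class `VoteOrderSketch` and the method name `beats` are hypothetical and do not exist in ZooKeeper.

```java
// Hypothetical standalone copy of the ordering used by totalOrderPredicate.
// The getWeight(newId) == 0 check from the real method is intentionally omitted.
final class VoteOrderSketch {
    private VoteOrderSketch() {}

    // Returns true when the new (id, zxid, epoch) triple should replace the current one.
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || (newEpoch == curEpoch
                    && (newZxid > curZxid || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // Rule 1: a higher epoch wins even with a smaller zxid and sid.
        System.out.println(beats(1, 5, 2, 3, 9, 1));
        // Rule 2: same epoch, so the higher zxid wins.
        System.out.println(beats(1, 9, 1, 3, 5, 1));
        // Rule 3: same epoch and zxid, so the higher sid wins.
        System.out.println(beats(3, 5, 1, 1, 5, 1));
        // An identical vote does not beat the current one.
        System.out.println(beats(1, 5, 1, 1, 5, 1));
    }
}
```

The last case matters in practice: since an equal vote never replaces the current proposal, a server does not rebroadcast when it receives a vote identical to its own.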

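The rough election flow comes first. For now, do not worry about how the vote data is exchanged between servers and simply treat it as available; how votes are exchanged during the election is covered later.

1. First increment logicalclock, propose yourself as leader, and broadcast the proposal

2. Enter the voting loop for this round

3. Take one vote from the recvqueue queue. If nothing arrives, check whether your own vote needs to be resent or the connections need to be re-established; otherwise go to step 4

4. Check the election state carried in the vote:

        LOOKING state:

                1. If the sender's logicalclock is greater than the local logicalclock, update the local logicalclock, clear the local vote tally recvset, compare yourself as a candidate against the leader in the vote, take the larger of the two as the new vote, and broadcast it; otherwise go to step 2

                2. If the sender's logicalclock is smaller than the local logicalclock, ignore the vote and go to the next iteration of the loop; otherwise go to step 3

                3. If the two logicalclock values are equal, compare the locally proposed leader with the leader in the vote; if the vote's leader is larger, adopt it as the new vote and broadcast it

                4. Store the sender's vote in the local vote tally recvset, then check whether the currently proposed leader holds a majority of the votes (more than half of the servers). If it does, wait a further finalizeWait (continuing to poll recvqueue) to see whether anyone proposes a better leader candidate. If so, put that vote back into recvqueue and start the next iteration of the loop; otherwise settle the role and end the election

        OBSERVING state: observers have no voting rights, so the vote is ignored and the loop moves on to the next iteration

        FOLLOWING/LEADING state:

                1. If the sender's logicalclock equals the local logicalclock, store the sender's vote in the local vote tally recvset, then check whether the sender's vote holds a majority in recvset and whether the proposed leader really is leading. If so, settle the role and end the election; otherwise go to step 2

                2. Put the sender's vote into outofelection, the local tally of servers that are no longer taking part in the election, then check whether the sender's vote holds a majority in outofelection and whether the proposed leader really is leading. If so, update logicalclock, settle the role, and end the election; otherwise go to the next iteration of the loop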
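
The majority check used in step 4 of the LOOKING branch can be sketched as follows. This is a simplified illustration, not the ZooKeeper implementation: it assumes a plain majority quorum, reduces each vote to just the proposed leader's sid, and the class `MajoritySketch` with its method `hasMajority` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a simple-majority tally over recvset.
// Assumption: each vote is reduced to the sid of the leader it proposes.
final class MajoritySketch {
    private MajoritySketch() {}

    // recvset maps voter sid -> sid of the leader that voter proposes.
    static boolean hasMajority(Map<Long, Long> recvset, long proposedLeader, int ensembleSize) {
        int supporters = 0;
        for (long votedFor : recvset.values()) {
            if (votedFor == proposedLeader) {
                supporters++;
            }
        }
        // Strictly more than half of all voting servers must agree.
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 3L);
        recvset.put(2L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // 2 of 5 is not enough
        recvset.put(3L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // 3 of 5 is a majority
    }
}
```

Note that the tally is keyed by voter sid, so a server that changes its mind overwrites its earlier vote instead of being counted twice.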

The source code of the election flow is as follows:

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
            self.start_fle = System.currentTimeMillis();
        }
        try {
            // Votes tallied locally in this round
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            // Votes from peers in FOLLOWING or LEADING state, i.e. not LOOKING
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            // Propose ourselves as leader
            synchronized(this){
                logicalclock++;
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock) {
                            logicalclock = n.electionEpoch;
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // Cache the sender's vote for the final tally
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock, proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid,
                                        logicalclock,
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock){
                            recvset.put(n.sid, new Vote(n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch));

                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader,
                                        n.zxid,
                                        n.electionEpoch,
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                n.leader,
                                n.zxid,
                                n.electionEpoch,
                                n.peerEpoch,
                                n.state));

                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock = n.electionEpoch;
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
        }
    }
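
The "Exponential backoff" step in the loop above doubles the poll timeout each time nothing is received, up to a cap. The sketch below isolates that calculation. The class `BackoffSketch` and method `nextTimeout` are hypothetical, and the constant values 200 ms and 60000 ms are stated from memory of the 3.4.x `FastLeaderElection` defaults, so treat them as assumptions and check them against the source.

```java
// Hypothetical sketch isolating the timeout doubling from lookForLeader().
final class BackoffSketch {
    // Assumed values of finalizeWait and maxNotificationInterval in 3.4.x.
    static final int FINALIZE_WAIT = 200;
    static final int MAX_NOTIFICATION_INTERVAL = 60000;

    private BackoffSketch() {}

    // Double the timeout, but never exceed the maximum interval.
    static int nextTimeout(int current) {
        int doubled = current * 2;
        return Math.min(doubled, MAX_NOTIFICATION_INTERVAL);
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT;
        // Show how the timeout grows over successive empty polls.
        for (int emptyPolls = 0; emptyPolls < 10; emptyPolls++) {
            System.out.println("poll timeout (ms): " + timeout);
            timeout = nextTimeout(timeout);
        }
    }
}
```

Under these assumed constants the timeout reaches the cap after nine empty polls, so an isolated server settles into retrying about once a minute instead of flooding the network.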

The election flow chart is as follows:

[Figure: fast election flow]

That covers the fast election flow. So how is the data exchanged during the election? The rest of this article explains it.

The ZooKeeper startup script zkServer.cmd contains the following lines:

    set ZOOMAIN=org.apache.zookeeper.server.quorum.QuorumPeerMain
    echo on
    call %JAVA% "-Dzookeeper.log.dir=%ZOO_LOG_DIR%" "-Dzookeeper.root.logger=%ZOO_LOG4J_PROP%" -cp "%CLASSPATH%" %ZOOMAIN% "%ZOOCFG%" %*

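From this we know that the startup class is org.apache.zookeeper.server.quorum.QuorumPeerMain. Tracing the code from there shows that the election runs in the lookForLeader() method of the FastLeaderElection class, while the actual network interaction takes place in the QuorumCnxManager class. The class relationships are shown in the following two diagrams:

[Figure: network interaction class diagrams]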

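Details:
         The QuorumCnxManager class is where the network interaction actually happens; it collects and sends vote messages over the network. As the class diagrams show, it has an inner class named Listener, which ensures that there is exactly one connection per pair of servers and starts two threads that send and receive vote messages: SendWorker and RecvWorker.
         The FastLeaderElection class also has two inner classes that send and receive vote messages: WorkerSender and WorkerReceiver.
         Sending path: when the election method lookForLeader() sends a vote, it puts the vote into the sendqueue queue of the FastLeaderElection class. WorkerSender (FastLeaderElection) moves messages from sendqueue into queueSendMap in the QuorumCnxManager class, and SendWorker (QuorumCnxManager) sends the votes held in queueSendMap out over the network.

         Receiving path: RecvWorker (QuorumCnxManager) receives votes from the network and puts them into the recvQueue queue of the QuorumCnxManager class. WorkerReceiver (FastLeaderElection) takes data from recvQueue in the QuorumCnxManager class and puts it into the recvqueue queue of the FastLeaderElection class.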
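
The two paths above form a chain of queues, each drained by its own thread. The following is a toy model of that chain, not ZooKeeper code: votes are plain strings, a `wire` queue stands in for the socket, a single queue stands in for one entry of queueSendMap, and the class `ElectionPipelineSketch` with its methods is hypothetical. Only the thread and queue names mirror the classes described above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical toy model of the send/receive queue chain; not ZooKeeper code.
final class ElectionPipelineSketch {
    private ElectionPipelineSketch() {}

    // Start a daemon thread that keeps moving items from one queue to the next.
    static Thread hop(String name, BlockingQueue<String> from, BlockingQueue<String> to) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    to.put(from.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop the worker when interrupted
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Push one vote through the whole chain and return what arrives at the far end.
    static String roundTrip(String vote) {
        BlockingQueue<String> fleSendqueue = new LinkedBlockingQueue<>(); // FastLeaderElection.sendqueue
        BlockingQueue<String> cnxSendSlot = new LinkedBlockingQueue<>();  // one entry of queueSendMap
        BlockingQueue<String> wire = new LinkedBlockingQueue<>();         // stands in for the socket
        BlockingQueue<String> cnxRecvQueue = new LinkedBlockingQueue<>(); // QuorumCnxManager.recvQueue
        BlockingQueue<String> fleRecvqueue = new LinkedBlockingQueue<>(); // FastLeaderElection.recvqueue

        Thread[] workers = {
            hop("WorkerSender", fleSendqueue, cnxSendSlot),
            hop("SendWorker", cnxSendSlot, wire),
            hop("RecvWorker", wire, cnxRecvQueue),
            hop("WorkerReceiver", cnxRecvQueue, fleRecvqueue)
        };
        try {
            fleSendqueue.put(vote);
            return fleRecvqueue.take(); // what lookForLeader() would poll
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the vote", e);
        } finally {
            for (Thread w : workers) {
                w.interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("vote: leader=3"));
    }
}
```

The point of the layering is decoupling: the election logic only ever touches its own two queues, so connection handling and retries can change without affecting lookForLeader().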

I made a copy of the 3.4.9 source code and added a few comments: https://github.com/learnertogether/zookeeper-3.4.9.git
