ZooKeeper Leader Election: Algorithm and Process

转圈圈的驴

Last modified: 2023-03-14 15:01:37

First published: 2020-12-14 23:48:10

Article link: https://blog.csdn.net/weixin_45192350/article/details/111188437

Zookeeper

1. What is ZooKeeper?

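ZooKeeper is a middleware service that provides coordination for distributed applications. In terms of the CAP theorem, it helps us achieve CP: consistency and partition tolerance.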

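In simple terms:
It resembles a file system. Its data structure is a tree of nodes (everything is a znode). A node looks like a directory in an operating system, except that the node itself can store data directly.
Typical use cases: distributed locks, a distributed configuration center, and a service registry.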

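Zab protocol: every write request from a Client is forwarded to the Leader. The Leader turns the request into a Proposal and sends it to the Followers, and each Follower replies with an ACK (its vote). Once more than half of the voting servers have acknowledged,
the Leader sends a commit notification to all Servers. On receiving it, each Server treats the request as accepted and updates its in-memory data. The Server that the Client is connected to also sends the response back to the Client.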
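
The majority rule in this write path can be sketched in a few lines of Java. This is only a single-process illustration, not ZooKeeper code: the class ProposalTracker and its method ack are hypothetical names, and the sketch assumes each voting server is identified by its sid.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a ZooKeeper class): tracks the ACKs for one proposal.
final class ProposalTracker {
    private final int votingMembers;            // number of voting servers in the ensemble
    private final Set<Long> acks = new HashSet<>();

    ProposalTracker(int votingMembers) {
        this.votingMembers = votingMembers;
    }

    // Records an ACK from server `sid`. Duplicate ACKs from the same sid count once.
    // Returns true once more than half of the voting servers have acknowledged.
    boolean ack(long sid) {
        acks.add(sid);
        return acks.size() > votingMembers / 2;
    }
}
```

With five voting servers, the third distinct ACK is the one that makes the proposal committable.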

2. How do the election algorithm and process work?

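The best server becomes the leader. "Best" is defined as follows:
1. The server with the newest data wins, that is, the one with the larger zxid. (In the source, totalOrderPredicate compares the peer epoch first and only then the zxid.)
2. If the zxids are equal, compare myid; the larger myid wins.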
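
The ordering can be written as a small comparison function. The sketch below is modeled on what totalOrderPredicate does in FastLeaderElection (peer epoch first, then zxid, then myid), but VoteOrder and isBetter are hypothetical names chosen for illustration, and the real method carries additional checks that are omitted here.

```java
// Hypothetical sketch of the "which vote is better" rule; not ZooKeeper's own class.
final class VoteOrder {
    private VoteOrder() {}

    // Returns true if the new candidate (newEpoch, newZxid, newId) beats the
    // current one (curEpoch, curZxid, curId).
    static boolean isBetter(long newEpoch, long newZxid, long newId,
                            long curEpoch, long curZxid, long curId) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // higher epoch wins outright
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // same epoch: newer data wins
        }
        return newId > curId;             // same data: larger myid wins
    }
}
```

A vote that is identical to the current one is not "better", so a server does not change its vote on a tie.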

0. Increment the election round
ZooKeeper requires all valid votes to belong to the same round, so before voting starts each server increments its logicalclock.

1. Cast the initial vote
Every server starts in the LOOKING state. It first votes for itself and sends that vote to the other servers.

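2. Receive votes
electionEpoch comes from the sender of the vote; logicalclock is the receiver's own value. Both denote the election round.
1. electionEpoch > logicalclock: this server is behind. It catches up (logicalclock = electionEpoch), clears the votes it has collected so far, compares the incoming vote against its own initial vote (PK), and broadcasts the result.
2. electionEpoch < logicalclock: the sender is in an older round, so the vote is ignored.
3. electionEpoch = logicalclock: the rounds match, so proceed to the PK step.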

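3. Compare votes (PK)
Apply the rules above to pick the better server as leader. If the other server's candidate is better, this server changes its vote from itself to that candidate.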

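4. Count
After every vote it receives, the server checks whether more than half of the machines have voted for the same server. If so, that server is treated as the prospective leader.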
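
The counting step amounts to a majority check over the votes received so far. The following is a minimal sketch under simplifying assumptions: VoteTally and hasQuorum are hypothetical names, and each vote is reduced to "voter sid -> proposed leader sid", whereas the real code compares full Vote objects.

```java
import java.util.Map;

// Hypothetical sketch of the majority check; not ZooKeeper's termPredicate itself.
final class VoteTally {
    private VoteTally() {}

    // votes maps each voter's sid to the sid it currently proposes as leader.
    // Returns true if `candidate` is backed by more than half of the ensemble.
    static boolean hasQuorum(Map<Long, Long> votes, long candidate, int ensembleSize) {
        long supporters = votes.values().stream()
                .filter(proposed -> proposed == candidate)
                .count();
        return supporters > ensembleSize / 2;
    }
}
```

The threshold is computed from the configured ensemble size, not from the number of votes received, so two votes out of three servers are enough while two out of five are not.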

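5. Check the remaining votes
After a prospective leader is chosen there is one more check: the server keeps reading votes from the other servers and compares them again. If a newly received vote names a better candidate, voting resumes.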

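6. Change state
While checking the remaining votes, the server loops and keeps polling for new votes. When no further vote arrives within the wait timeout (finalizeWait), the prospective leader sets its own state to LEADING.
The other servers set themselves to FOLLOWING or OBSERVING according to their configuration files.
Note: if the leader dies while the cluster is running, the followers' communication with the leader fails, so each follower switches its state back to LOOKING and a new election begins.
Part of the leader-election source code is attached below.
FastLeaderElection#lookForLeader()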

case LOOKING:
    // If notification > current, replace and send messages out
    if (n.electionEpoch > logicalclock.get()) {
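        // The sender's election round is ahead of ours: adopt its round as our logical clock
        logicalclock.set(n.electionEpoch);
        // Discard the votes collected in the old round
        recvset.clear();
        // Compare the incoming vote with our own initial vote and back whichever has the newer data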
        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
            updateProposal(n.leader, n.zxid, n.peerEpoch);
        } else {
            updateProposal(getInitId(),
                    getInitLastLoggedZxid(),
                    getPeerEpoch());
        }
        // Broadcast the updated vote
        sendNotifications();
    } else if (n.electionEpoch < logicalclock.get()) {
        // Our logical clock is ahead of the sender's election round: ignore this vote
        if(LOG.isDebugEnabled()){
            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                    + Long.toHexString(n.electionEpoch)
                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
        }
        break;
    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
            proposedLeader, proposedZxid, proposedEpoch)) {
        // Same election round, and the sender's candidate is better (compared by epoch, then zxid, then myid)
        // Update our own vote
        updateProposal(n.leader, n.zxid, n.peerEpoch);
        // Broadcast it
        sendNotifications();
    }

    if(LOG.isDebugEnabled()){
        LOG.debug("Adding vote: from=" + n.sid +
                ", proposed leader=" + n.leader +
                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
    }

    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
    // With the received votes plus our own, check whether the proposed leader has a majority (quorum)
    if (termPredicate(recvset,
            new Vote(proposedLeader, proposedZxid,
                    logicalclock.get(), proposedEpoch))) {

        // Verify if there is any change in the proposed leader
        while((n = recvqueue.poll(finalizeWait,
                TimeUnit.MILLISECONDS)) != null){
            // Keep polling for further votes. If one arrives that names a better server, put it back on recvqueue and break out of the loop so the election continues

            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                    proposedLeader, proposedZxid, proposedEpoch)){
                recvqueue.put(n);
                break;
            }
        }

        /*
         * This predicate is true once we don't read any new
         * relevant message from the reception queue
         */
        if (n == null) {
            // No more votes arrived. If the elected sid is our own, become LEADING; otherwise become a follower or observer according to the configuration
            self.setPeerState((proposedLeader == self.getId()) ?
                    ServerState.LEADING: learningState());

            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid,
                                    logicalclock.get(),
                                    proposedEpoch);
            // Return the final vote
            leaveInstance(endVote);
            return endVote;
        }
    }
    break;

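Communication between servers
QuorumCnxManager
Every server starts a QuorumCnxManager at startup. It manages the connections that servers use to talk to each other during an election.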

1. Message queues
public class QuorumCnxManager
Map<Long, SendWorker> senderWorkerMap; // the SendWorker for each remote server, sid:SendWorker

Map<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap; // the queue of messages waiting to be sent to each server, sid:queue

Map<Long, ByteBuffer> lastMessageSent; // the most recent message sent to each server, sid:lastMessageSent

ArrayBlockingQueue<Message> recvQueue; // messages received by this server

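2. Establishing connections

When QuorumCnxManager starts, it opens a ServerSocket and listens for connections. To keep two servers from opening duplicate TCP connections to each other, ZooKeeper applies a rule: if the connecting server's sid is smaller than the local one,
the local server closes that connection and initiates a connection to the other server itself. Once a connection is established, the entries for the remote sid are created in the maps above, and received messages go into recvQueue.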
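
The tie-breaking rule can be expressed as a single comparison. This is a simplified sketch: ConnectionRule and acceptIncoming are hypothetical names, and the real QuorumCnxManager does more work around this check (reading the handshake, starting the worker threads).

```java
// Hypothetical sketch of the duplicate-connection rule; not ZooKeeper's own code.
final class ConnectionRule {
    private ConnectionRule() {}

    // Decides what the local server does with a connection opened by remoteSid.
    // true  -> keep the incoming connection
    // false -> close it and connect to remoteSid ourselves
    static boolean acceptIncoming(long mySid, long remoteSid) {
        return remoteSid > mySid;
    }
}
```

For any pair of servers only one direction passes the check, so exactly one TCP connection survives between them, always opened by the server with the larger sid.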

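3. Sending messages
ZooKeeper assigns a dedicated SendWorker to each remote server; each SendWorker is its own thread.
The SendWorker checks the send queue for its server in queueSendMap. If the queue is empty, it takes the most recently sent message from lastMessageSent and sends it again.
This ensures that the peer really receives the last message sent. The receiver handles duplicates idempotently, so a repeated message has no side effects.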
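
The "send from the queue, otherwise resend the last message" decision can be sketched without any networking. OutboundMessages, enqueue and next are hypothetical names, and the queue capacity of 16 is an arbitrary assumption for the sketch; only the two map names are borrowed from the fields listed above.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the per-server send logic; not the real SendWorker.
final class OutboundMessages {
    private final Map<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap = new ConcurrentHashMap<>();
    private final Map<Long, ByteBuffer> lastMessageSent = new ConcurrentHashMap<>();

    // Queues a message for server `sid` (capacity 16 is an assumption of this sketch).
    void enqueue(long sid, ByteBuffer msg) {
        queueSendMap.computeIfAbsent(sid, k -> new ArrayBlockingQueue<>(16)).offer(msg);
    }

    // Picks what to send to `sid` next: a queued message if there is one,
    // otherwise the last message already sent (null if nothing was ever sent).
    ByteBuffer next(long sid) {
        ArrayBlockingQueue<ByteBuffer> queue = queueSendMap.get(sid);
        ByteBuffer msg = (queue == null) ? null : queue.poll();
        if (msg != null) {
            lastMessageSent.put(sid, msg);   // remember it for a possible resend
            return msg;
        }
        return lastMessageSent.get(sid);
    }
}
```

Because the same message may be returned more than once, this design only works if the receiver treats duplicates as harmless, which is the idempotency requirement mentioned above.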

Partial SendWorker source, with comments

@Override
public void run() {
    threadCnt.incrementAndGet();
    try {
        /**
         * If there is nothing in the queue to send, then we
         * send the lastMessage to ensure that the last message
         * was received by the peer. The message could be dropped
         * in case self or the peer shutdown their connection
         * (and exit the thread) prior to reading/processing
         * the last message. Duplicate messages are handled correctly
         * by the peer.
         *
         * If the send queue is non-empty, then we have a recent
         * message than that stored in lastMessage. To avoid sending
         * stale message, we should send the message in the send queue.
         */
        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
        if (bq == null || isSendQueueEmpty(bq)) {
           ByteBuffer b = lastMessageSent.get(sid);
           if (b != null) {
               LOG.debug("Attempting to send lastMessage to sid=" + sid);
               send(b);
           }
        }
    } catch (IOException e) {
        LOG.error("Failed to send last message. Shutting down thread.", e);
        this.finish();
    }
}

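4. Receiving messages
ZooKeeper assigns a dedicated RecvWorker to each remote server; each RecvWorker is its own thread. It keeps reading messages from the TCP connection and adds each one to recvQueue.

RecvWorker source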

@Override
public void run() {
    threadCnt.incrementAndGet();
    try {
        while (running && !shutdown && sock != null) {
            /**
             * Reads the first int to determine the length of the
             * message
             */
            int length = din.readInt();
            if (length <= 0 || length > PACKETMAXSIZE) {
                throw new IOException(
                        "Received packet with invalid packet: "
                                + length);
            }
            /**
             * Allocates a new ByteBuffer to receive the message
             */
            byte[] msgArray = new byte[length];
            din.readFully(msgArray, 0, length);
            ByteBuffer message = ByteBuffer.wrap(msgArray);
            addToRecvQueue(new Message(message.duplicate(), sid));
        }
    } catch (Exception e) {
        LOG.warn("Connection broken for id " + sid + ", my id = "
                 + QuorumCnxManager.this.mySid + ", error = " , e);
    } finally {
        LOG.warn("Interrupting SendWorker");
        sw.finish();
        if (sock != null) {
            closeSocket(sock);
        }
    }
}

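3. Are ZooKeeper watch notifications on a node permanent?

No. With the native ZooKeeper client, a watch registered on a node is called back once and then expires. The source code shows why: the client takes the watcher out of its map with map.remove and only then runs it. (ZooKeeper 3.6 and later also provide persistent watches through addWatch; the classic watches described here remain one-shot.)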
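
The remove-then-invoke pattern is what makes a watch one-shot. The sketch below imitates it with plain collections; OneShotWatches, watch and fire are hypothetical names, and it is not the ZooKeeper client's watch manager.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch of one-shot watches; not the real client implementation.
final class OneShotWatches {
    private final Map<String, Set<Consumer<String>>> watches = new ConcurrentHashMap<>();

    // Registers a watcher for `path`.
    void watch(String path, Consumer<String> watcher) {
        watches.computeIfAbsent(path, p -> new HashSet<>()).add(watcher);
    }

    // Removes all watchers for `path` first, then runs them.
    // Returns how many watchers were notified.
    int fire(String path) {
        Set<Consumer<String>> removed = watches.remove(path);
        if (removed == null) {
            return 0;   // nothing registered, or already fired
        }
        removed.forEach(w -> w.accept(path));
        return removed.size();
    }
}
```

A second event on the same path notifies nobody unless the watcher has been registered again, which is why client code typically re-registers the watch inside the callback.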

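4. How many machines should a cluster have?

2n+1. This follows from the majority rule: the source code requires the number of votes to be greater than half of the cluster size. A 4-server cluster therefore gives the same fault tolerance as a 3-server one: each tolerates only one failure.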
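
The arithmetic behind the 2n+1 recommendation is short enough to write down. QuorumMath and its two methods are hypothetical names used only for this illustration.

```java
// Hypothetical helper for the majority arithmetic; not part of ZooKeeper.
final class QuorumMath {
    private QuorumMath() {}

    // Smallest number of servers that is strictly more than half of `servers`.
    static int quorumSize(int servers) {
        return servers / 2 + 1;
    }

    // How many servers can fail while a quorum is still reachable.
    static int toleratedFailures(int servers) {
        return servers - quorumSize(servers);
    }
}
```

For example, QuorumMath.toleratedFailures(3) and QuorumMath.toleratedFailures(4) both return 1, while five servers tolerate two failures, so the even-sized cluster pays for an extra machine without gaining fault tolerance.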

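5. Can machines be added to a cluster dynamically?

Not in the versions discussed here, because the configuration file has to change. The cluster size is determined at startup by the number of server entries in the configuration file, and the majority rule is computed from that number.
So adding a machine means editing the configuration file on every server and then restarting them one by one. (ZooKeeper 3.5 and later add dynamic reconfiguration, which removes this restriction.)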

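6. If a server goes down, how is its data synchronized when it restarts?

A server that was already part of the cluster only needs to be restarted. On startup it votes for itself as usual, but a leader already exists, so the other servers tell it who the leader is, and it then synchronizes its data.
PS: a snapshot is taken only after a number of transactions, so the snapshot does not necessarily hold the latest data. Transactions committed after the snapshot are recorded in a list.
Data synchronization is part of follower startup, and during this period clients cannot access that follower.
How the data is synchronized: part of it comes from the snapshot, and anything newer than the snapshot comes from a list (committedLog).
The rough flow is as follows.
Leader:
1. Find the data to synchronize.
2. Send it to the follower in question.
3. After sending everything, send the UPTODATE command.
Follower:
1. Receive the data sent by the leader.
2. Loop, waiting for the UPTODATE command.
3. On receiving UPTODATE, the data is fully synchronized; break out of the loop.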
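
The flow above can be modeled as two small functions, one per side. This is a heavily simplified sketch, not the real protocol: SyncSketch, Packet, packetsFor and applyUntilUpToDate are hypothetical names, the committed log is reduced to a list of zxids, and the snapshot transfer and the other packet types the real servers exchange are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the leader/follower sync loop; not ZooKeeper's own code.
final class SyncSketch {
    private SyncSketch() {}

    enum Type { DATA, UPTODATE }

    static final class Packet {
        final Type type;
        final long zxid;   // meaningful only for DATA packets

        Packet(Type type, long zxid) {
            this.type = type;
            this.zxid = zxid;
        }
    }

    // Leader side: pick every committed zxid newer than the follower's last one,
    // then terminate the stream with UPTODATE.
    static List<Packet> packetsFor(long followerLastZxid, List<Long> committedLog) {
        List<Packet> out = new ArrayList<>();
        for (long zxid : committedLog) {
            if (zxid > followerLastZxid) {
                out.add(new Packet(Type.DATA, zxid));
            }
        }
        out.add(new Packet(Type.UPTODATE, -1L));
        return out;
    }

    // Follower side: apply packets in order and stop at UPTODATE.
    // Returns the last zxid the follower has applied.
    static long applyUntilUpToDate(long lastZxid, List<Packet> packets) {
        for (Packet p : packets) {
            if (p.type == Type.UPTODATE) {
                break;
            }
            lastZxid = p.zxid;
        }
        return lastZxid;
    }
}
```

A follower that is already current receives only the UPTODATE packet and leaves the loop immediately.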
