ZooKeeper: something both simple and complicated

韩哥123456

Category: zookeeper. Tags: distributed systems, zookeeper, java, algorithms

Article link: https://blog.csdn.net/u014274324/article/details/107289256

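Why call ZooKeeper both simple and complicated? Complicated, because the theory behind it really is hard, and many people cannot follow it. Simple, because at the code level the implementation of that theory is not complicated; it is in fact remarkably clear. Let's start with the complicated part: Paxos.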

1: Paxos

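Paxos is the theoretical foundation for almost every consensus algorithm in use today. Chubby is built on Paxos, and ZooKeeper's ZAB protocol borrows heavily from it. However, Paxos is hard to implement, and even a working implementation runs into problems in practice, so almost no middleware uses Paxos exactly as originally described; nearly all of them use a variant of the theory. Here is a brief walkthrough of Paxos: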

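Paxos has three roles: Proposer, Acceptor, and Learner. Learners do not take part in the voting, so I will not say much about them. A Proposer puts forward proposals, and Acceptors vote on them. The walkthrough is built around a message-flow figure (not reproduced here) in which nodes A1 to A5 exchange messages at the numbered steps 1 to 8 referenced below.

The figure shows two phases: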

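Phase 1
- a) The proposer sends a prepare message to more than half of the acceptors in the network.
- b) Under normal conditions, each acceptor replies with a promise message.
Phase 2
- a) Once enough acceptors have replied with a promise, the proposer sends an accept message.
- b) Under normal conditions, each acceptor replies with an accepted message.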

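Case 1:

Phase 1 (prepare):

At step 1, A1 sends prepare(maxn=1). When A3 receives it, its own maxn is null, so it sets maxn=1 and at step 2 replies to A1 with ok, promise(maxn=1). A2, A4, and A5 reply to A1 in the same way. Having received promises from a majority, A1 starts the phase 2 accept round.

Phase 2 (accept):

At step 3, A1 sends accept(maxn=1, value=10%). When A3 receives it, it compares its own maxn=1 with the maxn=1 in A1's accept; they are equal, so at step 4 it replies to A1 with ok, accepted(maxn=1, value=10%). A2, A4, and A5 reply the same way. Once A1 has received accepted replies from a majority, the value is chosen and the learner phase follows.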
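The acceptor side of these two phases can be sketched in a few lines. This is only a minimal single-decree illustration under my own assumptions: the class name PaxosAcceptor, the Promise type, and the use of -1 for "nothing yet" are all made up for this example and do not come from any library.

```java
/** Minimal single-decree Paxos acceptor (illustrative names, not a real library). */
final class PaxosAcceptor {
    /** Reply to prepare(n): whether we promised, plus any proposal accepted earlier. */
    static final class Promise {
        final boolean ok;
        final long acceptedN;        // -1 if nothing has been accepted yet
        final String acceptedValue;  // null if nothing has been accepted yet
        Promise(boolean ok, long acceptedN, String acceptedValue) {
            this.ok = ok;
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    private long maxN = -1;          // highest proposal number seen so far
    private long acceptedN = -1;     // number of the last accepted proposal
    private String acceptedValue;    // value of the last accepted proposal

    /** Phase 1b: promise only if n is higher than anything promised before. */
    synchronized Promise prepare(long n) {
        if (n > maxN) {
            maxN = n;
            return new Promise(true, acceptedN, acceptedValue);
        }
        return new Promise(false, acceptedN, acceptedValue);
    }

    /** Phase 2b: accept unless a higher-numbered prepare has already been promised. */
    synchronized boolean accept(long n, String value) {
        if (n >= maxN) {
            maxN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PaxosAcceptor a3 = new PaxosAcceptor();
        System.out.println(a3.prepare(1).ok);             // true
        System.out.println(a3.accept(1, "10%"));          // true
        System.out.println(a3.prepare(2).acceptedValue);  // 10%
        System.out.println(a3.accept(1, "10%"));          // false
    }
}
```

The last call in main shows the rejection that drives the abnormal cases: once an acceptor has promised a higher number, an older accept is refused.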

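This is the smoothest possible run. Reality is rarely so kind, and abnormal situations do occur. Case 2:

Phase 1 (prepare):

Phase 1 is the same as in case 1; the difference is in phase 2.

Phase 2 (accept):

At step 5, A5 sends prepare(maxn=2). When A3 receives it, its own maxn=1 is less than 2, so it sets maxn=2 and at step 6 replies to A5 with ok, promise(maxn=2, value=10%), also reporting that the value was accepted under maxn=1. Suppose A1, A2, and A4 reply to A5 in the same way. A5 now has promises from a majority and, because those promises carry an accepted value, sends accept(maxn=2, value=10%). Meanwhile A1 has also received accepted replies from a majority. This is the ideal outcome: both A1's accept(maxn=1, value=10%) and A5's accept(maxn=2, value=10%) receive majority replies, and value=10% is propagated.

This case also goes smoothly. No second round of election is needed; the value 10% is simply accepted a second time. This exposes a weakness of the theory: any node may propose at any time, so the process is hard to control. ZAB later introduces the concept of a leader to address this. Case 3: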

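Phase 1 (prepare):

Phase 1 is the same as in case 2; the difference is in phase 2.

Phase 2 (accept):

At step 5, A5 sends prepare(maxn=2). A3 has maxn=1, which is less than 2, so it sets maxn=2 and at step 6 replies to A5 with ok, promise(maxn=2, value=10%), also reporting that the value was accepted under maxn=1. Suppose A2 and A4 also promise A5's prepare(maxn=2), but, unlike A3, the nodes A2, A4, and A5 have not yet received A1's accept(maxn=1, value=10%). At step 7, A1 sends accept(maxn=1, value=10%), but A2, A4, and A5 now have maxn=2, so at step 8 they reject it, and A1 has to start over with prepare(maxn=3). Meanwhile A5 holds promise(maxn=2, value=null) from A2 and A4 and promise(maxn=2, value=10%) from A3, which is a majority. It must pick the value accepted under the highest maxn among the promises it counted, here value=10%; had its majority not included A3, it would be free to propose its own value, say 20%. So A5 sends accept(maxn=2, value=10% or value=20%). But A1's prepare(maxn=3) can preempt that accept in exactly the same way, sending A5 back to prepare with a still higher number, and the two proposers can keep cancelling each other indefinitely. This is the well-known livelock. Other cases follow the same pattern with little variation, so I will not go through them.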
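The value-selection rule the proposer applies after collecting promises can be sketched as follows. This is a small illustration under my own assumptions: PaxosValueChoice, its Promise type, and the -1/null convention for "nothing accepted" are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of how a proposer picks the value for accept(n, value); names are illustrative. */
final class PaxosValueChoice {
    /** What an acceptor reports in its promise: the last proposal it accepted, if any. */
    static final class Promise {
        final long acceptedN;        // -1 if the acceptor has accepted nothing
        final String acceptedValue;  // null if the acceptor has accepted nothing
        Promise(long acceptedN, String acceptedValue) {
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    /**
     * If any promise carries an accepted value, the proposer must reuse the one
     * with the highest proposal number; only otherwise may it use its own value.
     */
    static String choose(List<Promise> promises, String ownValue) {
        Promise best = null;
        for (Promise p : promises) {
            if (p.acceptedValue != null && (best == null || p.acceptedN > best.acceptedN)) {
                best = p;
            }
        }
        return best == null ? ownValue : best.acceptedValue;
    }

    public static void main(String[] args) {
        // A5's majority includes A3, which already accepted 10% under maxn=1.
        List<Promise> withA3 = Arrays.asList(
                new Promise(-1, null), new Promise(-1, null), new Promise(1, "10%"));
        System.out.println(choose(withA3, "20%"));     // 10%

        // A majority in which nobody has accepted anything yet.
        List<Promise> withoutA3 = Arrays.asList(
                new Promise(-1, null), new Promise(-1, null), new Promise(-1, null));
        System.out.println(choose(withoutA3, "20%"));  // 20%
    }
}
```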

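That concludes Paxos. Even I find it complicated, and I have no clear idea how to implement it for real-world conditions. This is the algorithm's well-known weakness: basic Paxos is hard to understand and hard to implement. Now to the main topic: ZooKeeper's ZAB, a variant built on Paxos theory. Theory that cannot be implemented has little practical value.

2: ZooKeeper: the ZAB protocol

ZAB has three parts: crash recovery mode, data synchronization mode, and message broadcast mode. Some people group crash recovery and data synchronization together as phase one and treat message broadcast as phase two; either view works.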

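2.1: Crash recovery mode

1: Initialize votes. Every server votes for itself. (1,1) means server1 votes for server1 (itself), (2,2) means server2 votes for server2 (itself), and (3,3) means server3 votes for server3 (itself).

2: Send the initial votes. server1 sends (1,1) to server2 and server3; server2 sends (2,2) to server1 and server3; server3 sends (3,3) to server1 and server2.

3: Receive external votes. server2 and server3 receive server1's vote, server1 and server3 receive server2's vote, and server1 and server2 receive server3's vote.

4: Update votes. The rule is to compare the global clock (epoch) first; if the epochs are equal, compare zx_id (the transaction id); if the zx_ids are equal as well, compare server_id (my_id), the number assigned to each server. Because the servers have only just started, epoch and zx_id are identical everywhere and only server_id (my_id) matters, with server1 < server2 < server3. server1 receives (2,2) from server2, changes its own vote to (1,2), and records (1,2), (2,2); it then receives server3's vote, changes its own vote again, and records (1,3), (2,2), (3,3). server2 receives server1's vote and does not change, recording (1,1), (2,2); it then receives server3's vote, changes its own vote, and records (1,1), (2,3), (3,3). server3 receives the votes from server1 and server2 and changes nothing, recording (1,1), (2,2), (3,3).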

server1: (1,3), (2,2), (3,3)

server2: (1,1), (2,3), (3,3)

server3: (1,1), (2,2), (3,3)
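The comparison rule from step 4 can be sketched as follows. This is a minimal illustration: the class name VoteOrder and the method name beats are my own, and the code only mirrors the epoch, then zxid, then server id ordering described above.

```java
/** Sketch of the vote ordering used when deciding whether to switch votes. */
final class VoteOrder {
    /**
     * Returns true if the received vote should replace the current one:
     * higher epoch wins; on a tie, higher zxid wins; on a tie, higher server id wins.
     */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Fresh start: epoch and zxid are equal, so only the server id decides.
        System.out.println(beats(2, 0, 0, 1, 0, 0)); // true: server1 switches to server2
        System.out.println(beats(1, 0, 0, 3, 0, 0)); // false: server3 keeps its own vote
        // A larger zxid beats a larger server id.
        System.out.println(beats(1, 5, 0, 3, 4, 0)); // true
    }
}
```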

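Because server1 and server2 changed their votes, they must send the changed votes out. Strictly speaking server1 has to send both (1,2) and (1,3); I will skip (1,2) and only show (1,3). server1 sends (1,3), server2 sends (2,3), and server3 did not change its vote, so it sends nothing further.

5: Send the changed votes. server1 sends (1,3) to server2 and server3; server2 sends (2,3) to server1 and server3; server3 does not need to send anything.

6: Update votes again. None of the incoming votes beats the vote a server already holds, so nobody changes its own vote; each server only updates its record of the others' votes. server1 receives (2,3) from server2 and records (1,3), (2,3), (3,3); server2 receives (1,3) from server1 and records (1,3), (2,3), (3,3); server3 receives the votes from server1 and server2 and records (1,3), (2,3), (3,3).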

server1: (1,3), (2,3), (3,3)

server2: (1,3), (2,3), (3,3)

server3: (1,3), (2,3), (3,3)
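Deciding the winner from such a tally comes down to counting which candidate holds more than half of the votes. The sketch below is only an illustration under my own assumptions: ElectionTally and winner are hypothetical names, and the map goes from voter id to the server that voter currently proposes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Sketch: find the candidate backed by more than half of the ensemble, if any. */
final class ElectionTally {
    /**
     * @param votes        voter server id -> proposed leader id
     * @param ensembleSize number of voting servers in the cluster
     */
    static OptionalLong winner(Map<Long, Long> votes, int ensembleSize) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long leader : votes.values()) {
            int count = counts.merge(leader, 1, Integer::sum);
            if (count > ensembleSize / 2) {
                return OptionalLong.of(leader);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        Map<Long, Long> initial = new HashMap<>();
        initial.put(1L, 1L);
        initial.put(2L, 2L);
        initial.put(3L, 3L);
        System.out.println(winner(initial, 3).isPresent()); // false: everyone voted for itself

        Map<Long, Long> last = new HashMap<>();
        last.put(1L, 3L);
        last.put(2L, 3L);
        last.put(3L, 3L);
        System.out.println(winner(last, 3).getAsLong());    // 3
    }
}
```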

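No server changed its vote in this round, so the election ends here.

As shown above, server3 has won the agreement of a majority of the machines and becomes the leader. The election is implemented in the class FastLeaderElection. The code is below for anyone interested; I will not analyze it, since it matches the steps above almost exactly: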

public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = System.currentTimeMillis();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock++;
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock) {
                            logicalclock = n.electionEpoch;
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock, proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock,
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock = n.electionEpoch;
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }

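2.2: Data synchronization mode

ZooKeeper caches the most recent 500 requests to make data synchronization easier, and synchronization is built on this cache. minCommittedLog and maxCommittedLog are the smallest and largest zxid in the cache, and the follower's position relative to this range decides how it is synchronized. Many mainstream middleware systems have a similar mechanism.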

1: The follower's peerLastZxid lies between minCommittedLog and maxCommittedLog

1.1: The follower's peerLastZxid equals the leader's last zxid. The follower and leader hold the same data; DIFF synchronization is used, which here means there is nothing to synchronize.

1.2: The follower's peerLastZxid is smaller than the leader's last zxid, and peerLastZxid exists on the leader. DIFF synchronization is used.

leader: 0x500000001, 0x500000002, 0x500000003, 0x500000004, 0x500000005, 0x600000001

follower: peerLastZxid is 0x500000003

The leader only needs to send 0x500000004, 0x500000005, and 0x600000001 to the follower.

1.3: The follower's peerLastZxid is smaller than the leader's last zxid, but peerLastZxid does not exist on the leader.

TRUNC+DIFF is used (roll back first, then synchronize the difference).

leader: 0x500000001, 0x500000002, 0x500000003, 0x500000004, 0x500000005, 0x600000001

follower: peerLastZxid is 0x500000006

The follower rolls back 0x500000006, and then the leader sends 0x600000001 to the follower.

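2: The follower's peerLastZxid is larger than the leader's maxCommittedLog. The leader tells the follower to roll back to maxCommittedLog (TRUNC).

This scenario can be seen as a simplified form of TRUNC+DIFF.

3: The follower's peerLastZxid is smaller than the leader's minCommittedLog, or the leader has no proposal cache queue.

SNAP (full synchronization) is used. In this mode the leader first sends a SNAP message to the follower, then serializes the full data set from its in-memory database and transmits it. After receiving the full data, the follower deserializes it and loads it into its own in-memory database.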
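The decision among the three cases above can be sketched as a small function. This is only an illustration of the rules as described in this post, not ZooKeeper's actual code: SyncChooser, Mode, choose, and toSend are hypothetical names, and the committed-log cache is modeled as a sorted set of zxids.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Sketch of choosing a sync mode from the follower's last zxid and the leader's cache. */
final class SyncChooser {
    enum Mode { DIFF, TRUNC_DIFF, TRUNC, SNAP }

    static Mode choose(long peerLastZxid, NavigableSet<Long> committedLog) {
        if (committedLog.isEmpty()) {
            return Mode.SNAP;                      // case 3: no proposal cache
        }
        long minCommittedLog = committedLog.first();
        long maxCommittedLog = committedLog.last();
        if (peerLastZxid > maxCommittedLog) {
            return Mode.TRUNC;                     // case 2: follower is ahead
        }
        if (peerLastZxid < minCommittedLog) {
            return Mode.SNAP;                      // case 3: follower is too far behind
        }
        // case 1: inside the cached range
        return committedLog.contains(peerLastZxid) ? Mode.DIFF : Mode.TRUNC_DIFF;
    }

    /** Proposals the leader replays (after any rollback): everything above peerLastZxid. */
    static NavigableSet<Long> toSend(long peerLastZxid, NavigableSet<Long> committedLog) {
        return committedLog.tailSet(peerLastZxid, false);
    }

    public static void main(String[] args) {
        NavigableSet<Long> log = new TreeSet<>(Arrays.asList(
                0x500000001L, 0x500000002L, 0x500000003L,
                0x500000004L, 0x500000005L, 0x600000001L));
        System.out.println(choose(0x500000003L, log));        // DIFF
        System.out.println(toSend(0x500000003L, log).size()); // 3
        System.out.println(choose(0x500000006L, log));        // TRUNC_DIFF
        System.out.println(choose(0x600000002L, log));        // TRUNC
        System.out.println(choose(0x400000001L, log));        // SNAP
    }
}
```

The two examples from cases 1.2 and 1.3 fall out directly: for 0x500000003 the leader replays three proposals, and for 0x500000006 only 0x600000001 remains after the rollback.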

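2.3: Message broadcast mode

1) A client issues a write request.

2) The Leader converts the client request into a transaction Proposal and assigns each Proposal a globally unique id, the zxid.

3) The Leader keeps a separate queue for each Follower, puts the Proposals to be broadcast into these queues in order, and sends them in FIFO order.

4) When a Follower receives a Proposal, it first writes it to local disk as a transaction log entry, and after the write succeeds it sends an Ack back to the Leader.

5) Once the Leader has received Acks from more than half of the servers (a quorum), it considers the Proposal delivered and can send the commit message.

6) The Leader broadcasts commit to all Followers and commits the transaction itself. When a Follower receives the commit message, it commits the previously proposed transaction.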
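Steps 5 and 6 hinge on counting Acks per proposal. The sketch below is a minimal illustration, not ZooKeeper's Leader class: AckTracker and its methods are hypothetical names, and I assume the quorum is counted over the whole voting ensemble with the leader's own Ack included.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: track Acks per zxid and report when a proposal first reaches a majority. */
final class AckTracker {
    private final int ensembleSize;                        // voting servers, leader included (assumption)
    private final Map<Long, Set<Long>> acks = new HashMap<>();
    private final Set<Long> committed = new HashSet<>();

    AckTracker(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    /**
     * Records an Ack from server sid for proposal zxid.
     * Returns true exactly once: when the proposal first has Acks from more than half.
     */
    boolean ack(long zxid, long sid) {
        if (committed.contains(zxid)) {
            return false;                                  // late Ack, already committed
        }
        Set<Long> voters = acks.computeIfAbsent(zxid, k -> new HashSet<>());
        voters.add(sid);                                   // a set ignores duplicate Acks
        if (voters.size() > ensembleSize / 2) {
            committed.add(zxid);
            acks.remove(zxid);
            return true;                                   // time to broadcast commit
        }
        return false;
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker(5);
        long zxid = 0x500000001L;
        System.out.println(tracker.ack(zxid, 1)); // false (leader's own Ack)
        System.out.println(tracker.ack(zxid, 2)); // false
        System.out.println(tracker.ack(zxid, 3)); // true: 3 of 5
        System.out.println(tracker.ack(zxid, 4)); // false: already committed
    }
}
```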

Here I want to paste a piece of code that reflects real engineering experience:

public void run() {
        try {
            int logCount = 0;

            // we do this in an attempt to ensure that not all of the servers
            // in the ensemble take a snapshot at the same time
            setRandRoll(r.nextInt(snapCount/2));
            while (true) {
                Request si = null;
                if (toFlush.isEmpty()) {
                    si = queuedRequests.take();
                } else {
                    si = queuedRequests.poll();
                    if (si == null) {
                        flush(toFlush);
                        continue;
                    }
                }
                if (si == requestOfDeath) {
                    break;
                }
                if (si != null) {
                    // track the number of records written to the log
                    if (zks.getZKDatabase().append(si)) {
                        logCount++;
                        if (logCount > (snapCount / 2 + randRoll)) {
                            randRoll = r.nextInt(snapCount/2);
                            // roll the log
                            zks.getZKDatabase().rollLog();
                            // take a snapshot
                            if (snapInProcess != null && snapInProcess.isAlive()) {
                                LOG.warn("Too busy to snap, skipping");
                            } else {
                                snapInProcess = new ZooKeeperThread("Snapshot Thread") {
                                        public void run() {
                                            try {
                                                zks.takeSnapshot();
                                            } catch(Exception e) {
                                                LOG.warn("Unexpected exception", e);
                                            }
                                        }
                                    };
                                snapInProcess.start();
                            }
                            logCount = 0;
                        }
                    } else if (toFlush.isEmpty()) {
                        // optimization for read heavy workloads
                        // iff this is a read, and there are no pending
                        // flushes (writes), then just pass this to the next
                        // processor
                        if (nextProcessor != null) {
                            nextProcessor.processRequest(si);
                            if (nextProcessor instanceof Flushable) {
                                ((Flushable)nextProcessor).flush();
                            }
                        }
                        continue;
                    }
                    toFlush.add(si);
                    if (toFlush.size() > 1000) {
                        flush(toFlush);
                    }
                }
            }
        } catch (Throwable t) {
            handleException(this.getName(), t);
            running = false;
        }
        LOG.info("SyncRequestProcessor exited!");
    }

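What does this code say? When a write request arrives and the system is idle (the request queue is empty), the write is flushed right away. When the system is busy, write requests accumulate and are flushed together once more than 1000 are pending or the queue drains. I used to assume writes were handled in real time: execute one write, finish it, return the result to the user. In reality it is all asynchronous. I would have thought of asynchrony for network I/O, but not of performing writes this way, so this was an eye-opener. One thing worth complaining about: ZooKeeper's file handling is rather crude. Data is simply serialized when it is saved, and all of it is loaded into memory; there is no notion of keeping only an index resident in memory. It is just one big dictionary, which is why ZooKeeper is memory-hungry. In this respect it is very much like Redis, except that Redis also offers some excellent data structures.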
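The batching idea can be isolated in a much smaller, single-threaded model. This is a simplified sketch and not the SyncRequestProcessor itself: GroupCommitSketch, drain, and flushedBatches are names I made up, requests are plain strings, and a "flush" just records the batch instead of syncing to disk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Simplified model of group commit: flush when the queue is idle or the batch is large. */
final class GroupCommitSketch {
    private final int maxPending;                        // flush once more than this many are pending
    private final List<String> pending = new ArrayList<>();
    final List<List<String>> flushedBatches = new ArrayList<>();

    GroupCommitSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    /** Consumes everything currently queued, flushing in batches. */
    void drain(Queue<String> requests) {
        String request;
        while ((request = requests.poll()) != null) {
            pending.add(request);
            if (pending.size() > maxPending) {
                flush();                                 // busy: batch is full
            }
        }
        if (!pending.isEmpty()) {
            flush();                                     // idle: nothing more queued
        }
    }

    private void flush() {
        flushedBatches.add(new ArrayList<>(pending));    // stand-in for one disk sync
        pending.clear();
    }

    public static void main(String[] args) {
        GroupCommitSketch writer = new GroupCommitSketch(2);
        writer.drain(new ArrayDeque<String>(Arrays.asList("a", "b", "c", "d", "e")));
        System.out.println(writer.flushedBatches);       // [[a, b, c], [d, e]]
    }
}
```

Five requests cost two flushes instead of five, which is the whole point: the expensive disk sync is shared by every request in the batch.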

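This post only covered how the election works and how consistency is guaranteed. Some ZooKeeper basics:

PERSISTENT: a persistent node

PERSISTENT_SEQUENTIAL: a persistent node with an automatic sequence number; ZooKeeper appends a monotonically increasing sequence number to the node name

EPHEMERAL: an ephemeral node, deleted automatically when the client session times out

EPHEMERAL_SEQUENTIAL: an ephemeral node with an automatic sequence number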
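For the two SEQUENTIAL modes, ZooKeeper's documentation describes the suffix as a counter formatted with %010d, that is, ten digits padded with zeros. The sketch below only imitates that naming on the client side to show why the padding matters; SequentialNames and its methods are hypothetical helpers, not part of the ZooKeeper API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch of sequential znode naming: prefix plus a zero-padded 10-digit counter. */
final class SequentialNames {
    static String withSequence(String prefix, int counter) {
        return prefix + String.format(Locale.ENGLISH, "%010d", counter);
    }

    /** With a shared prefix, zero padding makes lexicographic order match numeric order. */
    static String lowest(List<String> children) {
        return Collections.min(children);
    }

    public static void main(String[] args) {
        String second = withSequence("lock-", 2);
        String tenth = withSequence("lock-", 10);
        System.out.println(second);                               // lock-0000000002
        System.out.println(lowest(Arrays.asList(tenth, second))); // lock-0000000002
    }
}
```

Finding the lowest-numbered child this way is what lock recipes built on sequential nodes rely on.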

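I did not cover these, nor the leader, follower, and observer roles, nor why ZooKeeper uses EPHEMERAL nodes for distributed locks. If you have not used them, these topics are worth reading up on. I have used ZooKeeper, but never on a large cluster, so some of my understanding may be off. This post was really hard to write; I had to sort through much of the material bit by bit, and it took a great deal of time. I had planned to write one on Redis as well, but ran out of time. I will write about Redis next week.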

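3: The relationship between distributed consistency (distributed transactions, 2PC) and consensus algorithms (ZooKeeper)

Distributed consistency and consensus algorithms really are two different concepts, yet they are closely intertwined. This section looks at how they are connected and how they differ.

3.1: 2PC distributed transactions

I will only cover 2PC here, not 3PC. 2PC consists of 2 phases and 3 steps.

1: preCommit

1.1: Transaction inquiry

The coordinator sends a transaction inquiry to all participants.

1.2: Execute the transaction

Each participant executes the transaction. On success it writes the redo log and undo log without committing, and reports success to the coordinator. On failure it reports failure to the coordinator.

2: doCommit

2.1: If all participants report success, commit; if any participant reports failure, roll back.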
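The coordinator's logic in these steps can be sketched as follows. It is a minimal in-process illustration with no networking, timeouts, or failure handling: TwoPhaseCommitSketch, Participant, and FixedVote are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of a 2PC coordinator: commit only if every participant votes yes. */
final class TwoPhaseCommitSketch {
    interface Participant {
        boolean prepare();   // execute, write redo/undo logs, report success or failure
        void commit();
        void rollback();
    }

    /** Test participant that always gives the same answer and remembers its state. */
    static final class FixedVote implements Participant {
        private final boolean vote;
        String state = "INIT";
        FixedVote(boolean vote) { this.vote = vote; }
        public boolean prepare() { state = "PREPARED"; return vote; }
        public void commit() { state = "COMMITTED"; }
        public void rollback() { state = "ROLLED_BACK"; }
    }

    /** Runs both phases and returns true if the transaction was committed. */
    static boolean run(List<? extends Participant> participants) {
        boolean allYes = true;
        for (Participant p : participants) {
            allYes &= p.prepare();           // phase 1: ask everyone, no short-circuit
        }
        for (Participant p : participants) { // phase 2: one decision for all
            if (allYes) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allYes;
    }

    public static void main(String[] args) {
        FixedVote a = new FixedVote(true);
        FixedVote b = new FixedVote(false);
        System.out.println(run(Arrays.asList(a, b))); // false
        System.out.println(a.state);                  // ROLLED_BACK
    }
}
```

Note that run blocks until every participant has answered, and nothing here can take over if the coordinator itself stops.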

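What if the coordinator dies? Everything blocks and stays unavailable. Can a new coordinator be elected? Must every participant reply before the commit can proceed? What if the network fails? Network jitter between data centers in different regions is quite likely; what then?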

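3.2: The consensus algorithm ZAB

Any of the problems above would be a disaster if it actually occurred. There is a reason ZooKeeper is described here as a consensus algorithm rather than a consistency algorithm.

Coordinator: the leader plays the role of the 2PC coordinator.

Participants: the followers play the role of the 2PC participants.

To avoid the whole system becoming unavailable when the 2PC coordinator crashes, the servers elect a new leader, which requires more than half of the participants to reach consensus. That is the point of the election. After that comes 2PC:

1: preCommit

1.1: Transaction inquiry

The leader sends a transaction inquiry to all followers.

1.2: Execute the transaction

Each follower executes the transaction. On success it writes the txn log without committing, and reports success to the leader. On failure it reports failure to the leader.

2: doCommit

2.1: If more than half of the participants report success, commit; otherwise no commit happens, which is similar to a rollback.

Why more than half? Because two disjoint groups of servers cannot both hold a majority, so split-brain cannot occur when the cluster spans several data centers in the same city. For applications spanning data centers in different regions, though, try not to use ZooKeeper: because of network latency the cluster may spend most of its time on elections and data replication.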
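The "more than half" argument can be checked mechanically: however a cluster is split into two sides, at most one side can hold a majority. The sketch below is just that arithmetic; QuorumSplit and its methods are hypothetical names for this example.

```java
/** Sketch: a strict-majority quorum can never exist on both sides of a partition. */
final class QuorumSplit {
    static boolean hasQuorum(int reachable, int ensembleSize) {
        return reachable > ensembleSize / 2;
    }

    /** Tries every two-way split of the ensemble and reports whether both sides could win. */
    static boolean splitBrainPossible(int ensembleSize) {
        for (int sideA = 0; sideA <= ensembleSize; sideA++) {
            int sideB = ensembleSize - sideA;
            if (hasQuorum(sideA, ensembleSize) && hasQuorum(sideB, ensembleSize)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(3, 5));       // true
        System.out.println(hasQuorum(2, 5));       // false
        System.out.println(hasQuorum(2, 4));       // false: an even split elects nobody
        System.out.println(splitBrainPossible(5)); // false
    }
}
```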

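To sum up: why can ZAB provide consistency (meaning eventual consistency here)? You can say ZAB is a two-phase, 2PC-style protocol, and you can also say it is not; different people see it differently. In my view ZAB is an evolved form of 2PC designed for availability, and it solves the fatal problems 2PC has in theory.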
