Close Reading of the Classics, Issue 1: From Paxos to Zookeeper, Distributed Consistency Principles and Practice (4)

Latest recommended article published on 2024-07-25 19:44:08

TXL永远爱WXY

Category column: 细读经典 (Close Reading of the Classics). Article tags: zookeeper

Article link: https://blog.csdn.net/u014783007/article/details/120839694

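7. ZooKeeper Internals

The most important point about ZK is that it relies on its own data model to achieve consistency in a distributed system.

7.1 System Model

7.1.1 Data Model

The znode (data node) is the smallest data unit, and znodes are organized into a tree.

For every transaction request, ZooKeeper assigns a globally unique transaction ID, called the ZXID. Each ZXID corresponds to one update operation, and ZXIDs are used to guarantee a global ordering of operations.

7.1.2 Node Types

Nodes can be persistent (PERSISTENT), ephemeral (EPHEMERAL), or sequential (SEQUENTIAL). At creation time these combine into four concrete types: persistent, persistent sequential, ephemeral, and ephemeral sequential. Ephemeral nodes can only be leaf nodes, and their lifetime is bound to the client session, not to the TCP connection.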

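7.1.3 Znode Version Information: Guaranteeing Atomic Distributed Data Operations

Version information on a znode is used to make distributed data operations atomic. Each data node carries three version fields: version (the data version), cversion (the children version), and aversion (the ACL version).

version is the number of times the node's data has been updated since the node was created. A newly created znode has version 0. Every update increments version, even if the written data is identical to the existing data. In other words, the version does not indicate whether the znode's content differs; it indicates whether some client has written to the node. This is relevant to the so-called ABA problem: content that changes and then changes back still yields a new version.

Optimistic-lock transaction control has three phases: read the data, validate the write, and write the data.

When processing a setDataRequest, ZooKeeper compares versions. The client may choose whether to use CAS. If it does, and the version it supplies does not match the node's current version, a BadVersionException is thrown.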
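The read, validate, write cycle can be sketched with a small in-memory model. This is only an illustration: `VersionedNode` is a hypothetical class, not part of ZooKeeper, an `IllegalStateException` stands in for `BadVersionException`, and the sketch assumes that an expected version of -1 means "skip the check", which is how a non-CAS write is expressed here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory stand-in for a znode; not a ZooKeeper class.
final class VersionedNode {
    private byte[] data;
    private int version = 0; // a freshly created node starts at version 0

    VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int getVersion() {
        return version;
    }

    synchronized byte[] getData() {
        return data.clone();
    }

    /**
     * Validates expectedVersion and then writes newData.
     * Passing -1 skips the check (an unconditional, non-CAS write).
     * Returns the node's version after the write.
     */
    synchronized int setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            // Stands in for BadVersionException in this sketch.
            throw new IllegalStateException(
                "bad version: expected " + expectedVersion + ", actual " + version);
        }
        data = newData.clone();
        return ++version; // incremented even if newData equals the old data
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("a".getBytes(StandardCharsets.UTF_8));
        int seen = node.getVersion();                                 // 1. read
        node.setData("b".getBytes(StandardCharsets.UTF_8), seen);     // 2. validate, 3. write
        try {
            node.setData("c".getBytes(StandardCharsets.UTF_8), seen); // stale version
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller that loses the race would typically re-read the node, take the new version, and retry the write.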

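7.1.4 Watcher: Data Change Notification

A client registers a Watcher with the server. When a specified event on the server triggers that Watcher, the server sends an event notification to the corresponding client, which is how distributed notification is implemented.

When a client registers a Watcher with ZooKeeper, the Watcher object is stored in the client-side WatchManager. After the server triggers the Watcher event, it sends a notification to the client, and a client thread takes the Watcher out of the WatchManager and runs its callback logic.

The Watcher interface defines two enums, KeeperState and EventType, and one method, process(WatchedEvent event).

In essence, the server still maintains a Map<String,Set<Watcher>> that maps each node path to the set of Watchers registered on that path.

Summary of Watcher characteristics:

One-shot: once a Watcher is triggered, it is removed from the Set<Watcher>. When using a Watcher, decide whether it needs to be registered again after it fires.
Lightweight: WatchedEvent is the smallest notification unit of ZooKeeper's Watcher mechanism and contains only three fields (the notification state, the event type, and the node path). The process callback only tells the client that an event happened, not what its content was. The notification carries neither the old nor the new data, so the client has to fetch the data itself. This keeps the notification mechanism lightweight.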
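The one-shot behaviour of the path-to-Watcher map can be sketched as follows. This is a simplified model under stated assumptions: `OneShotWatchTable` and its nested `Watcher` interface are hypothetical names, and the event is reduced to a type string plus a path to mirror the "lightweight" point above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified watch table; names are illustrative only.
final class OneShotWatchTable {
    interface Watcher {
        void process(String eventType, String path);
    }

    // node path -> Watchers registered on that path
    private final Map<String, Set<Watcher>> watchTable = new HashMap<>();

    synchronized void addWatch(String path, Watcher watcher) {
        watchTable.computeIfAbsent(path, k -> new HashSet<>()).add(watcher);
    }

    /** Removes every Watcher on path, then notifies them; returns how many fired. */
    int triggerWatch(String path, String eventType) {
        Set<Watcher> fired;
        synchronized (this) {
            fired = watchTable.remove(path); // one-shot: removed before being notified
        }
        if (fired == null) {
            return 0;
        }
        for (Watcher w : fired) {
            w.process(eventType, path); // carries no data, only type and path
        }
        return fired.size();
    }

    public static void main(String[] args) {
        OneShotWatchTable table = new OneShotWatchTable();
        table.addWatch("/app/config", (type, path) -> System.out.println(type + " " + path));
        System.out.println(table.triggerWatch("/app/config", "NodeDataChanged")); // 1
        System.out.println(table.triggerWatch("/app/config", "NodeDataChanged")); // 0, not re-registered
    }
}
```

A client that wants continuous notification would call `addWatch` again inside its callback, typically together with a fresh read of the node.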

7.1.5 ACL Access Control

Instead of the UNIX-style UGO (user, group, others) permission model, ZooKeeper uses ACLs of the form scheme:id:permission.
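To make the scheme:id:permission shape concrete, here is a minimal parsing sketch. `AclSketch` is a hypothetical class. The permission letters c, d, r, w, a (CREATE, DELETE, READ, WRITE, ADMIN) and the bit values are assumptions intended to mirror ZooKeeper's own permission constants, and the sketch assumes the id part may itself contain a colon, as in a digest-style `user:hash` id.

```java
// Hypothetical sketch of parsing "scheme:id:permission"; not a ZooKeeper class.
final class AclSketch {
    // Permission bits; values are assumed to match ZooKeeper's permission constants.
    static final int READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16;

    final String scheme;
    final String id;
    final int perms;

    AclSketch(String scheme, String id, int perms) {
        this.scheme = scheme;
        this.id = id;
        this.perms = perms;
    }

    /** Parses "scheme:id:perms"; the id may contain ':' itself. */
    static AclSketch parse(String acl) {
        int first = acl.indexOf(':');
        int last = acl.lastIndexOf(':');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected scheme:id:perms but got " + acl);
        }
        int perms = 0;
        for (char c : acl.substring(last + 1).toCharArray()) {
            switch (c) {
                case 'r': perms |= READ; break;
                case 'w': perms |= WRITE; break;
                case 'c': perms |= CREATE; break;
                case 'd': perms |= DELETE; break;
                case 'a': perms |= ADMIN; break;
                default: throw new IllegalArgumentException("unknown permission: " + c);
            }
        }
        return new AclSketch(acl.substring(0, first), acl.substring(first + 1, last), perms);
    }

    boolean allows(int perm) {
        return (perms & perm) == perm;
    }

    public static void main(String[] args) {
        AclSketch acl = parse("world:anyone:cdrwa");
        System.out.println(acl.scheme + " / " + acl.id + " / " + acl.perms);
    }
}
```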

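7.2 Serialization Protocol

ZK uses the Jute serialization component.

7.3 Client

Core components:

ZooKeeper instance: the client entry point

ClientWatchManager: the client-side Watcher manager

HostProvider: the server address manager

ClientCnxn: the client's core network connector, which contains SendThread (establishes and drives TCP communication with the server) and EventThread (the event-processing thread). The process by which a ZK client creates a session: [figure not available]

7.4 Session

7.4.1 Session State Transitions

Image source: "[ZooKeeper] ZooKeeper的会话状态" (ZooKeeper session states), zjysource's column on the CSDN blog

Session states: CONNECTING, CONNECTED, RECONNECTING, RECONNECTED, and CLOSE.

7.4.2 Session Creation

(1) Session: the client session entity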

    interface Session {
        // sessionId uniquely identifies a session
        long getSessionId();
        // session timeout
        int getTimeout();
        // whether the session is being closed
        boolean isClosing();
    }

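How the sessionId is generated: the current time is shifted left by 24 bits and then unsigned-right-shifted by 8 bits. Together the two shifts drop the time's high-order bits, clear the top 8 bits of the result (the unsigned shift >>> is used so that the sign bit is not propagated), and leave the low 16 bits as zero. The server id is then shifted left by 56 bits and combined with bitwise OR (not AND) into those top 8 bits. The result is an initial sessionId determined by the server id and the time. (Every time I see an ID generated from the time, I think of clock rollback.)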

    public static long initializeNextSessionId(long id) {
        long nextSid;
        nextSid = (Time.currentElapsedTime() << 24) >>> 8;
        nextSid = nextSid | (id << 56);
        if (nextSid == EphemeralType.CONTAINER_EPHEMERAL_OWNER) {
            ++nextSid;  // this is an unlikely edge case, but check it just in case
        }
        return nextSid;
    }
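A worked example helps to see the bit layout. The sketch below repeats the same two expressions as the method above but takes the clock value as a parameter so the result is repeatable; `SessionIdSketch`, `initialSessionId`, and the sample time value are made up for illustration, and the `CONTAINER_EPHEMERAL_OWNER` edge case is left out.

```java
// Illustrative only: the same shifts as initializeNextSessionId, with the clock passed in.
final class SessionIdSketch {
    static long initialSessionId(long nowMillis, long serverId) {
        // Keeps the low 40 bits of the time and places them at bit positions 16..55.
        long sid = (nowMillis << 24) >>> 8;
        // The server id occupies the top 8 bits (positions 56..63).
        return sid | (serverId << 56);
    }

    public static void main(String[] args) {
        long sid = initialSessionId(0x0123456789ABL, 3);
        System.out.println(Long.toHexString(sid));  // 323456789ab0000
        System.out.println(sid >>> 56);             // 3, the server id
        System.out.println(sid & 0xFFFFL);          // 0, the low 16 bits are left free
    }
}
```

Note that the leading `01` of the sample time value is lost: only the low 40 bits of the time survive the two shifts.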

Finally, ZK provides the SessionTracker interface to manage sessions.

// maps sessionId to session
protected final ConcurrentHashMap<Long, SessionImpl> sessionsById = new ConcurrentHashMap<Long, SessionImpl>();
// maps sessionId to session timeout
protected final ConcurrentMap<Long, Integer> sessionsWithTimeout;

7.4.3 Session Management

Session management in ZK is handled mainly by SessionTracker, which uses a "bucketing strategy" (not important here).
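For readers who do want a feel for the bucketing strategy, here is a minimal sketch. It assumes the rule is to round each session's expiry time up to the next multiple of a fixed interval, so that sessions expiring at about the same time share a bucket and can be checked together. `SessionBuckets` and all of its members are hypothetical names, not the SessionTracker API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of expiry-time bucketing; not the ZooKeeper implementation.
final class SessionBuckets {
    private final long interval;                                   // bucket width in ms
    private final Map<Long, Set<Long>> buckets = new HashMap<>();  // expiry tick -> session ids
    private final Map<Long, Long> bucketOf = new HashMap<>();      // session id -> expiry tick

    SessionBuckets(long interval) {
        this.interval = interval;
    }

    /** Rounds time up to the next multiple of the interval. */
    long roundToNextInterval(long time) {
        return (time / interval + 1) * interval;
    }

    /** Called on session creation and on every heartbeat: (re)assigns the bucket. */
    void touch(long sessionId, long now, long timeout) {
        long tick = roundToNextInterval(now + timeout);
        Long old = bucketOf.put(sessionId, tick);
        if (old != null && old != tick) {
            buckets.get(old).remove(sessionId); // move out of the previous bucket
        }
        buckets.computeIfAbsent(tick, k -> new HashSet<>()).add(sessionId);
    }

    /** Removes and returns every session still sitting in the bucket for tick. */
    Set<Long> expire(long tick) {
        Set<Long> expired = buckets.remove(tick);
        if (expired == null) {
            return Collections.emptySet();
        }
        for (Long id : expired) {
            bucketOf.remove(id);
        }
        return expired;
    }

    public static void main(String[] args) {
        SessionBuckets tracker = new SessionBuckets(2000);
        tracker.touch(1, 1000, 4000); // expiry 5000 -> bucket 6000
        tracker.touch(2, 1500, 4000); // expiry 5500 -> bucket 6000
        tracker.touch(1, 3000, 4000); // heartbeat: expiry 7000 -> bucket 8000
        System.out.println(tracker.expire(6000)); // [2]
    }
}
```

The point of the design is that the checker only has to look at one bucket per interval instead of scanning every session.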

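7.6 Leader Election

A server in a ZK ensemble can play one of three roles: Leader, Observer, or Follower. Over the course of an election, a server can be in one of these states: LOOKING, FOLLOWING, LEADING, or OBSERVING.

7.6.1 Election Overview

1. Leader election during cluster initialization

Cluster mode in ZK starts from at least 2 servers. In cluster mode, each machine needs a myid file under the dataDir path configured in zoo.cfg; it uniquely identifies that ZK server within the cluster. Take 3 servers as an example, with myid values 1, 2, and 3. When the cluster starts and only ZK1 is up, leader election cannot be carried out and the cluster has no Leader. Once ZK2 starts and establishes communication with ZK1, the cluster can begin the leader election process.

(1) Each server casts a vote of the form (myid, ZXID). Because this is the initialization phase, each server votes for itself with ZXID 0, so ZK1 votes (1, 0) and ZK2 votes (2, 0), and each sends its vote to the other ZK servers in the cluster.

(2) Each server receives votes and checks their validity (whether the vote belongs to the current round and whether it comes from a server in the LOOKING state).

(3) Processing votes. Each server compares every received vote with its own. ZXIDs are compared first, and the vote with the larger ZXID becomes the final vote; if the ZXIDs are equal, the vote with the larger myid wins. In this example, after ZK1 receives ZK2's vote (2, 0), it changes its own vote from (1, 0) to (2, 0). ZK2, after receiving ZK1's vote, keeps its original vote as final. ZK1 and ZK2 then send their final votes to all ZK servers in the cluster again.

(4) Counting votes

After this round, both ZK1 and ZK2 have received the vote (2, 0). The cluster has 3 nodes and both ZK1 and ZK2 count two votes for (2, 0), which reaches the majority threshold (n/2 + 1 = 2 for n = 3, using integer division), so both ZK1 and ZK2 consider the Leader elected.

(5) Changing server state

During voting, every server is in the LOOKING state. Once the Leader is determined, each server updates its state: a Follower changes to FOLLOWING, the Leader changes to LEADING, and an Observer changes to OBSERVING.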
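The five steps above can be replayed in a few lines of code. This is a simplified sketch of the comparison and counting rules exactly as described in this section (ZXID first, then myid, then a strict majority of the ensemble); `ElectionRoundSketch`, `Vote`, `beats`, and `tally` are hypothetical names, and networking, rounds, and server states are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical replay of the startup election described above; not ZooKeeper code.
final class ElectionRoundSketch {
    /** A vote is (myid of the proposed leader, ZXID of the proposed leader). */
    static final class Vote {
        final long myid;
        final long zxid;

        Vote(long myid, long zxid) {
            this.myid = myid;
            this.zxid = zxid;
        }
    }

    /** True if candidate should replace current: larger ZXID wins, ties go to larger myid. */
    static boolean beats(Vote candidate, Vote current) {
        return candidate.zxid > current.zxid
            || (candidate.zxid == current.zxid && candidate.myid > current.myid);
    }

    /** Returns the elected myid once more than half of ensembleSize agree, otherwise -1. */
    static long tally(Map<Long, Vote> finalVotes, int ensembleSize) {
        Map<Long, Integer> counts = new HashMap<>();
        for (Vote v : finalVotes.values()) {
            int count = counts.merge(v.myid, 1, Integer::sum);
            if (count > ensembleSize / 2) {
                return v.myid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Vote zk1 = new Vote(1, 0);                    // step 1: everyone votes for itself
        Vote zk2 = new Vote(2, 0);
        Vote zk1Final = beats(zk2, zk1) ? zk2 : zk1;  // step 3: ZK1 switches to (2, 0)
        Vote zk2Final = beats(zk1, zk2) ? zk1 : zk2;  // ZK2 keeps (2, 0)

        Map<Long, Vote> finalVotes = new HashMap<>(); // voter myid -> final vote
        finalVotes.put(1L, zk1Final);
        finalVotes.put(2L, zk2Final);
        System.out.println(tally(finalVotes, 3));     // step 4: prints 2
    }
}
```

With only one final vote in the map, `tally` returns -1 for a 3-node ensemble, which matches the observation that ZK1 alone cannot elect a Leader.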

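2. Leader election while the servers are running

Once ZooKeeper has elected a Leader and is running normally, non-Leader nodes going offline or coming online do not affect the Leader. But if the Leader node goes down, the cluster cannot serve requests and enters a new round of leader election.

(1) Changing state

After the Leader goes down, all non-Observer nodes change their state to LOOKING.

(2) Each server casts a vote

Votes are again of the form (myid, ZXID). Because the cluster has been running, the ZXIDs of the nodes may differ. As in initialization, each server first votes for itself: ZK1 generates the vote (1, 122) and ZK2 generates (2, 123), and each broadcasts its vote.

(3) Each ZK server receives the votes

(4) Counting votes. Again ZXID is compared first, then myid, so all nodes converge on (2, 123).

(5) Changing each server's state: the Leader changes to LEADING, Followers change to FOLLOWING, and Observers change to OBSERVING.

Summary: the larger a server's ZXID, the more likely it is to become Leader; when ZXIDs are equal, the server with the larger myid is more likely to become Leader.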

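7.6.2 Election Algorithm

Older versions of ZK provided three election algorithms, but ZK now keeps only the TCP version of FastLeaderElection, implemented in the class FastLeaderElection.

7.6.3 Implementation Details

Now let's look at the source code.

First, the vote:

public class Vote {

    // myid of the server this vote proposes as Leader
    private final long id;
    // transaction ID (ZXID) of the proposed Leader
    private final long zxid;
    // logical clock; defaults to -1 if not set and is incremented by 1 at the start of each voting round, so that the votes each ZK node processes belong to the same round
    private final long electionEpoch;
    // epoch of the proposed Leader
    private final long peerEpoch;
    // state of the voting ZK server, enum: LOOKING, FOLLOWING, LEADING, OBSERVING
    private final ServerState state;
}

Next is the implementation of the voting algorithm, the class FastLeaderElection. The source is long, so we only look at a few key methods.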

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }

        self.start_fle = Time.currentElapsedTime();
        try {
            /*
             * The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
             * if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
             * of participants has voted for it.
             */
            Map<Long, Vote> recvset = new HashMap<Long, Vote>();

            /*
             * The votes from previous leader elections, as well as the votes from the current leader election are
             * stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
             * Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
             * outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
             * the electionEpoch of the received notifications) in a leader election.
             */
            Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = minNotificationInterval;

            synchronized (this) {
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info(
                "New election. My id = {}, proposed zxid=0x{}",
                self.getId(),
                Long.toHexString(proposedZxid));
            sendNotifications();

            SyncedLearnerTracker voteSet = null;

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if (n == null) {
                    if (manager.haveDelivered()) {
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    notTimeout = Math.min(notTimeout << 1, maxNotificationInterval);

                    /*
                     * When a leader failure happens on a master, the backup will be supposed to receive the honour from
                     * Oracle and become a leader, but the honour is likely to be delay. We do a re-check once timeout happens
                     *
                     * The leader election algorithm does not provide the ability of electing a leader from a single instance
                     * which is in a configuration of 2 instances.
                     * */
                    if (self.getQuorumVerifier() instanceof QuorumOracleMaj
                            && self.getQuorumVerifier().revalidateVoteset(voteSet, notTimeout != minNotificationInterval)) {
                        setPeerState(proposedLeader, voteSet);
                        Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }

                    LOG.info("Notification time out: {} ms", notTimeout);

                } else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        if (getInitLastLoggedZxid() == -1) {
                            LOG.debug("Ignoring notification as our zxid is -1");
                            break;
                        }
                        if (n.zxid == -1) {
                            LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                            break;
                        }
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                                LOG.debug(
                                    "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                                    Long.toHexString(n.electionEpoch),
                                    Long.toHexString(logicalclock.get()));
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        LOG.debug(
                            "Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
                            n.sid,
                            n.leader,
                            Long.toHexString(n.zxid),
                            Long.toHexString(n.electionEpoch));

                        // don't care about the version if it's in LOOKING state
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

                        if (voteSet.hasAllQuorums()) {

                            // Verify if there is any change in the proposed leader
                            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                setPeerState(proposedLeader, voteSet);
                                Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: {}", n.sid);
                        break;

                        /*
                        * In ZOOKEEPER-3922, we separate the behaviors of FOLLOWING and LEADING.
                        * To avoid the duplication of codes, we create a method called followingBehavior which was used to
                        * shared by FOLLOWING and LEADING. This method returns a Vote. When the returned Vote is null, it follows
                        * the original idea to break swtich statement; otherwise, a valid returned Vote indicates, a leader
                        * is generated.
                        *
                        * The reason why we need to separate these behaviors is to make the algorithm runnable for 2-node
                        * setting. An extra condition for generating leader is needed. Due to the majority rule, only when
                        * there is a majority in the voteset, a leader will be generated. However, in a configuration of 2 nodes,
                        * the number to achieve the majority remains 2, which means a recovered node cannot generate a leader which is
                        * the existed leader. Therefore, we need the Oracle to kick in this situation. In a two-node configuration, the Oracle
                        * only grants the permission to maintain the progress to one node. The oracle either grants the permission to the
                        * remained node and makes it a new leader when there is a faulty machine, which is the case to maintain the progress.
                        * Otherwise, the oracle does not grant the permission to the remained node, which further causes a service down.
                        *
                        * In the former case, when a failed server recovers and participate in the leader election, it would not locate a
                        * new leader because there does not exist a majority in the voteset. It fails on the containAllQuorum() infinitely due to
                        * two facts. First one is the fact that it does do not have a majority in the voteset. The other fact is the fact that
                        * the oracle would not give the permission since the oracle already gave the permission to the existed leader, the healthy machine.
                        * Logically, when the oracle replies with negative, it implies the existed leader which is LEADING notification comes from is a valid leader.
                        * To threat this negative replies as a permission to generate the leader is the purpose to separate these two behaviors.
                        *
                        *
                        * */
                    case FOLLOWING:
                        /*
                        * To avoid duplicate codes
                        * */
                        Vote resultFN = receivedFollowingNotification(recvset, outofelection, voteSet, n);
                        if (resultFN == null) {
                            break;
                        } else {
                            return resultFN;
                        }
                    case LEADING:
                        /*
                        * In leadingBehavior(), it performs followingBehvior() first. When followingBehavior() returns
                        * a null pointer, ask Oracle whether to follow this leader.
                        * */
                        Vote resultLN = receivedLeadingNotification(recvset, outofelection, voteSet, n);
                        if (resultLN == null) {
                            break;
                        } else {
                            return resultLN;
                        }
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)", n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if (self.jmxLeaderElectionBean != null) {
                    MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
        }
    }
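The method above calls totalOrderPredicate several times without showing its body. A simplified sketch of that comparison follows. It assumes the ordering is peer epoch first, then ZXID, then server id, which extends the (ZXID, myid) rule from section 7.6.1 with the epoch argument visible at the call sites; `VoteOrderSketch` is a hypothetical class, and any quorum-weight check in the real implementation is ignored here.

```java
// Simplified sketch of the vote ordering used by lookForLeader; not the upstream code.
final class VoteOrderSketch {
    /**
     * Returns true if the received vote (newId, newZxid, newEpoch) should replace
     * the current proposal (curId, curZxid, curEpoch).
     */
    static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                       long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // a newer epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // then the more up-to-date transaction log
        }
        return newId > curId;           // finally the larger server id breaks the tie
    }

    public static void main(String[] args) {
        // Startup example: (2, 0) against (1, 0) in the same epoch.
        System.out.println(totalOrderPredicate(2, 0, 0, 1, 0, 0));     // true
        // Running example: (1, 122) against (2, 123) in the same epoch.
        System.out.println(totalOrderPredicate(1, 122, 1, 2, 123, 1)); // false
    }
}
```

In lookForLeader, a true result leads to updateProposal followed by sendNotifications, that is, the server adopts the better vote and rebroadcasts it.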
