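7. ZooKeeper Internals
The most important thing about ZK is that it uses its own data model to achieve consistency in a distributed system.
7.1 System Model
7.1.1 Data Model
The znode (data node) is the smallest data unit, and znodes are organized into a tree.
For every transaction request, ZooKeeper assigns a globally unique transaction ID, the ZXID. Each ZXID corresponds to one update operation, and ZXIDs are what guarantee a global ordering of operations.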
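7.1.2 Node Types
There are three node properties: persistent (PERSISTENT), ephemeral (EPHEMERAL), and sequential (SEQUENTIAL). Combined at creation time, they yield four node types: persistent, persistent sequential, ephemeral, and ephemeral sequential. An ephemeral node can only be a leaf node, and its lifetime is bound to the client session, not to the TCP connection.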
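7.1.3 Znode Version Information: Guaranteeing Atomic Distributed Data Operations
Znode version information is used to make distributed data operations atomic. Every data node carries three versions: version (the data version), cversion (the children version), and aversion (the ACL version).
version is the number of times the node's data has been updated since creation. A newly created znode has version 0, and every update increments it, even when the new data is identical to the old. The version therefore records whether some client has written to the node, not whether the content differs, which is what lets a version check catch the ABA problem that a comparison of values alone would miss.
Optimistic locking splits a transaction into three phases: data read, write validation, and data write.
When a znode handles a setDataRequest, the server compares versions. The client may choose whether to use CAS: passing the expected version requests the check, while passing -1 skips it. If the check is requested and the versions do not match, a BadVersionException is thrown.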
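The three optimistic-locking phases can be sketched without a running server. The class below is only an in-memory illustration, not ZooKeeper code: VersionedNodeSketch and BadVersionSketchException are hypothetical names, and the only behaviour borrowed from ZooKeeper is that -1 skips the version check and that every successful write increments the version.

```java
import java.nio.charset.StandardCharsets;

final class VersionedNodeSketch {
    /** Hypothetical stand-in for ZooKeeper's BadVersionException. */
    static final class BadVersionSketchException extends RuntimeException {
        BadVersionSketchException(String message) {
            super(message);
        }
    }

    private byte[] data;
    private int version = 0; // a newly created node starts at version 0

    VersionedNodeSketch(byte[] initialData) {
        this.data = initialData.clone();
    }

    synchronized int getVersion() {
        return version;
    }

    synchronized byte[] getData() {
        return data.clone();
    }

    /**
     * Write validation followed by the write itself.
     * An expectedVersion of -1 skips the check; any other value must match.
     */
    synchronized int setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            throw new BadVersionSketchException(
                "expected version " + expectedVersion + " but node is at " + version);
        }
        data = newData.clone();
        version++; // incremented even when newData equals the old data
        return version;
    }

    public static void main(String[] args) {
        VersionedNodeSketch node = new VersionedNodeSketch("A".getBytes(StandardCharsets.UTF_8));
        int seen = node.getVersion();                                 // phase 1: read
        node.setData("B".getBytes(StandardCharsets.UTF_8), -1);       // another client writes B
        node.setData("A".getBytes(StandardCharsets.UTF_8), -1);       // and then writes A again (ABA)
        try {
            node.setData("C".getBytes(StandardCharsets.UTF_8), seen); // phases 2 and 3
        } catch (BadVersionSketchException e) {
            System.out.println("stale write rejected: " + e.getMessage());
        }
    }
}
```

In main, the node's content is back to "A" when the last write arrives, yet the write is rejected because the version has moved on. That is the ABA case a value comparison would not notice.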
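7.1.4 Watcher: Data Change Notification
A client registers a Watcher with the server. When a specified event on the server triggers that Watcher, the server sends an event notification to the client concerned, which is how ZooKeeper implements distributed notification.
When a client registers a Watcher with ZooKeeper, it also stores the Watcher object in its client-side WatchManager. After the server triggers the Watcher event it sends a notification to the client, and a client thread takes the matching Watcher out of the WatchManager and runs its callback logic.
The Watcher interface defines two enums, KeeperState and EventType, and one method, process(WatchedEvent event).
In essence, the server still maintains a Map<String, Set<Watcher>> that maps each node path to the set of Watchers registered on that path.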
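Watcher characteristics in summary:
One-shot: once a Watcher is triggered, it is removed from the Set<Watcher>. When using Watchers, decide whether the Watcher needs to be registered again after it fires.
Lightweight: WatchedEvent is the smallest notification unit of the whole ZooKeeper Watcher mechanism and contains only three member variables (the notification state, the event type, and the node path). The process callback only tells the client that an event occurred, not what changed; the client has to fetch the data itself. This is what keeps the notification mechanism lightweight.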
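The one-shot and lightweight properties can be sketched with a path-to-watchers map like the one described above. This is a simplified illustration, not ZooKeeper's WatchManager: OneShotWatchSketch and its nested types are hypothetical names, and the event here carries only a type and a path.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class OneShotWatchSketch {
    enum EventType { NODE_CREATED, NODE_DELETED, NODE_DATA_CHANGED }

    /** Carries only what happened and where, never the node's data. */
    static final class Event {
        final EventType type;
        final String path;

        Event(EventType type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    interface Watcher {
        void process(Event event);
    }

    // node path -> watchers registered on that path
    private final Map<String, Set<Watcher>> watchTable = new HashMap<>();

    synchronized void addWatch(String path, Watcher watcher) {
        watchTable.computeIfAbsent(path, p -> new HashSet<>()).add(watcher);
    }

    /** Removes the watchers for the path before notifying them, so each fires at most once. */
    void trigger(String path, EventType type) {
        Set<Watcher> watchers;
        synchronized (this) {
            watchers = watchTable.remove(path);
        }
        if (watchers == null) {
            return; // nobody is watching this path any more
        }
        Event event = new Event(type, path);
        for (Watcher w : watchers) {
            w.process(event);
        }
    }
}
```

A watcher that wants further notifications has to call addWatch again from inside its callback, which mirrors the re-registration decision mentioned above.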
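7.1.5 ACL Access Control
ZooKeeper replaces the UGO (user, group, others) permission model with ACLs of the form scheme:id:permission.
7.2 Serialization Protocol
ZK uses the Jute serialization component.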
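7.3 Client
Core components:
ZooKeeper instance: the client entry point
ClientWatchManager: the client-side Watcher manager
HostProvider: the server address manager
ClientCnxn: the client's core thread, which in turn contains SendThread (used to establish TCP communication) and EventThread (the event-processing thread). Together these components carry out the process by which a ZK client creates a session.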
7.4 Sessions
7.4.1 Session State Transitions
Image source: [ZooKeeper]ZooKeeper的会话状态_zjysource的专栏-CSDN博客_zookeeper会话状态
The session states are CONNECTING, CONNECTED, RECONNECTING, RECONNECTED, and CLOSE.
7.4.2 Session Creation
(1) Session: the client session entity
interface Session {
    // The sessionId uniquely identifies a session
    long getSessionId();
    // Session timeout
    int getTimeout();
    // Whether the session is closing
    boolean isClosing();
}
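The sessionId generation rule: shifting the time left by 24 bits discards its high 24 bits (normally leading zeros), and the unsigned right shift by 8 then leaves the top 8 bits zero to make room for the id. The id, shifted left by 56, is then ORed into those top 8 bits of the 64-bit value, giving a sessionId derived from the time and the server id. (Whenever an ID is generated from a timestamp, I immediately think of clock rollback.)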
public static long initializeNextSessionId(long id) {
    long nextSid;
    nextSid = (Time.currentElapsedTime() << 24) >>> 8;
    nextSid = nextSid | (id << 56);
    if (nextSid == EphemeralType.CONTAINER_EPHEMERAL_OWNER) {
        ++nextSid; // this is an unlikely edge case, but check it just in case
    }
    return nextSid;
}
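To make the bit layout concrete, here is a small sketch of the same two shifts with the time passed in as a parameter so the result is reproducible. SessionIdSketch and its method names are hypothetical, and the CONTAINER_EPHEMERAL_OWNER edge case from the listing above is left out.

```java
final class SessionIdSketch {
    /** Same shifts as initializeNextSessionId, but with an explicit time argument. */
    static long initialSessionId(long serverId, long elapsedMillis) {
        // << 24 drops the high 24 bits; >>> 8 moves the remaining 40 bits to bits 16..55
        long timePart = (elapsedMillis << 24) >>> 8;
        // the server id occupies bits 56..63
        return timePart | (serverId << 56);
    }

    /** Recovers the 8-bit server id from the top byte. */
    static long serverIdOf(long sessionId) {
        return sessionId >>> 56;
    }

    /** Recovers the low 40 bits of the time that was embedded in the id. */
    static long timeBitsOf(long sessionId) {
        return (sessionId << 8) >>> 24;
    }

    public static void main(String[] args) {
        long sid = initialSessionId(1L, 0x1234L);
        System.out.println(Long.toHexString(sid));
        System.out.println(serverIdOf(sid) + " " + Long.toHexString(timeBitsOf(sid)));
    }
}
```

The low 16 bits start out as zero, which leaves room for a counter that can be incremented for each new session created on the same server.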
Finally, ZK provides the SessionTracker interface to manage sessions.
// Maps sessionId to session
protected final ConcurrentHashMap<Long, SessionImpl> sessionsById = new ConcurrentHashMap<Long, SessionImpl>();
// Maps sessionId to session timeout
protected final ConcurrentMap<Long, Integer> sessionsWithTimeout;
7.4.3 Session Management
Session management in ZK is mainly the job of SessionTracker, which uses a "bucketing strategy" (not important).
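For readers who do want the gist of bucketing, the sketch below shows the idea as I understand it: each session's deadline is rounded up to the next multiple of a fixed interval, so that sessions expiring around the same time share a bucket and can be expired in one batch. SessionBucketSketch and its method names are hypothetical; this is not the SessionTracker implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SessionBucketSketch {
    private final long expirationInterval;
    // bucket deadline -> ids of sessions expiring in that bucket
    private final Map<Long, Set<Long>> buckets = new HashMap<>();
    // session id -> the bucket deadline it currently sits in
    private final Map<Long, Long> bucketOfSession = new HashMap<>();

    SessionBucketSketch(long expirationInterval) {
        this.expirationInterval = expirationInterval;
    }

    /** Rounds a time up to the next multiple of the interval. */
    static long roundToNextInterval(long time, long interval) {
        return (time / interval + 1) * interval;
    }

    /** Called on session creation and on every heartbeat; returns the session's bucket deadline. */
    long touch(long sessionId, long now, long timeout) {
        long bucket = roundToNextInterval(now + timeout, expirationInterval);
        Long previous = bucketOfSession.put(sessionId, bucket);
        if (previous != null && previous != bucket) {
            Set<Long> old = buckets.get(previous);
            if (old != null) {
                old.remove(sessionId);
                if (old.isEmpty()) {
                    buckets.remove(previous);
                }
            }
        }
        buckets.computeIfAbsent(bucket, b -> new HashSet<>()).add(sessionId);
        return bucket;
    }

    /** Removes and returns every session that sits in the bucket with the given deadline. */
    Set<Long> expire(long bucketDeadline) {
        Set<Long> expired = buckets.remove(bucketDeadline);
        if (expired == null) {
            return Collections.emptySet();
        }
        for (Long id : expired) {
            bucketOfSession.remove(id);
        }
        return expired;
    }
}
```

With an interval of 2000 ms, a session touched at 1000 ms and another touched at 1500 ms, both with a 3000 ms timeout, land in the same 6000 ms bucket, so the expiry check only needs to run once per interval instead of once per session.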
7.6 Leader Election
A server in a ZK cluster can play one of three roles: Leader, Follower, or Observer. Over the course of an election a server can be in one of four states: LOOKING, FOLLOWING, LEADING, or OBSERVING.
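7.6.1 Election Overview
1. Leader election at cluster initialization
A ZK cluster starts from at least 2 servers. In cluster mode, each machine needs a myid file under the dataDir path configured in zoo.cfg; its content uniquely identifies that ZK server within the cluster. Take 3 servers as an example, with myid values 1, 2, and 3. When the cluster starts and only ZK1 is up, no Leader election can complete and the cluster has no Leader. Once ZK2 starts and establishes communication with ZK1, the cluster can elect a Leader, and the election proceeds as follows.
(1) Every server casts a vote of the form (myid, ZXID). Because this is the initialization phase, each server votes for itself with ZXID 0: ZK1 votes (1, 0), ZK2 votes (2, 0), and each sends its vote to the other ZK servers in the cluster.
(2) Every server receives votes and checks their validity (whether the vote belongs to the current round, and whether it comes from a server in the LOOKING state).
(3) Process votes. Each server compares every received vote with its own. ZXIDs are compared first, and the vote with the larger ZXID wins; if the ZXIDs are equal, the vote with the larger myid wins. Here, ZK1 receives (2, 0) from ZK2 and changes its vote from (1, 0) to (2, 0), while ZK2 receives ZK1's vote and keeps its own. ZK1 and ZK2 then send their final votes to all ZK servers in the cluster again.
(4) Count votes.
After this round, ZK1 and ZK2 have each received two votes for (2, 0). With 3 nodes in the cluster, a majority is n/2 + 1 = 2 votes (integer division, n = 3), so the votes for ZK2 already form a majority, and both ZK1 and ZK2 consider the Leader elected.
(5) Change server state.
During voting every server is in the LOOKING state. Once the Leader is determined, each server changes its state: a Follower switches to FOLLOWING, the Leader to LEADING, and an Observer to OBSERVING.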
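The counting step can be sketched as a tally over the votes a server has received. MajorityTallySketch is a hypothetical helper that assumes a plain majority rule over all voting members; the real implementation delegates this decision to a quorum verifier.

```java
import java.util.HashMap;
import java.util.Map;

final class MajorityTallySketch {
    /** True when votes is strictly more than half of the ensemble. */
    static boolean isMajority(int votes, int ensembleSize) {
        return votes > ensembleSize / 2;
    }

    /**
     * recvset maps the sid of each voter to the sid it currently proposes as Leader.
     * Returns the proposed sid backed by a majority, or -1 if there is none yet.
     */
    static long electedLeader(Map<Long, Long> recvset, int ensembleSize) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long proposed : recvset.values()) {
            counts.merge(proposed, 1, Integer::sum);
        }
        for (Map.Entry<Long, Integer> entry : counts.entrySet()) {
            if (isMajority(entry.getValue(), ensembleSize)) {
                return entry.getKey();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 2L); // ZK1 now proposes ZK2
        recvset.put(2L, 2L); // ZK2 proposes itself
        System.out.println(electedLeader(recvset, 3));
    }
}
```

In the three-server example, two votes for sid 2 are enough even though ZK3 has not started, because 2 is already more than half of 3.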
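2. Leader election while the servers are running
Once ZooKeeper has elected a Leader and is running normally, non-Leader nodes going offline or coming online do not affect the Leader. But once the Leader goes down, the cluster cannot serve requests until a new round of Leader election completes.
(1) Change state
After the Leader goes down, every non-Observer node changes its state to LOOKING.
(2) Every server casts a vote
Each server again generates a vote (myid, ZXID). Because the cluster has been running, the ZXIDs on different nodes may differ. As in initialization, each server first votes for itself: ZK1 generates (1, 122), ZK2 generates (2, 123), and each broadcasts its vote.
(3) Every ZK server receives the votes
(4) Count votes. Again ZXIDs are compared first and then myid, so all nodes converge on (2, 123).
(5) Change server state: the Leader switches to LEADING, Followers to FOLLOWING, and Observers to OBSERVING.
Summary: the larger a server's ZXID, the more likely it is to become Leader; when ZXIDs are equal, the server with the larger myid is more likely to win.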
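The comparison rule in the summary can be written as a single predicate. The sketch below is a simplified take on what lookForLeader calls totalOrderPredicate: to my understanding the source also compares the peer epoch before the ZXID, so the rule in the text corresponds to the case where the epochs are equal. VoteOrderSketch and newVoteWins are hypothetical names, and any server-weight check is omitted.

```java
final class VoteOrderSketch {
    /** Returns true when the new vote should replace the current one. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // a higher epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // then the more up-to-date transaction log
        }
        return newId > curId;           // finally the larger myid breaks the tie
    }

    public static void main(String[] args) {
        // Initialization: (2, 0) against (1, 0), same epoch
        System.out.println(newVoteWins(2, 0, 0, 1, 0, 0));
        // Running cluster: (1, 122) against (2, 123), same epoch
        System.out.println(newVoteWins(1, 122, 0, 2, 123, 0));
    }
}
```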
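7.6.2 Election Algorithms
Older versions of ZK offered three election algorithms, but ZK now keeps only the TCP version of FastLeaderElection, implemented in the class FastLeaderElection.
7.6.3 Implementation Details
Let's look at the source code.
First, the vote: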
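public class Vote {
    // myid of the proposed Leader (initially the server's own myid)
    private final long id;
    // Transaction ID (ZXID) of the proposed Leader
    private final long zxid;
    // Logical clock; defaults to -1 if unset and is incremented at the start of each election round, so that the votes a ZK node counts all belong to the same round
    private final long electionEpoch;
    // Epoch of the proposed Leader
    private final long peerEpoch;
    // State of the ZK server that cast the vote, enum: LOOKING, FOLLOWING, LEADING, OBSERVING
    private final ServerState state;
}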
Next comes the election algorithm itself, the class FastLeaderElection. The source is long, so let's look at a few key methods.
/**
 * Starts a new round of leader election. Whenever our QuorumPeer
 * changes its state to LOOKING, this method is invoked, and it
 * sends notifications to all other peers.
 */
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    self.start_fle = Time.currentElapsedTime();
    try {
        /*
         * The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
         * if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
         * of participants has voted for it.
         */
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();
        /*
         * The votes from previous leader elections, as well as the votes from the current leader election are
         * stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
         * Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
         * outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
         * the electionEpoch of the received notifications) in a leader election.
         */
        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();
        int notTimeout = minNotificationInterval;
        synchronized (this) {
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        LOG.info(
            "New election. My id = {}, proposed zxid=0x{}",
            self.getId(),
            Long.toHexString(proposedZxid));
        sendNotifications();
        SyncedLearnerTracker voteSet = null;
        /*
         * Loop in which we exchange notifications until we find a leader
         */
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if (n == null) {
                if (manager.haveDelivered()) {
                    sendNotifications();
                } else {
                    manager.connectAll();
                }
                /*
                 * Exponential backoff
                 */
                notTimeout = Math.min(notTimeout << 1, maxNotificationInterval);
                /*
                 * When a leader failure happens on a master, the backup will be supposed to receive the honour from
                 * Oracle and become a leader, but the honour is likely to be delay. We do a re-check once timeout happens
                 *
                 * The leader election algorithm does not provide the ability of electing a leader from a single instance
                 * which is in a configuration of 2 instances.
                 * */
                if (self.getQuorumVerifier() instanceof QuorumOracleMaj
                    && self.getQuorumVerifier().revalidateVoteset(voteSet, notTimeout != minNotificationInterval)) {
                    setPeerState(proposedLeader, voteSet);
                    Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                    leaveInstance(endVote);
                    return endVote;
                }
                LOG.info("Notification time out: {} ms", notTimeout);
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    if (getInitLastLoggedZxid() == -1) {
                        LOG.debug("Ignoring notification as our zxid is -1");
                        break;
                    }
                    if (n.zxid == -1) {
                        LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                        break;
                    }
                    // If notification > current, replace and send messages out
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        LOG.debug(
                            "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                            Long.toHexString(n.electionEpoch),
                            Long.toHexString(logicalclock.get()));
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }
                    LOG.debug(
                        "Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
                        n.sid,
                        n.leader,
                        Long.toHexString(n.zxid),
                        Long.toHexString(n.electionEpoch));
                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
                    if (voteSet.hasAllQuorums()) {
                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }
                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                /*
                 * In ZOOKEEPER-3922, we separate the behaviors of FOLLOWING and LEADING.
                 * To avoid the duplication of codes, we create a method called followingBehavior which was used to
                 * shared by FOLLOWING and LEADING. This method returns a Vote. When the returned Vote is null, it follows
                 * the original idea to break swtich statement; otherwise, a valid returned Vote indicates, a leader
                 * is generated.
                 *
                 * The reason why we need to separate these behaviors is to make the algorithm runnable for 2-node
                 * setting. An extra condition for generating leader is needed. Due to the majority rule, only when
                 * there is a majority in the voteset, a leader will be generated. However, in a configuration of 2 nodes,
                 * the number to achieve the majority remains 2, which means a recovered node cannot generate a leader which is
                 * the existed leader. Therefore, we need the Oracle to kick in this situation. In a two-node configuration, the Oracle
                 * only grants the permission to maintain the progress to one node. The oracle either grants the permission to the
                 * remained node and makes it a new leader when there is a faulty machine, which is the case to maintain the progress.
                 * Otherwise, the oracle does not grant the permission to the remained node, which further causes a service down.
                 *
                 * In the former case, when a failed server recovers and participate in the leader election, it would not locate a
                 * new leader because there does not exist a majority in the voteset. It fails on the containAllQuorum() infinitely due to
                 * two facts. First one is the fact that it does do not have a majority in the voteset. The other fact is the fact that
                 * the oracle would not give the permission since the oracle already gave the permission to the existed leader, the healthy machine.
                 * Logically, when the oracle replies with negative, it implies the existed leader which is LEADING notification comes from is a valid leader.
                 * To threat this negative replies as a permission to generate the leader is the purpose to separate these two behaviors.
                 *
                 *
                 * */
                case FOLLOWING:
                    /*
                     * To avoid duplicate codes
                     * */
                    Vote resultFN = receivedFollowingNotification(recvset, outofelection, voteSet, n);
                    if (resultFN == null) {
                        break;
                    } else {
                        return resultFN;
                    }
                case LEADING:
                    /*
                     * In leadingBehavior(), it performs followingBehvior() first. When followingBehavior() returns
                     * a null pointer, ask Oracle whether to follow this leader.
                     * */
                    Vote resultLN = receivedLeadingNotification(recvset, outofelection, voteSet, n);
                    if (resultLN == null) {
                        break;
                    } else {
                        return resultLN;
                    }
                default:
                    LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)", n.state, n.sid);
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
    }
}