一、ZooKeeper术语
- 集群角色
角色 | 任务 |
---|---|
Leader | 1》事物请求的唯一调度和处理者,保证集群事物处理的顺序性。2》集群内部各服务器的调度者。 |
Follower | 1》处理客户端非事物请求、转发事物请求给 leader 服务器。2》参与事物请求 Proposal 的投票(Leader 发起的提案,要求 Follower 投票,需要半数以上follower节点通过,leader才会commit数据。)3》参与 Leader 选举的投票。 |
Observer | observer服务器只提供非事物请求服务,通常在于不影响集群事务处理能力的前提下提升集群非事物处理的能力。 |
-
会话
客户端与服务端之间任何交互操作都与会话息息相关,如临时节点的生命周期、客户端请求的执行顺序、Watcher通知机制等。Zookeeper的连接与会话就是客户端通过实例化Zookeeper对象来实现客户端与服务端创建并保持TCP连接的过程。 -
Zxid
zxid,也就是事务id, 为了保证事务的顺序一致性,zookeeper 采用了递增的事 务 id 号(zxid)来标识事务。所有的提议(proposal)都 在被提出的时候加上了 zxid。实现中 zxid 是一个 64 位的 数字,它高32位是epoch(ZAB协议通过epoch编号来 区分 Leader 周期变化的策略)用来标识 leader 关系是否 改变,每次一个 leader 被选出来,它都会有一个新的 epoch=(原来的epoch+1),标识当前属于那个leader的 统治时期。低32位用于递增计数。 -
数据节点
节点类型 | 含义 | 解析 |
---|---|---|
PERSISTENT | 持久化目录节点 | 客户端与zookeeper断开连接后,该节点依旧存在 |
PERSISTENT_SEQUENTIAL | 持久化顺序编号目录节点 | 客户端与zookeeper断开连接后,该节点依旧存在,只是Zookeeper给该节点名称进行顺序编号 |
EPHEMERAL | -临时目录节点 | 客户端与zookeeper断开连接后,该节点被删除 |
EPHEMERAL_SEQUENTIAL | 临时顺序编号目录节点 | 客户端与zookeeper断开连接后,该节点被删除,只是Zookeeper给该节点名称进行顺序编号 |
- ACL
权限控制,使用:scheme: id:perm 来标识,主要涵盖 3 个方面:
权限模式(Scheme):授权的策略
授权对象(ID):授权的对象
权限(Permission):授予的权限
其特性如下:
ZooKeeper的权限控制是基于每个znode节点的,需要对每个节点设置权限
每个znode支持设置多种权限控制方案和多个权限
子节点不会继承父节点的权限,客户端无权访问某节点,但可能可以访问它的子节点
二、原理及源码分析
Server端
从启动脚本 zkServer.sh 中可以看出启动类为org.apache.zookeeper.server.quorum.QuorumPeerMain
主方法只有org.apache.zookeeper.server.quorum.QuorumPeerMain#initializeAndRun。
org.apache.zookeeper.server.quorum.QuorumPeerConfig是配置类,解析zoo.cfg,主要看org.apache.zookeeper.server.quorum.QuorumPeerConfig#parseProperties方法。
org.apache.zookeeper.server.DatadirCleanupManager作用是启动并安排清除任务,清除什么?
// Start and schedule the the purge task
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(
config.getDataDir(),
config.getDataLogDir(),
config.getSnapRetainCount(),
config.getPurgeInterval());
purgeMgr.start();
清理日志和快照文件(不是特别重要的功能)。snapRetainCount 指定保留文件数,最少3个,zoo.cfg配置项为autopurge.snapRetainCount;purgeInterval指定清理频率,单位小时,默认0即不开启自动清理,zoo.cfg配置项为autopurge.purgeInterval。
if (args.length == 1 && config.isDistributed()) {
//集群模式
runFromConfig(config);
} else {
LOG.warn("Either no config or no quorum defined in config, running in standalone mode");
//走单机模式
ZooKeeperServerMain.main(args);
}
走集群模式 org.apache.zookeeper.server.quorum.QuorumPeerMain#runFromConfig
现在来看下 args.length == 1 && config.isDistributed() 这个和集群的关系
public boolean isDistributed() {
//要搞清楚 quorumVerifier和standaloneEnabled 是什么?怎么来的?
return quorumVerifier != null && (!standaloneEnabled || quorumVerifier.getVotingMembers().size() > 1);
}
先看 standaloneEnabled
//默认单节点部署,可以通过 reconfigEnabled 属性设置,我的配置文件没有设置,这里一直是true
private static boolean standaloneEnabled = true;
再看 quorumVerifier
void setupQuorumPeerConfig(Properties prop, boolean configBackwardCompatibilityMode) throws IOException, ConfigException {
quorumVerifier = parseDynamicConfig(prop, electionAlg, true, configBackwardCompatibilityMode);
setupMyId();
setupClientPort();
setupPeerType();
checkValidity();
}
//根据调用链最终执行下列方法,这里根据zoo.cfg里面的server.n配置获取IP和端口
public QuorumMaj(Properties props) throws ConfigException {
for (Entry<Object, Object> entry : props.entrySet()) {
String key = entry.getKey().toString();
String value = entry.getValue().toString();
if (key.startsWith("server.")) {
int dot = key.indexOf('.');
long sid = Long.parseLong(key.substring(dot + 1));
QuorumServer qs = new QuorumServer(sid, value);
allMembers.put(Long.valueOf(sid), qs);
if (qs.type == LearnerType.PARTICIPANT) {
votingMembers.put(Long.valueOf(sid), qs);
} else {
observingMembers.put(Long.valueOf(sid), qs);
}
} else if (key.equals("version")) {
version = Long.parseLong(value, 16);
}
}
//半数
half = votingMembers.size() / 2;
}
走单机模式 org.apache.zookeeper.server.ZooKeeperServerMain#main 就不看了
接下来看看org.apache.zookeeper.server.quorum.QuorumPeerMain#runFromConfig,顾名思义“根据配置运行”
public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
try {
//注册log4j的Bean
ManagedUtil.registerLog4jMBeans();
} catch (JMException e) {
LOG.warn("Unable to register log4j JMX control", e);
}
LOG.info("Starting quorum peer, myid=" + config.getServerId());
MetricsProvider metricsProvider;
try {
//先启动度量程序 MetricsProviderClassName来源于 zoo.cfg 的 metricsProvider.className,默认DefaultMetricsProvider.class.getName();MetricsProviderConfiguration来源于 zoo.cfg metricsProvider.配置
//org.apache.zookeeper.metrics.impl.DefaultMetricsProvider#start 默认的是个空方法
metricsProvider = MetricsProviderBootstrap.startMetricsProvider(
config.getMetricsProviderClassName(),
config.getMetricsProviderConfiguration());
} catch (MetricsProviderLifeCycleException error) {
throw new IOException("Cannot boot MetricsProvider " + config.getMetricsProviderClassName(), error);
}
try {
ServerMetrics.metricsProviderInitialized(metricsProvider);
//客户端与服务端的网络通信类
ServerCnxnFactory cnxnFactory = null;
//安全客户端与服务端的网络通信类
ServerCnxnFactory secureCnxnFactory = null;
//clientPortAddress在加载zoo.cfg中根据clientPort属性初始化了
if (config.getClientPortAddress() != null) {
//默认使用NIOServerCnxnFactory,另一个实现是NettyServerCnxnFactory,通过系统属性zookeeper.serverCnxnFactory设置
cnxnFactory = ServerCnxnFactory.createFactory();
//clientPortListenBacklog 背压,默认-1 maxClientCnxns 单个客户端与单台服务器之间的连接数的限制,是ip级别的,默认是60,如果设置为0,那么表明不作任何限制。
cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), false);
}
//SecureClientPortAddress在加载zoo.cfg中根据secureClientPort属性初始化了,我这边没有配置
if (config.getSecureClientPortAddress() != null) {
secureCnxnFactory = ServerCnxnFactory.createFactory();
secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), true);
}
//重点来看看QuorumPeer,此类管理仲裁协议.有三种状态:Leader election每个服务器将选举一个领导(提议自己作为一个领袖最初)、Leader服务器将处理请求并将它们转发给追随者,大多数关注者必须在请求被接受之前将其记录下来、Follower将与leader同步并复制任何事务
quorumPeer = getQuorumPeer();
quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(), config.getDataDir()));
quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
quorumPeer.enableLocalSessionsUpgrading(config.isLocalSessionsUpgradingEnabled());
//quorumPeer.setQuorumPeers(config.getAllMembers());
quorumPeer.setElectionType(config.getElectionAlg());
quorumPeer.setMyid(config.getServerId());
quorumPeer.setTickTime(config.getTickTime());
quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
quorumPeer.setInitLimit(config.getInitLimit());
quorumPeer.setSyncLimit(config.getSyncLimit());
quorumPeer.setConnectToLearnerMasterLimit(config.getConnectToLearnerMasterLimit());
quorumPeer.setObserverMasterPort(config.getObserverMasterPort());
quorumPeer.setConfigFileName(config.getConfigFilename());
quorumPeer.setClientPortListenBacklog(config.getClientPortListenBacklog());
quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
if (config.getLastSeenQuorumVerifier() != null) {
quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
}
quorumPeer.initConfigInZKDatabase();
quorumPeer.setCnxnFactory(cnxnFactory);
quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
quorumPeer.setSslQuorum(config.isSslQuorum());
quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
quorumPeer.setLearnerType(config.getPeerType());
quorumPeer.setSyncEnabled(config.getSyncEnabled());
quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
if (config.sslQuorumReloadCertFiles) {
quorumPeer.getX509Util().enableCertFileReloading();
}
quorumPeer.setMultiAddressEnabled(config.isMultiAddressEnabled());
quorumPeer.setMultiAddressReachabilityCheckEnabled(config.isMultiAddressReachabilityCheckEnabled());
quorumPeer.setMultiAddressReachabilityCheckTimeoutMs(config.getMultiAddressReachabilityCheckTimeoutMs());
// sets quorum sasl authentication configurations
quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
if (quorumPeer.isQuorumSaslAuthEnabled()) {
quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
}
quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
quorumPeer.initialize();
if (config.jvmPauseMonitorToRun) {
quorumPeer.setJvmPauseMonitor(new JvmPauseMonitor(config));
}
//前面都是在设置属性,下面是重中之重
quorumPeer.start();
ZKAuditProvider.addZKStartStopAuditLog();
quorumPeer.join();
} catch (InterruptedException e) {
// warn, but generally this is ok
LOG.warn("Quorum Peer interrupted", e);
} finally {
if (metricsProvider != null) {
try {
metricsProvider.stop();
} catch (Throwable error) {
LOG.warn("Error while stopping metrics", error);
}
}
}
}
quorumPeer.start()做了什么操作?
@Override
public synchronized void start() {
if (!getView().containsKey(myid)) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
//从硬盘加载数据(数据准备)
loadDataBase();
//启动端口监听和数据处理
startServerCnxnFactory();
try {
//控制台启动
adminServer.start();
} catch (AdminServerException e) {
LOG.warn("Problem starting AdminServer", e);
System.out.println(e);
}
//选举,后面单独写一篇
startLeaderElection();
//启动监控
startJvmPauseMonitor();
//启动线程
super.start();
}
loadDataBase()怎么准备数据?
private void loadDataBase() {
try {
zkDb.loadDataBase();
// load the epochs
long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
try {
currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
} catch (FileNotFoundException e) {
// pick a reasonable epoch number
// this should only happen once when moving to a
// new code version
currentEpoch = epochOfZxid;
LOG.info(
"{} not found! Creating with a reasonable default of {}. "
+ "This should only happen when you are upgrading your installation",
CURRENT_EPOCH_FILENAME,
currentEpoch);
writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
}
if (epochOfZxid > currentEpoch) {
// acceptedEpoch.tmp file in snapshot directory
File currentTmp = new File(getTxnFactory().getSnapDir(),
CURRENT_EPOCH_FILENAME + AtomicFileOutputStream.TMP_EXTENSION);
if (currentTmp.exists()) {
long epochOfTmp = readLongFromFile(currentTmp.getName());
LOG.info("{} found. Setting current epoch to {}.", currentTmp, epochOfTmp);
setCurrentEpoch(epochOfTmp);
} else {
throw new IOException(
"The current epoch, " + ZxidUtils.zxidToString(currentEpoch)
+ ", is older than the last zxid, " + lastProcessedZxid);
}
}
try {
acceptedEpoch = readLongFromFile(ACCEPTED_EPOCH_FILENAME);
} catch (FileNotFoundException e) {
// pick a reasonable epoch number
// this should only happen once when moving to a
// new code version
acceptedEpoch = epochOfZxid;
LOG.info(
"{} not found! Creating with a reasonable default of {}. "
+ "This should only happen when you are upgrading your installation",
ACCEPTED_EPOCH_FILENAME,
acceptedEpoch);
writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
}
if (acceptedEpoch < currentEpoch) {
throw new IOException("The accepted epoch, "
+ ZxidUtils.zxidToString(acceptedEpoch)
+ " is less than the current epoch, "
+ ZxidUtils.zxidToString(currentEpoch));
}
} catch (IOException ie) {
LOG.error("Unable to load database on disk", ie);
throw new RuntimeException("Unable to run quorum server ", ie);
}
}
首先出现的是org.apache.zookeeper.server.ZKDatabase#loadDataBase,这个方法直观看到返回一个zxid.
org.apache.zookeeper.server.persistence.FileTxnSnapLog#restore
--从dataDir目录中获取 snapshot 开头的文件,按名称倒序取100个,详情看org.apache.zookeeper.server.persistence.FileSnap#findNValidSnapshots方法。
->org.apache.zookeeper.server.persistence.SnapShot#deserialize
--通过该方法将快照文件反序列化到dataTree中
->org.apache.zookeeper.server.util.SerializeUtils#deserializeSnapshot
--从文件名中获取最后lastProcessedZxid,主要是将snapshot.xxxxxx点后面的16进制转成Long类型。
->org.apache.zookeeper.server.persistence.Util#getZxidFromName
--找到不小于lastProcessedZxid+1的文件和zxid小于lastProcessedZxid+1但是最接近lastProcessedZxid+1的log.xxxxxxxxx文件存入storedFiles .迭代遍历storedFiles的每个文件,将文件转为InputArchive对象 ,并通过该对象反序列化到hdr(事务头)和txn(事务体record)、digest(摘要)中.
->org.apache.zookeeper.server.persistence.FileTxnSnapLog#fastForwardFromEdits
--每个事务日志(hdr+txn),将其存入dataTree中,得到最新的Zxid。内存数据库dataTree初始化完毕.
->org.apache.zookeeper.server.persistence.FileTxnSnapLog#processTransaction
startServerCnxnFactory()启动端口监听
private void startServerCnxnFactory() {
if (cnxnFactory != null) {
cnxnFactory.start();
}
if (secureCnxnFactory != null) {
secureCnxnFactory.start();
}
}
//@See org.apache.zookeeper.server.NIOServerCnxnFactory#start
@Override
public void start() {
stopped = false;
if (workerPool == null) {
//内部新建了个工作线程池,workers.add(Executors.newFixedThreadPool(numWorkerThreads, new DaemonThreadFactory(threadNamePrefix)));使用了无界队列,numWorkerThreads默认2倍的机器线程数,可通过系统参数zookeeper.nio.numWorkerThreads设置。
workerPool = new WorkerService("NIOWorker", numWorkerThreads, false);
}
//org.apache.zookeeper.server.NIOServerCnxnFactory#configure方法赋值了selectorThreads、acceptThread、expirerThread。
//selectorThreads、acceptThread、expirerThread具体怎么定义?功能是什么?
//selectorThreads个数由zookeeper工程师根据经验公式Math.max((int) Math.sqrt((float) cpu核心数 / 2), 1)或通过系统参数zookeeper.nio.numSelectorThreads设置。作用是将连接分派到SelectorThread上,负责具体的读写数据。
for (SelectorThread thread : selectorThreads) {
if (thread.getState() == Thread.State.NEW) {
thread.start();
}
}
//org.apache.zookeeper.server.NIOServerCnxnFactory.AcceptThread#run 处理客户端连接,SocketChannel最终通过org.apache.zookeeper.server.NIOServerCnxnFactory.AcceptThread#doAccept中的selectorThread.addAcceptedConnection(sc)方法加入到SelectorThread线程中。
// ensure thread is started once and only once
if (acceptThread.getState() == Thread.State.NEW) {
acceptThread.start();
}
//expirerThread线程主要是清理超时连接,详看org.apache.zookeeper.server.NIOServerCnxnFactory.ConnectionExpirerThread#run
if (expirerThread.getState() == Thread.State.NEW) {
expirerThread.start();
}
}
workerPool 内容如下:
SelectorThread具体实现
org.apache.zookeeper.server.PrepRequestProcessor#pRequest的内容是不是很熟悉!就是我们用客户端连接zookeeper的操作:create、delete、setData等等。
此外leader和flower的数据同步没有写。
Client端
client端的代码相对来说会简单些
从启动脚本zkCli.sh可以看出调用org.apache.zookeeper.ZooKeeperMain方法,先实例化ZooKeeperMain对象,再调用run()方法。
public boolean parseOptions(String[] args) {
List<String> argList = Arrays.asList(args);
Iterator<String> it = argList.iterator();
while (it.hasNext()) {
String opt = it.next();
try {
//这里是不是很熟悉,就是我们启动时候输入的一些参数
if (opt.equals("-server")) {
options.put("server", it.next());
} else if (opt.equals("-timeout")) {
options.put("timeout", it.next());
} else if (opt.equals("-r")) {
options.put("readonly", "true");
} else if (opt.equals("-client-configuration")) {
options.put("client-configuration", it.next());
} else if (opt.equals("-waitforconnection")) {
options.put("waitforconnection", "true");
}
} catch (NoSuchElementException e) {
System.err.println("Error: no argument found for option " + opt);
return false;
}
if (!opt.startsWith("-")) {
command = opt;
cmdArgs = new ArrayList<String>();
cmdArgs.add(command);
while (it.hasNext()) {
cmdArgs.add(it.next());
}
return true;
}
}
return true;
}
org.apache.zookeeper.ZooKeeperMain#connectToZK顾名思义就是去连接Zookeeper
protected void connectToZK(String newHost) throws InterruptedException, IOException {
if (zk != null && zk.getState().isAlive()) {
zk.close();
}
host = newHost;
boolean readOnly = cl.getOption("readonly") != null;
if (cl.getOption("secure") != null) {
System.setProperty(ZKClientConfig.SECURE_CLIENT, "true");
System.out.println("Secure connection is enabled");
}
ZKClientConfig clientConfig = null;
if (cl.getOption("client-configuration") != null) {
try {
clientConfig = new ZKClientConfig(cl.getOption("client-configuration"));
} catch (QuorumPeerConfig.ConfigException e) {
e.printStackTrace();
ServiceUtils.requestSystemExit(ExitCode.INVALID_INVOCATION.getValue());
}
}
if (cl.getOption("waitforconnection") != null) {
connectLatch = new CountDownLatch(1);
}
int timeout = Integer.parseInt(cl.getOption("timeout"));
//初始化ZooKeeper,里面初始化了 SendThread 和 EventThread,SendThread发送和接收数据,EventThread执行事件回调
zk = new ZooKeeperAdmin(host, timeout, new MyWatcher(), readOnly, clientConfig);
if (connectLatch != null) {
if (!connectLatch.await(timeout, TimeUnit.MILLISECONDS)) {
zk.close();
throw new IOException(KeeperException.create(KeeperException.Code.CONNECTIONLOSS));
}
connectLatch = null;
}
}
之后看下org.apache.zookeeper.ZooKeeperMain#run方法
jline.console.ConsoleReader 是用来处理控制台输入的Java类库,如果没有的话 BufferedReader 获取键盘输入,都是会阻塞至有输入为止。
//最终都调用本方法
public void executeLine(String line) throws InterruptedException, IOException {
if (!line.equals("")) {
//解析命令
cl.parseCommand(line);
//记录操作
addToHistory(commandCount, line);
//处理操作
processCmd(cl);
//操作数自增
commandCount++;
}
}
最终处理的方法 processZKCmd()
protected boolean processZKCmd(MyCommandOptions co) throws CliException, IOException, InterruptedException {
String[] args = co.getArgArray();
String cmd = co.getCommand();
if (args.length < 1) {
usage();
throw new MalformedCommandException("No command entered");
}
if (!commandMap.containsKey(cmd)) {
usage();
throw new CommandNotFoundException("Command not found " + cmd);
}
boolean watch = false;
LOG.debug("Processing {}", cmd);
if (cmd.equals("quit")) {
zk.close();
ServiceUtils.requestSystemExit(exitCode);
} else if (cmd.equals("redo") && args.length >= 2) {
Integer i = Integer.decode(args[1]);
if (commandCount <= i || i < 0) { // don't allow redoing this redo
throw new MalformedCommandException("Command index out of range");
}
cl.parseCommand(history.get(i));
if (cl.getCommand().equals("redo")) {
throw new MalformedCommandException("No redoing redos");
}
history.put(commandCount, history.get(i));
processCmd(cl);
} else if (cmd.equals("history")) {
for (int i = commandCount - 10; i <= commandCount; ++i) {
if (i < 0) {
continue;
}
System.out.println(i + " - " + history.get(i));
}
} else if (cmd.equals("printwatches")) {
if (args.length == 1) {
System.out.println("printwatches is " + (printWatches ? "on" : "off"));
} else {
printWatches = args[1].equals("on");
}
} else if (cmd.equals("connect")) {
if (args.length >= 2) {
connectToZK(args[1]);
} else {
connectToZK(host);
}
}
// Below commands all need a live connection
if (zk == null || !zk.getState().isAlive()) {
System.out.println("Not connected");
return false;
}
// execute from commandMap
CliCommand cliCmd = commandMapCli.get(cmd);
if (cliCmd != null) {
cliCmd.setZk(zk);
//根据具体的操作获取 org.apache.zookeeper.cli.CliCommand 的实现,并执行 exec()方法
watch = cliCmd.parse(args).exec();
} else if (!commandMap.containsKey(cmd)) {
usage();
}
return watch;
}
随便选取一个CliCommand的子类,例如GetCommand,查看exec()方法,调用的是org.apache.zookeeper.ZooKeeper#getData(java.lang.String, org.apache.zookeeper.Watcher, org.apache.zookeeper.data.Stat)
public byte[] getData(final String path, Watcher watcher, Stat stat) throws KeeperException, InterruptedException {
final String clientPath = path;
PathUtils.validatePath(clientPath);
// the watch contains the un-chroot path
WatchRegistration wcb = null;
if (watcher != null) {
wcb = new DataWatchRegistration(watcher, clientPath);
}
final String serverPath = prependChroot(clientPath);
RequestHeader h = new RequestHeader();
h.setType(ZooDefs.OpCode.getData);
GetDataRequest request = new GetDataRequest();
request.setPath(serverPath);
request.setWatch(watcher != null);
GetDataResponse response = new GetDataResponse();
//提交请求,放入 待发送队列outgoingQueue中,由sendThread发送,并在超时时间范围内等待返回
ReplyHeader r = cnxn.submitRequest(h, request, response, wcb);
if (r.getErr() != 0) {
throw KeeperException.create(KeeperException.Code.get(r.getErr()), clientPath);
}
if (stat != null) {
DataTree.copyStat(response.getStat(), stat);
}
return response.getData();
}
SendThread是什么作用?
@Override
public void run() {
//outgoingQueue 引用到clientCnxnSocket
clientCnxnSocket.introduce(this, sessionId, outgoingQueue);
clientCnxnSocket.updateNow();
clientCnxnSocket.updateLastSendAndHeard();
int to;
long lastPingRwServer = Time.currentElapsedTime();
final int MAX_SEND_PING_INTERVAL = 10000; //10 seconds
InetSocketAddress serverAddress = null;
while (state.isAlive()) {
try {
if (!clientCnxnSocket.isConnected()) {
// don't re-establish connection if we are closing
if (closing) {
break;
}
if (rwServerAddress != null) {
serverAddress = rwServerAddress;
rwServerAddress = null;
} else {
serverAddress = hostProvider.next(1000);
}
onConnecting(serverAddress);
startConnect(serverAddress);
clientCnxnSocket.updateLastSendAndHeard();
}
if (state.isConnected()) {
// determine whether we need to send an AuthFailed event.
if (zooKeeperSaslClient != null) {
boolean sendAuthEvent = false;
if (zooKeeperSaslClient.getSaslState() == ZooKeeperSaslClient.SaslState.INITIAL) {
try {
zooKeeperSaslClient.initialize(ClientCnxn.this);
} catch (SaslException e) {
LOG.error("SASL authentication with Zookeeper Quorum member failed.", e);
changeZkState(States.AUTH_FAILED);
sendAuthEvent = true;
}
}
KeeperState authState = zooKeeperSaslClient.getKeeperState();
if (authState != null) {
if (authState == KeeperState.AuthFailed) {
// An authentication error occurred during authentication with the Zookeeper Server.
changeZkState(States.AUTH_FAILED);
sendAuthEvent = true;
} else {
if (authState == KeeperState.SaslAuthenticated) {
sendAuthEvent = true;
}
}
}
if (sendAuthEvent) {
eventThread.queueEvent(new WatchedEvent(Watcher.Event.EventType.None, authState, null));
if (state == States.AUTH_FAILED) {
eventThread.queueEventOfDeath();
}
}
}
to = readTimeout - clientCnxnSocket.getIdleRecv();
} else {
to = connectTimeout - clientCnxnSocket.getIdleRecv();
}
if (to <= 0) {
String warnInfo = String.format(
"Client session timed out, have not heard from server in %dms for session id 0x%s",
clientCnxnSocket.getIdleRecv(),
Long.toHexString(sessionId));
LOG.warn(warnInfo);
throw new SessionTimeoutException(warnInfo);
}
if (state.isConnected()) {
//1000(1 second) is to prevent race condition missing to send the second ping
//also make sure not to send too many pings when readTimeout is small
int timeToNextPing = readTimeout / 2
- clientCnxnSocket.getIdleSend()
- ((clientCnxnSocket.getIdleSend() > 1000) ? 1000 : 0);
//send a ping request either time is due or no packet sent out within MAX_SEND_PING_INTERVAL
if (timeToNextPing <= 0 || clientCnxnSocket.getIdleSend() > MAX_SEND_PING_INTERVAL) {
sendPing();
clientCnxnSocket.updateLastSend();
} else {
if (timeToNextPing < to) {
to = timeToNextPing;
}
}
}
// If we are in read-only mode, seek for read/write server
if (state == States.CONNECTEDREADONLY) {
long now = Time.currentElapsedTime();
int idlePingRwServer = (int) (now - lastPingRwServer);
if (idlePingRwServer >= pingRwTimeout) {
lastPingRwServer = now;
idlePingRwServer = 0;
pingRwTimeout = Math.min(2 * pingRwTimeout, maxPingRwTimeout);
pingRwServer();
}
to = Math.min(to, pingRwTimeout - idlePingRwServer);
}
//主要是该方法执行数据读写
clientCnxnSocket.doTransport(to, pendingQueue, ClientCnxn.this);
} catch (Throwable e) {
if (closing) {
// closing so this is expected
LOG.warn(
"An exception was thrown while closing send thread for session 0x{}.",
Long.toHexString(getSessionId()),
e);
break;
} else {
LOG.warn(
"Session 0x{} for sever {}, Closing socket connection. "
+ "Attempting reconnect except it is a SessionExpiredException.",
Long.toHexString(getSessionId()),
serverAddress,
e);
// At this point, there might still be new packets appended to outgoingQueue.
// they will be handled in next connection or cleared up if closed.
cleanAndNotifyState();
}
}
}
synchronized (state) {
// When it comes to this point, it guarantees that later queued
// packet to outgoingQueue will be notified of death.
cleanup();
}
clientCnxnSocket.close();
if (state.isAlive()) {
eventThread.queueEvent(new WatchedEvent(Event.EventType.None, Event.KeeperState.Disconnected, null));
}
eventThread.queueEvent(new WatchedEvent(Event.EventType.None, Event.KeeperState.Closed, null));
ZooTrace.logTraceMessage(
LOG,
ZooTrace.getTextTraceLevel(),
"SendThread exited loop for session: 0x" + Long.toHexString(getSessionId()));
}
Client默认Socket实现类是基本NIO,这里主要分析org.apache.zookeeper.ClientCnxnSocketNIO#doTransport,主要调用org.apache.zookeeper.ClientCnxnSocketNIO#doIO,分为读操作和写操作
读操作是等待服务端给返回数据即pendingQueue
//读取服务端返回的buffer
void readResponse(ByteBuffer incomingBuffer) throws IOException {
ByteBufferInputStream bbis = new ByteBufferInputStream(incomingBuffer);
BinaryInputArchive bbia = BinaryInputArchive.getArchive(bbis);
ReplyHeader replyHdr = new ReplyHeader();
replyHdr.deserialize(bbia, "header");
switch (replyHdr.getXid()) {
case PING_XID:
LOG.debug("Got ping response for session id: 0x{} after {}ms.",
Long.toHexString(sessionId),
((System.nanoTime() - lastPingSentNs) / 1000000));
return;
case AUTHPACKET_XID:
LOG.debug("Got auth session id: 0x{}", Long.toHexString(sessionId));
if (replyHdr.getErr() == KeeperException.Code.AUTHFAILED.intValue()) {
changeZkState(States.AUTH_FAILED);
eventThread.queueEvent(new WatchedEvent(Watcher.Event.EventType.None,
Watcher.Event.KeeperState.AuthFailed, null));
eventThread.queueEventOfDeath();
}
return;
case NOTIFICATION_XID:
LOG.debug("Got notification session id: 0x{}",
Long.toHexString(sessionId));
WatcherEvent event = new WatcherEvent();
event.deserialize(bbia, "response");
// convert from a server path to a client path
if (chrootPath != null) {
String serverPath = event.getPath();
if (serverPath.compareTo(chrootPath) == 0) {
event.setPath("/");
} else if (serverPath.length() > chrootPath.length()) {
event.setPath(serverPath.substring(chrootPath.length()));
} else {
LOG.warn("Got server path {} which is too short for chroot path {}.",
event.getPath(), chrootPath);
}
}
WatchedEvent we = new WatchedEvent(event);
LOG.debug("Got {} for session id 0x{}", we, Long.toHexString(sessionId));
//Watcher机制在客户端的实现
eventThread.queueEvent(we);
return;
default:
break;
}
// If SASL authentication is currently in progress, construct and
// send a response packet immediately, rather than queuing a
// response as with other packets.
if (tunnelAuthInProgress()) {
GetSASLRequest request = new GetSASLRequest();
request.deserialize(bbia, "token");
zooKeeperSaslClient.respondToServer(request.getToken(), ClientCnxn.this);
return;
}
Packet packet;
synchronized (pendingQueue) {
if (pendingQueue.size() == 0) {
throw new IOException("Nothing in the queue, but got " + replyHdr.getXid());
}
packet = pendingQueue.remove();
}
/*
* Since requests are processed in order, we better get a response
* to the first request!
*/
try {
if (packet.requestHeader.getXid() != replyHdr.getXid()) {
packet.replyHeader.setErr(KeeperException.Code.CONNECTIONLOSS.intValue());
throw new IOException("Xid out of order. Got Xid " + replyHdr.getXid()
+ " with err " + replyHdr.getErr()
+ " expected Xid " + packet.requestHeader.getXid()
+ " for a packet with details: " + packet);
}
packet.replyHeader.setXid(replyHdr.getXid());
packet.replyHeader.setErr(replyHdr.getErr());
packet.replyHeader.setZxid(replyHdr.getZxid());
if (replyHdr.getZxid() > 0) {
lastZxid = replyHdr.getZxid();
}
if (packet.response != null && replyHdr.getErr() == 0) {
packet.response.deserialize(bbia, "response");
}
LOG.debug("Reading reply session id: 0x{}, packet:: {}", Long.toHexString(sessionId), packet);
} finally {
//处理读到的非Ping、Auth、Notify的数据包
finishPacket(packet);
}
}
写操作是向服务端发送数据即outgoingQueue
if (sockKey.isWritable()) {
Packet p = findSendablePacket(outgoingQueue, sendThread.tunnelAuthInProgress());
if (p != null) {
updateLastSend();
// If we already started writing p, p.bb will already exist
if (p.bb == null) {
if ((p.requestHeader != null)
&& (p.requestHeader.getType() != OpCode.ping)
&& (p.requestHeader.getType() != OpCode.auth)) {
p.requestHeader.setXid(cnxn.getXid());
}
p.createBB();
}
//数据从缓存区写入,NIO和BIO最大的区别是面向缓冲区
sock.write(p.bb);
if (!p.bb.hasRemaining()) {
sentCount.getAndIncrement();
outgoingQueue.removeFirstOccurrence(p);
if (p.requestHeader != null
&& p.requestHeader.getType() != OpCode.ping
&& p.requestHeader.getType() != OpCode.auth) {
synchronized (pendingQueue) {
//放入等待接收数据queue
pendingQueue.add(p);
}
}
}
}
<!--后面省略-->
}
EventThread是干什么的?
从SendThread的实现org.apache.zookeeper.ClientCnxn.SendThread#readResponse可以看出调用了org.apache.zookeeper.ClientCnxn.EventThread#queueEvent(org.apache.zookeeper.WatchedEvent)和org.apache.zookeeper.ClientCnxn.EventThread#queuePacket方法,将数据放入waitingEvents队列中。waitingEvents就是EventThread的工作队列。
@Override
@SuppressFBWarnings("JLM_JSR166_UTILCONCURRENT_MONITORENTER")
public void run() {
try {
isRunning = true;
while (true) {
Object event = waitingEvents.take();
if (event == eventOfDeath) {
wasKilled = true;
} else {
//处理事件
processEvent(event);
}
if (wasKilled) {
synchronized (waitingEvents) {
if (waitingEvents.isEmpty()) {
isRunning = false;
break;
}
}
}
}
} catch (InterruptedException e) {
LOG.error("Event thread exiting due to interruption", e);
}
LOG.info("EventThread shut down for session: 0x{}", Long.toHexString(getSessionId()));
}
在idea中运行如下
Watch机制原理
前面讲解了Server和Client的运行逻辑,Client端已经说明了Watcher是怎么触发的,这里再看下Server是怎么处理Watcher。
从org.apache.zookeeper.server.ZooKeeperServer#setupRequestProcessors中可以看出,数据处理是用了责任链模式
protected void setupRequestProcessors() {
RequestProcessor finalProcessor = new FinalRequestProcessor(this);
RequestProcessor syncProcessor = new SyncRequestProcessor(this, finalProcessor);
((SyncRequestProcessor) syncProcessor).start();
firstProcessor = new PrepRequestProcessor(this, syncProcessor);
((PrepRequestProcessor) firstProcessor).start();
}
最终在org.apache.zookeeper.server.FinalRequestProcessor#processRequest方法中完成写入内存数据和watcher注册到管理器上,以getData为例:
private Record handleGetDataRequest(Record request, ServerCnxn cnxn, List<Id> authInfo) throws KeeperException, IOException {
GetDataRequest getDataRequest = (GetDataRequest) request;
String path = getDataRequest.getPath();
DataNode n = zks.getZKDatabase().getNode(path);
if (n == null) {
throw new KeeperException.NoNodeException();
}
zks.checkACL(cnxn, zks.getZKDatabase().aclForNode(n), ZooDefs.Perms.READ, authInfo, path, null);
Stat stat = new Stat();
//如果有watch,把和客户端的连接传到下一个方法,向客户端发送数据用。
byte[] b = zks.getZKDatabase().getData(path, stat, getDataRequest.getWatch() ? cnxn : null);
return new GetDataResponse(b, stat);
}
//org.apache.zookeeper.server.ZKDatabase#getData
public byte[] getData(String path, Stat stat, Watcher watcher) throws KeeperException.NoNodeException {
return dataTree.getData(path, stat, watcher);
}
//org.apache.zookeeper.server.DataTree#getData
public byte[] getData(String path, Stat stat, Watcher watcher) throws KeeperException.NoNodeException {
DataNode n = nodes.get(path);
byte[] data = null;
if (n == null) {
throw new KeeperException.NoNodeException();
}
synchronized (n) {
n.copyStat(stat);
if (watcher != null) {
//完成注册 dataWatches管理了这个节点所有的watcher
dataWatches.addWatch(path, watcher);
}
data = n.data;
}
updateReadStat(path, data == null ? 0 : data.length);
return data;
}
//org.apache.zookeeper.server.watch.WatchManager#addWatch(java.lang.String, org.apache.zookeeper.Watcher)
@Override
public boolean addWatch(String path, Watcher watcher) {
return addWatch(path, watcher, WatcherMode.DEFAULT_WATCHER_MODE);
}
//org.apache.zookeeper.server.watch.WatchManager#addWatch(java.lang.String, org.apache.zookeeper.Watcher, org.apache.zookeeper.server.watch.WatcherMode)
@Override
public synchronized boolean addWatch(String path, Watcher watcher, WatcherMode watcherMode) {
if (isDeadWatcher(watcher)) {
LOG.debug("Ignoring addWatch with closed cnxn");
return false;
}
Set<Watcher> list = watchTable.get(path);
if (list == null) {
// don't waste memory if there are few watches on a node
// rehash when the 4th entry is added, doubling size thereafter
// seems like a good compromise
list = new HashSet<>(4);
//维护节点路径和watcher集合,给触发使用
watchTable.put(path, list);
}
list.add(watcher);
Set<String> paths = watch2Paths.get(watcher);
if (paths == null) {
// cnxns typically have many watches, so use default cap here
paths = new HashSet<>();
watch2Paths.put(watcher, paths);
}
watcherModeManager.setWatcherMode(watcher, path, watcherMode);
return paths.add(path);
}
再看一下怎么触发操作?写数据最终也是在org.apache.zookeeper.server.FinalRequestProcessor#processRequest上处理,以setData为例
public void processRequest(Request request) {
LOG.debug("Processing request:: {}", request);
// request.addRQRec(">final");
long traceMask = ZooTrace.CLIENT_REQUEST_TRACE_MASK;
if (request.type == OpCode.ping) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
if (LOG.isTraceEnabled()) {
ZooTrace.logRequest(LOG, traceMask, 'E', request, "");
}
//处理数据操作
ProcessTxnResult rc = zks.processTxn(request);
<!-- 省略 -->
}
//org.apache.zookeeper.server.ZooKeeperServer#processTxn(org.apache.zookeeper.server.Request)
// entry point for FinalRequestProcessor.java
public ProcessTxnResult processTxn(Request request) {
TxnHeader hdr = request.getHdr();
processTxnForSessionEvents(request, hdr, request.getTxn());
final boolean writeRequest = (hdr != null);
final boolean quorumRequest = request.isQuorum();
// return fast w/o synchronization when we get a read
if (!writeRequest && !quorumRequest) {
return new ProcessTxnResult();
}
synchronized (outstandingChanges) {
//处理写入DataTree
ProcessTxnResult rc = processTxnInDB(hdr, request.getTxn(), request.getTxnDigest());
// request.hdr is set for write requests, which are the only ones
// that add to outstandingChanges.
<!-- 省略 -->
}
//org.apache.zookeeper.server.ZooKeeperServer#processTxnInDB
private ProcessTxnResult processTxnInDB(TxnHeader hdr, Record txn, TxnDigest digest) {
if (hdr == null) {
return new ProcessTxnResult();
} else {
return getZKDatabase().processTxn(hdr, txn, digest);
}
}
//调用比较多,不贴非关键代码
//org.apache.zookeeper.server.ZKDatabase#processTxn
//org.apache.zookeeper.server.DataTree#processTxn(org.apache.zookeeper.txn.TxnHeader, org.apache.jute.Record, org.apache.zookeeper.txn.TxnDigest)
//org.apache.zookeeper.server.DataTree#processTxn(org.apache.zookeeper.txn.TxnHeader, org.apache.jute.Record)
//org.apache.zookeeper.server.DataTree#processTxn(org.apache.zookeeper.txn.TxnHeader, org.apache.jute.Record, boolean)
case OpCode.setData:
SetDataTxn setDataTxn = (SetDataTxn) txn;
rc.path = setDataTxn.getPath();
rc.stat = setData(
setDataTxn.getPath(),
setDataTxn.getData(),
setDataTxn.getVersion(),
header.getZxid(),
header.getTime());
break;
//org.apache.zookeeper.server.DataTree#setData
public Stat setData(String path, byte[] data, int version, long zxid, long time) throws KeeperException.NoNodeException {
Stat s = new Stat();
DataNode n = nodes.get(path);
if (n == null) {
throw new KeeperException.NoNodeException();
}
byte[] lastdata = null;
synchronized (n) {
lastdata = n.data;
nodes.preChange(path, n);
n.data = data;
n.stat.setMtime(time);
n.stat.setMzxid(zxid);
n.stat.setVersion(version);
n.copyStat(s);
nodes.postChange(path, n);
}
// now update if the path is in a quota subtree.
String lastPrefix = getMaxPrefixWithQuota(path);
long dataBytes = data == null ? 0 : data.length;
if (lastPrefix != null) {
this.updateCountBytes(lastPrefix, dataBytes - (lastdata == null ? 0 : lastdata.length), 0);
}
nodeDataSize.addAndGet(getNodeSize(path, data) - getNodeSize(path, lastdata));
updateWriteStat(path, dataBytes);
//触发watcher,type为 NodeDataChanged,dataWatches就是我们注册watcher的管理器
dataWatches.triggerWatch(path, EventType.NodeDataChanged);
return s;
}
//org.apache.zookeeper.server.watch.WatchManager#triggerWatch(java.lang.String, org.apache.zookeeper.Watcher.Event.EventType)
@Override
public WatcherOrBitSet triggerWatch(String path, EventType type) {
return triggerWatch(path, type, null);
}
@Override
public WatcherOrBitSet triggerWatch(String path, EventType type, WatcherOrBitSet supress) {
WatchedEvent e = new WatchedEvent(type, KeeperState.SyncConnected, path);
Set<Watcher> watchers = new HashSet<>();
PathParentIterator pathParentIterator = getPathParentIterator(path);
synchronized (this) {
for (String localPath : pathParentIterator.asIterable()) {
//取出路径对应的Watcher
Set<Watcher> thisWatchers = watchTable.get(localPath);
if (thisWatchers == null || thisWatchers.isEmpty()) {
continue;
}
Iterator<Watcher> iterator = thisWatchers.iterator();
while (iterator.hasNext()) {
Watcher watcher = iterator.next();
WatcherMode watcherMode = watcherModeManager.getWatcherMode(watcher, localPath);
if (watcherMode.isRecursive()) {
if (type != EventType.NodeChildrenChanged) {
watchers.add(watcher);
}
} else if (!pathParentIterator.atParentPath()) {
watchers.add(watcher);
if (!watcherMode.isPersistent()) {
iterator.remove();
Set<String> paths = watch2Paths.get(watcher);
if (paths != null) {
paths.remove(localPath);
}
}
}
}
if (thisWatchers.isEmpty()) {
watchTable.remove(localPath);
}
}
}
if (watchers.isEmpty()) {
if (LOG.isTraceEnabled()) {
ZooTrace.logTraceMessage(LOG, ZooTrace.EVENT_DELIVERY_TRACE_MASK, "No watchers for " + path);
}
return null;
}
///逐个通知客户端
for (Watcher w : watchers) {
if (supress != null && supress.contains(w)) {
continue;
}
w.process(e);
}
<!-- 省略 -->
}
从源码上看,Zookeeper主要是基于NIO网络和异步数据处理,基本上将每一步操作都拆分成单独的线程执行,服务端将连接和数据处理分开,客户端将数据读写和事件回调分开,利用异步处理提高数据吞吐量。