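We may rarely use ZooKeeper directly, and some of us never will. Still, as a distributed coordination framework it plays a major role in today's distributed systems, and many popular frameworks depend on it, such as the distributed service governance framework Dubbo and HBase in the big data space. That makes ZooKeeper well worth understanding.
This article looks at how a client connects to ZooKeeper, from the source code point of view. That may sound odd: what is there to learn about a simple connect call? As the walkthrough shows, this seemingly simple operation does quite a lot of work. When we use ZooKeeper, we usually connect to the cluster like this: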
public ZooKeeper connect(String connStr) throws IOException {
return new ZooKeeper(connStr, 3000, new Watcher() {
@Override
public void process(WatchedEvent watchedEvent) {
if (Event.KeeperState.SyncConnected == watchedEvent.getState()) {
countDownLatch.countDown();
System.out.println("connect zk...");
}
}
});
}
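Yes, we connect to the ZooKeeper cluster simply by calling new ZooKeeper(). Note that connecting is asynchronous: the constructor returns before the session is established, which is why the watcher in the example counts down a latch once it sees SyncConnected. So what does the constructor do?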
public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher,
boolean canBeReadOnly)
throws IOException
{
LOG.info("Initiating client connection, connectString=" + connectString
+ " sessionTimeout=" + sessionTimeout + " watcher=" + watcher);
// ZKWatchManager manages the watchers and handles events generated by the client (ClientCnxn)
watchManager.defaultWatcher = watcher;
// Parse the ZooKeeper cluster connect string; host names are not resolved at this step
ConnectStringParser connectStringParser = new ConnectStringParser(
connectString);
// Resolve the addresses
HostProvider hostProvider = new StaticHostProvider(
connectStringParser.getServerAddresses());
// Create the connection object; no connection is opened here, it is established later by SendThread
cnxn = new ClientCnxn(connectStringParser.getChrootPath(),
hostProvider, sessionTimeout, this, watchManager,
getClientCnxnSocket(), canBeReadOnly);
cnxn.start();
}
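Because the constructor only creates ClientCnxn and starts its threads, it returns before any session exists. The following is a minimal sketch of the "start in the background, block on a latch" pattern that the connect() example relies on. It uses only the JDK: `connectAsync` and `ConnectedCallback` are hypothetical stand-ins for `new ZooKeeper(...)` and `Watcher`, not ZooKeeper API, and the 50 ms delay is an arbitrary assumption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class AsyncConnectSketch {
    /** Hypothetical stand-in for the Watcher callback. */
    interface ConnectedCallback {
        void onConnected();
    }

    /** Returns immediately; the callback fires later on a background daemon thread. */
    static void connectAsync(ConnectedCallback callback) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(50); // pretend the handshake takes a while (assumed value)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            callback.onConnected();
        }, "fake-send-thread");
        t.setDaemon(true);
        t.start();
    }

    /** Blocks the caller until the callback has fired or the timeout elapses. */
    static boolean connectAndWait(long timeoutMs) {
        CountDownLatch connected = new CountDownLatch(1);
        connectAsync(connected::countDown);
        try {
            return connected.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("connected in time: " + connectAndWait(2000));
    }
}
```

With a real client the same idea applies: create the latch first, count it down in the watcher when SyncConnected arrives, and await it before issuing requests.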
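The most important part of the ZooKeeper constructor is its last two statements: creating the ClientCnxn object and calling cnxn.start(). Let's look first at what creating the ClientCnxn does.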
/**
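* Creates a ClientCnxn (connection) object. After calling this constructor, its start() method must be called.
*
* @param chrootPath - the chroot of this client. Should be removed from this Class in ZOOKEEPER-838
* @param hostProvider
* the list of ZooKeeper cluster addresses
* @param sessionTimeout
* the session timeout
* @param zooKeeper
* the ZooKeeper object associated with this ClientCnxn
* @param watcher the watcher that listens on this connection
* @param clientCnxnSocket
* the network socket (NIO / Netty supported)
* @param sessionId the session id
* @param sessionPasswd the session password
* @param canBeReadOnly
* whether this client may connect to a read-only server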
* @throws IOException
*/
public ClientCnxn(String chrootPath, HostProvider hostProvider, int sessionTimeout, ZooKeeper zooKeeper,
ClientWatchManager watcher, ClientCnxnSocket clientCnxnSocket,
long sessionId, byte[] sessionPasswd, boolean canBeReadOnly) {
this.zooKeeper = zooKeeper;
this.watcher = watcher;
this.sessionId = sessionId;
this.sessionPasswd = sessionPasswd;
this.sessionTimeout = sessionTimeout;
this.hostProvider = hostProvider;
this.chrootPath = chrootPath;
// Compute the connect timeout and the read timeout
connectTimeout = sessionTimeout / hostProvider.size();
readTimeout = sessionTimeout * 2 / 3;
readOnly = canBeReadOnly;
// Thread objects; both are daemon threads
// Creating the SendThread also sets the state to CONNECTING
sendThread = new SendThread(clientCnxnSocket);
eventThread = new EventThread();
}
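The two timeout formulas in the constructor above are plain integer arithmetic, so a worked example may help. This is a small sketch that copies the two expressions into standalone methods; the class and method names are made up for illustration, and the 3000 ms / 3 hosts inputs are assumed values (3000 matches the connect() example at the top).

```java
class TimeoutSketch {
    /** Mirrors: connectTimeout = sessionTimeout / hostProvider.size(). */
    static int connectTimeout(int sessionTimeoutMs, int hostCount) {
        return sessionTimeoutMs / hostCount;
    }

    /** Mirrors: readTimeout = sessionTimeout * 2 / 3. */
    static int readTimeout(int sessionTimeoutMs) {
        return sessionTimeoutMs * 2 / 3;
    }

    public static void main(String[] args) {
        int sessionTimeoutMs = 3000; // assumed, as in the connect() example
        int hostCount = 3;           // assumed cluster size
        System.out.println(connectTimeout(sessionTimeoutMs, hostCount)); // prints 1000
        System.out.println(readTimeout(sessionTimeoutMs));               // prints 2000
    }
}
```

So the more servers appear in the connect string, the less time the client allows for each individual connection attempt, while the read timeout stays at two thirds of the session timeout.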
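As you can see, this constructor takes many parameters, but only a few matter here: the ZooKeeper object, the ClientCnxnSocket object that represents the network socket, and the sessionId. The rest can be ignored for now. The constructor seems to do little beyond simple assignments, but the two thread objects it creates are important: sendThread and eventThread. Judging by the names, sendThread is responsible for establishing the network connection and performing network I/O, while eventThread is responsible for processing events. Neither thread is started here; as the method comment says, the start() method of the ClientCnxn object must be called right after the constructor.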
public void start() {
sendThread.start();
eventThread.start();
}
This method simply starts the two threads mentioned above. Let's look first at the run() method of SendThread.
@Override
public void run() {
// Perform some initialization on the ClientCnxnSocket object
clientCnxnSocket.introduce(this,sessionId);
// Record the current time
clientCnxnSocket.updateNow();
// Record the last-send and last-heard timestamps
clientCnxnSocket.updateLastSendAndHeard();
int to;
// Time of the last ping to a read/write server
long lastPingRwServer = System.currentTimeMillis();
// Maximum interval between pings
final int MAX_SEND_PING_INTERVAL = 10000; //10 seconds
// state != CLOSED && state != AUTH_FAILED
while (state.isAlive()) {
try {
// Not connected yet
if (!clientCnxnSocket.isConnected()) {
if(!isFirstConnect){
// Not the first connection attempt for this session: sleep for a random interval first
try {
Thread.sleep(r.nextInt(1000));
} catch (InterruptedException e) {
LOG.warn("Unexpected exception", e);
}
}
// Do not reconnect while the session is being closed
if (closing || !state.isAlive()) {
break;
}
// Key statement: start connecting
startConnect();
clientCnxnSocket.updateLastSendAndHeard();
}
// Connection already established
if (state.isConnected()) {
// determine whether we need to send an AuthFailed event.
if (zooKeeperSaslClient != null) {
boolean sendAuthEvent = false;
if (zooKeeperSaslClient.getSaslState() == ZooKeeperSaslClient.SaslState.INITIAL) {
try {
zooKeeperSaslClient.initialize(ClientCnxn.this);
} catch (SaslException e) {
LOG.error("SASL authentication with Zookeeper Quorum member failed: " + e);
state = States.AUTH_FAILED;
sendAuthEvent = true;
}
}
KeeperState authState = zooKeeperSaslClient.getKeeperState();
if (authState != null) {
if (authState == KeeperState.AuthFailed) {
// An authentication error occurred during authentication with the Zookeeper Server.
state = States.AUTH_FAILED;
sendAuthEvent = true;
} else {
if (authState == KeeperState.SaslAuthenticated) {
sendAuthEvent = true;
}
}
}
if (sendAuthEvent == true) {
eventThread.queueEvent(new WatchedEvent(
Watcher.Event.EventType.None,
authState,null));
}
}
to = readTimeout - clientCnxnSocket.getIdleRecv();
} else {
to = connectTimeout - clientCnxnSocket.getIdleRecv();
}
if (to <= 0) {
throw new SessionTimeoutException(
"Client session timed out, have not heard from server in "
+ clientCnxnSocket.getIdleRecv() + "ms"
+ " for sessionid 0x"
+ Long.toHexString(sessionId));
}
// Connection established
if (state.isConnected()) {
//1000(1 second) is to prevent race condition missing to send the second ping
//also make sure not to send too many pings when readTimeout is small
int timeToNextPing = readTimeout / 2 - clientCnxnSocket.getIdleSend() -
((clientCnxnSocket.getIdleSend() > 1000) ? 1000 : 0);
//send a ping request either time is due or no packet sent out within MAX_SEND_PING_INTERVAL
if (timeToNextPing <= 0 || clientCnxnSocket.getIdleSend() > MAX_SEND_PING_INTERVAL) {
// Send a heartbeat
sendPing();
clientCnxnSocket.updateLastSend();
} else {
if (timeToNextPing < to) {
to = timeToNextPing;
}
}
}
// If we are in read-only mode, seek for read/write server
if (state == States.CONNECTEDREADONLY) {
long now = System.currentTimeMillis();
int idlePingRwServer = (int) (now - lastPingRwServer);
if (idlePingRwServer >= pingRwTimeout) {
lastPingRwServer = now;
idlePingRwServer = 0;
pingRwTimeout =
Math.min(2*pingRwTimeout, maxPingRwTimeout);
pingRwServer();
}
to = Math.min(to, pingRwTimeout - idlePingRwServer);
}
// Key call: this is where the actual network I/O happens
clientCnxnSocket.doTransport(to, pendingQueue, outgoingQueue, ClientCnxn.this);
} catch (Throwable e) {
if (closing) {
if (LOG.isDebugEnabled()) {
// closing so this is expected
LOG.debug("An exception was thrown while closing send thread for session 0x"
+ Long.toHexString(getSessionId())
+ " : " + e.getMessage());
}
break;
} else {
// this is ugly, you have a better way speak up
if (e instanceof SessionExpiredException) {
LOG.info(e.getMessage() + ", closing socket connection");
} else if (e instanceof SessionTimeoutException) {
LOG.info(e.getMessage() + RETRY_CONN_MSG);
} else if (e instanceof EndOfStreamException) {
LOG.info(e.getMessage() + RETRY_CONN_MSG);
} else if (e instanceof RWServerFoundException) {
LOG.info(e.getMessage());
} else {
LOG.warn(
"Session 0x"
+ Long.toHexString(getSessionId())
+ " for server "
+ clientCnxnSocket.getRemoteSocketAddress()
+ ", unexpected error"
+ RETRY_CONN_MSG, e);
}
cleanup();
if (state.isAlive()) {
eventThread.queueEvent(new WatchedEvent(
Event.EventType.None,
Event.KeeperState.Disconnected,
null));
}
clientCnxnSocket.updateNow();
clientCnxnSocket.updateLastSendAndHeard();
}
}
} // ending while
cleanup();
clientCnxnSocket.close();
if (state.isAlive()) {
eventThread.queueEvent(new WatchedEvent(Event.EventType.None,
Event.KeeperState.Disconnected, null));
}
ZooTrace.logTraceMessage(LOG, ZooTrace.getTextTraceLevel(),
"SendThread exitedloop.");
}
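The heartbeat timing buried in the loop above is easier to follow in isolation. Here is a small sketch that lifts the `timeToNextPing` expression and the ping condition out of run() into pure functions; the class and method names are invented for illustration, and the sample inputs are assumed values.

```java
class PingTimingSketch {
    static final int MAX_SEND_PING_INTERVAL = 10000; // 10 seconds, as in run()

    /** Same expression as in run(): half the read timeout, minus idle time, minus a 1 s safety margin. */
    static int timeToNextPing(int readTimeout, int idleSend) {
        return readTimeout / 2 - idleSend - ((idleSend > 1000) ? 1000 : 0);
    }

    /** A ping is due when the timer has run out or nothing was sent for MAX_SEND_PING_INTERVAL. */
    static boolean shouldPing(int readTimeout, int idleSend) {
        return timeToNextPing(readTimeout, idleSend) <= 0
                || idleSend > MAX_SEND_PING_INTERVAL;
    }

    public static void main(String[] args) {
        // readTimeout = 2000 ms (assumed): nothing sent for 500 ms, next ping in 500 ms
        System.out.println(timeToNextPing(2000, 500)); // prints 500
        // nothing sent for 1000 ms: the timer has run out, so ping now
        System.out.println(shouldPing(2000, 1000));    // prints true
        // large readTimeout: the 10 s cap forces a ping even though the timer has not run out
        System.out.println(shouldPing(40000, 10001));  // prints true
    }
}
```

When no ping is due, run() uses the smaller of `to` and `timeToNextPing` as the select timeout, so the loop wakes up in time to send the next heartbeat.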
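The body of run() is long and full of if branches. Assume this is the very first connection attempt; the first line to focus on is then startConnect(). Let's see how the startConnect() method of SendThread begins establishing the connection.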
private void startConnect() throws IOException {
// Set the state to CONNECTING
state = States.CONNECTING;
InetSocketAddress addr;
if (rwServerAddress != null) {
addr = rwServerAddress;
rwServerAddress = null;
} else {
// Get the next server to connect to
addr = hostProvider.next(1000);
}
// Set the thread name
setName(getName().replaceAll("\\(.*\\)",
"(" + addr.getHostName() + ":" + addr.getPort() + ")"));
if (ZooKeeperSaslClient.isEnabled()) {
try {
String principalUserName = System.getProperty(
ZK_SASL_CLIENT_USERNAME, "zookeeper");
zooKeeperSaslClient =
new ZooKeeperSaslClient(
principalUserName+"/"+addr.getHostName());
} catch (LoginException e) {
// An authentication error occurred when the SASL client tried to initialize:
// for Kerberos this means that the client failed to authenticate with the KDC.
// This is different from an authentication error that occurs during communication
// with the Zookeeper server, which is handled below.
LOG.warn("SASL configuration failed: " + e + " Will continue connection to Zookeeper server without "
+ "SASL authentication, if Zookeeper server allows it.");
eventThread.queueEvent(new WatchedEvent(
Watcher.Event.EventType.None,
Watcher.Event.KeeperState.AuthFailed, null));
saslLoginFailed = true;
}
}
logStartConnect(addr);
// Establish the connection through the socket
clientCnxnSocket.connect(addr);
}
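Note the last line, which is where the socket is actually used to open the remote connection. Here we take the NIO implementation, ClientCnxnSocketNIO, as the example.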
@Override
void connect(InetSocketAddress addr) throws IOException {
// Create the SocketChannel
SocketChannel sock = createSock();
try {
// Register the SocketChannel with the Selector; the interest op is SelectionKey.OP_CONNECT
registerAndConnect(sock, addr);
} catch (IOException e) {
LOG.error("Unable to open socket to " + addr);
sock.close();
throw e;
}
initialized = false;
/*
* Reset incomingBuffer
*/
lenBuffer.clear();
incomingBuffer = lenBuffer;
}
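This connect() method does not change the client's connection state, which is still CONNECTING. So the next thing to look at is this call in the run() method shown earlier: clientCnxnSocket.doTransport(to, pendingQueue, outgoingQueue, ClientCnxn.this).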
@Override
void doTransport(int waitTimeOut, List<Packet> pendingQueue, LinkedList<Packet> outgoingQueue,
ClientCnxn cnxn)
throws IOException, InterruptedException {
selector.select(waitTimeOut);
Set<SelectionKey> selected;
synchronized (this) {
selected = selector.selectedKeys();
}
// Everything below and until we get back to the select is
// non blocking, so time is effectively a constant. That is
// Why we just have to do this once, here
updateNow();
for (SelectionKey k : selected) {
SocketChannel sc = ((SocketChannel) k.channel());
if ((k.readyOps() & SelectionKey.OP_CONNECT) != 0) {
if (sc.finishConnect()) { // note this call
updateLastSendAndHeard();
sendThread.primeConnection();
}
} else if ((k.readyOps() & (SelectionKey.OP_READ | SelectionKey.OP_WRITE)) != 0) {
doIO(pendingQueue, outgoingQueue, cnxn);
}
}
if (sendThread.getZkState().isConnected()) {
synchronized(outgoingQueue) {
if (findSendablePacket(outgoingQueue,
cnxn.sendThread.clientTunneledAuthenticationInProgress()) != null) {
enableWrite();
}
}
}
selected.clear();
}
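The OP_CONNECT branch above follows the standard JDK pattern for a non-blocking connect. As a rough, self-contained sketch of that pattern (not ZooKeeper's code), the example below connects to a throwaway server on the loopback interface; the class name, method name, and timeout handling are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

class NonBlockingConnectSketch {
    /** Connects to a local test server without blocking; returns true once connected. */
    static boolean connectLoopback(long timeoutMs) {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open();
             SocketChannel sock = SocketChannel.open()) {
            // Port 0 lets the OS choose a free port for the throwaway server.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress addr = (InetSocketAddress) server.getLocalAddress();

            sock.configureBlocking(false);
            sock.register(selector, SelectionKey.OP_CONNECT);
            // In non-blocking mode connect() may return before the connection exists.
            if (sock.connect(addr)) {
                return true; // connected immediately, possible on loopback
            }
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (System.currentTimeMillis() < deadline) {
                selector.select(100);
                for (SelectionKey k : selector.selectedKeys()) {
                    SocketChannel sc = (SocketChannel) k.channel();
                    // finishConnect() completes the handshake or throws IOException.
                    if (k.isConnectable() && sc.finishConnect()) {
                        // From here on we care about reads and writes, not connects.
                        k.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
                        return true;
                    }
                }
                selector.selectedKeys().clear();
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("connected: " + connectLoopback(2000));
    }
}
```

Switching the interest set after `finishConnect()` succeeds plays the same role as the `enableReadWriteOnly()` call at the end of primeConnection().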
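In doTransport(), when we are establishing a connection, one call deserves attention: sc.finishConnect(). The connect() method shown earlier calls registerAndConnect(sock, addr), which calls the connect() method of SocketChannel on a channel that has been configured as non-blocking. **When a SocketChannel is in non-blocking mode, connect() may return before the connection has been established. To find out whether the connection has been established, call finishConnect().** Because the selector has already reported the channel as ready for OP_CONNECT, finishConnect() here is expected either to return true or to throw an exception. If it returns true, the connection to the server is established and data can be sent. Let's look at the logic of primeConnection().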
void primeConnection() throws IOException {
LOG.info("Socket connection established to "
+ clientCnxnSocket.getRemoteSocketAddress()
+ ", initiating session");
isFirstConnect = false; // clear the first-connect flag
long sessId = (seenRwServerBefore) ? sessionId : 0;
// Build the connect request
ConnectRequest conReq = new ConnectRequest(0, lastZxid,
sessionTimeout, sessId, sessionPasswd);
synchronized (outgoingQueue) {
// We add backwards since we are pushing into the front
// Only send if there's a pending watch
// TODO: here we have the only remaining use of zooKeeper in
// this class. It's to be eliminated!
if (!disableAutoWatchReset) {
List<String> dataWatches = zooKeeper.getDataWatches();
List<String> existWatches = zooKeeper.getExistWatches();
List<String> childWatches = zooKeeper.getChildWatches();
if (!dataWatches.isEmpty()
|| !existWatches.isEmpty() || !childWatches.isEmpty()) {
SetWatches sw = new SetWatches(lastZxid,
prependChroot(dataWatches),
prependChroot(existWatches),
prependChroot(childWatches));
RequestHeader h = new RequestHeader();
h.setType(ZooDefs.OpCode.setWatches);
h.setXid(-8);
Packet packet = new Packet(h, new ReplyHeader(), sw, null, null);
outgoingQueue.addFirst(packet);
}
}
for (AuthData id : authInfo) {
outgoingQueue.addFirst(new Packet(new RequestHeader(-4,
OpCode.auth), null, new AuthPacket(0, id.scheme,
id.data), null, null));
}
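// Put the connect request at the head of the queue. outgoingQueue is a LinkedList of packets waiting to be sent;
// packets that have been sent and are still awaiting a response are kept in pendingQueue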
outgoingQueue.addFirst(new Packet(null, null, conReq,
null, null, readOnly));
}
// Enable read and write interest on the SocketChannel
clientCnxnSocket.enableReadWriteOnly();
}
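Since primeConnection() pushes every packet with addFirst(), the packet pushed last ends up at the head of the queue. The sketch below replays that ordering with plain strings in place of Packet objects; the class name, the string labels, and the pre-existing `getData(/app)` entry are all hypothetical and only serve to make the order visible.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class OutgoingQueueOrderSketch {
    /** Replays the addFirst() calls of primeConnection() using labels instead of Packets. */
    static List<String> prime(boolean hasWatches, List<String> authSchemes) {
        LinkedList<String> outgoingQueue = new LinkedList<>();
        // Hypothetical request that was already queued before the (re)connect.
        outgoingQueue.add("getData(/app)");

        if (hasWatches) {
            outgoingQueue.addFirst("SetWatches");
        }
        for (String scheme : authSchemes) {
            outgoingQueue.addFirst("Auth(" + scheme + ")");
        }
        // Pushed last, therefore sent first.
        outgoingQueue.addFirst("ConnectRequest");
        return outgoingQueue;
    }

    public static void main(String[] args) {
        System.out.println(prime(true, Arrays.asList("digest")));
        // prints [ConnectRequest, Auth(digest), SetWatches, getData(/app)]
    }
}
```

This is why the code comment says "We add backwards since we are pushing into the front": the connect request must go out first, followed by auth data and watch re-registration, and only then any ordinary requests.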
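As you can see, ZooKeeper wraps every request in a Packet object and puts it on a queue. So when are the queued requests actually sent? If the connection has not timed out, the loop enters the doTransport() method discussed above again, and because read and write interest are now registered, doIO() gets called.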
void doIO(List<Packet> pendingQueue, LinkedList<Packet> outgoingQueue, ClientCnxn cnxn)
throws InterruptedException, IOException {
SocketChannel sock = (SocketChannel) sockKey.channel();
if (sock == null) {
throw new IOException("Socket is null!");
}
// Read
if (sockKey.isReadable()) {
int rc = sock.read(incomingBuffer);
if (rc < 0) {
throw new EndOfStreamException(
"Unable to read additional data from server sessionid 0x"
+ Long.toHexString(sessionId)
+ ", likely server has closed socket");
}
if (!incomingBuffer.hasRemaining()) {
incomingBuffer.flip();
if (incomingBuffer == lenBuffer) {
recvCount++;
readLength();
} else if (!initialized) {
// Read the response to the connect request
readConnectResult();
enableRead();
if (findSendablePacket(outgoingQueue,
cnxn.sendThread.clientTunneledAuthenticationInProgress()) != null) {
// Since SASL authentication has completed (if client is configured to do so),
// outgoing packets waiting in the outgoingQueue can now be sent.
enableWrite();
}
lenBuffer.clear();
incomingBuffer = lenBuffer;
updateLastHeard();
initialized = true;
} else {
// Read a response from the server
sendThread.readResponse(incomingBuffer);
lenBuffer.clear();
incomingBuffer = lenBuffer;
updateLastHeard();
}
}
}
// Write
if (sockKey.isWritable()) {
// Hold the outgoingQueue lock while picking and writing a packet
synchronized(outgoingQueue) {
// Pick the next packet to send
Packet p = findSendablePacket(outgoingQueue,
cnxn.sendThread.clientTunneledAuthenticationInProgress());
if (p != null) {
updateLastSend();
// If we already started writing p, p.bb will already exist
if (p.bb == null) {
if ((p.requestHeader != null) &&
(p.requestHeader.getType() != OpCode.ping) &&
(p.requestHeader.getType() != OpCode.auth)) {
p.requestHeader.setXid(cnxn.getXid());
}
p.createBB();
}
// Write the packet to the server
sock.write(p.bb);
if (!p.bb.hasRemaining()) {
sentCount++;
// Remove the request once it has been fully sent
outgoingQueue.removeFirstOccurrence(p);
if (p.requestHeader != null
&& p.requestHeader.getType() != OpCode.ping
&& p.requestHeader.getType() != OpCode.auth) {
synchronized (pendingQueue) {
pendingQueue.add(p);
}
}
}
}
if (outgoingQueue.isEmpty()) {
// No more packets to send: turn off write interest flag.
// Will be turned on later by a later call to enableWrite(),
// from within ZooKeeperSaslClient (if client is configured
// to attempt SASL authentication), or in either doIO() or
// in doTransport() if not.
disableWrite();
} else if (!initialized && p != null && !p.bb.hasRemaining()) {
// On initial connection, write the complete connect request
// packet, but then disable further writes until after
// receiving a successful connection response. If the
// session is expired, then the server sends the expiration
// response and immediately closes its end of the socket. If
// the client is simultaneously writing on its end, then the
// TCP stack may choose to abort with RST, in which case the
// client would never receive the session expired event. See
// http://docs.oracle.com/javase/6/docs/technotes/guides/net/articles/connection_release.html
disableWrite();
} else {
// Just in case
enableWrite();
}
}
}
}
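The read branch above switches between lenBuffer and a body buffer, which is a length-prefixed framing scheme: read a 4-byte length, then read exactly that many bytes. The following is a simplified sketch of that idea using only `ByteBuffer`; it feeds bytes from an in-memory buffer instead of a socket, treats each body as a UTF-8 string, and all class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedReaderSketch {
    private final ByteBuffer lenBuffer = ByteBuffer.allocate(4);
    private ByteBuffer incomingBuffer = lenBuffer; // starts by expecting a length
    final List<String> frames = new ArrayList<>();

    /** Consumes the bytes currently available in src, as one readable event would. */
    void onReadable(ByteBuffer src) {
        while (true) {
            while (src.hasRemaining() && incomingBuffer.hasRemaining()) {
                incomingBuffer.put(src.get());
            }
            if (incomingBuffer.hasRemaining()) {
                return; // partial length or body: wait for more bytes
            }
            incomingBuffer.flip();
            if (incomingBuffer == lenBuffer) {
                int len = incomingBuffer.getInt();
                if (len < 0) {
                    throw new IllegalStateException("negative frame length " + len);
                }
                incomingBuffer = ByteBuffer.allocate(len); // now expect the body
            } else {
                frames.add(new String(incomingBuffer.array(), 0,
                        incomingBuffer.limit(), StandardCharsets.UTF_8));
                lenBuffer.clear();
                incomingBuffer = lenBuffer; // back to expecting a length
            }
        }
    }

    /** Builds one frame: 4-byte big-endian length followed by the body. */
    static ByteBuffer frame(String body) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ByteBuffer bb = ByteBuffer.allocate(4 + bytes.length);
        bb.putInt(bytes.length).put(bytes);
        bb.flip();
        return bb;
    }

    /** Feeds two frames split at an awkward boundary and returns what was decoded. */
    static List<String> demo() {
        ByteBuffer all = ByteBuffer.allocate(15); // 9 bytes for "hello" + 6 bytes for "zk"
        all.put(frame("hello")).put(frame("zk"));
        byte[] bytes = all.array();
        LengthPrefixedReaderSketch reader = new LengthPrefixedReaderSketch();
        reader.onReadable(ByteBuffer.wrap(bytes, 0, 6));                // length + "he"
        reader.onReadable(ByteBuffer.wrap(bytes, 6, bytes.length - 6)); // the rest
        return reader.frames;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [hello, zk]
    }
}
```

In doIO() the same swap happens through `readLength()` and the `incomingBuffer = lenBuffer` resets, with the `initialized` flag deciding whether a completed body is the connect response or an ordinary server response.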
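doIO() is where the network reads and writes actually happen: it contains both the logic for sending requests to the server and the logic for reading the server's responses.
Let's focus on how the response to the connect request is handled.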
void readConnectResult() throws IOException {
ByteBufferInputStream bbis = new ByteBufferInputStream(incomingBuffer);
BinaryInputArchive bbia = BinaryInputArchive.getArchive(bbis);
ConnectResponse conRsp = new ConnectResponse();
conRsp.deserialize(bbia, "connect");
// read "is read-only" flag
boolean isRO = false;
try {
isRO = bbia.readBool("readOnly");
} catch (IOException e) {
// this is ok -- just a packet from an old server which
// doesn't contain readOnly field
LOG.warn("Connected to an old server; r-o mode will be unavailable");
}
this.sessionId = conRsp.getSessionId();
// Callback invoked once the connection is established
sendThread.onConnected(conRsp.getTimeOut(), this.sessionId,
conRsp.getPasswd(), isRO);
}
void onConnected(int _negotiatedSessionTimeout, long _sessionId,
byte[] _sessionPasswd, boolean isRO) throws IOException {
negotiatedSessionTimeout = _negotiatedSessionTimeout;
// The session has expired
if (negotiatedSessionTimeout <= 0) {
state = States.CLOSED;
eventThread.queueEvent(new WatchedEvent(
Watcher.Event.EventType.None,
Watcher.Event.KeeperState.Expired, null));
eventThread.queueEventOfDeath();
throw new SessionExpiredException(
"Unable to reconnect to ZooKeeper service, session 0x"
+ Long.toHexString(sessionId) + " has expired");
}
if (!readOnly && isRO) {
LOG.error("Read/write client got connected to read-only server");
}
readTimeout = negotiatedSessionTimeout * 2 / 3;
connectTimeout = negotiatedSessionTimeout / hostProvider.size();
hostProvider.onConnected();
sessionId = _sessionId;
sessionPasswd = _sessionPasswd;
state = (isRO) ?
States.CONNECTEDREADONLY : States.CONNECTED;
seenRwServerBefore |= !isRO;
LOG.info("Session establishment complete on server "
+ clientCnxnSocket.getRemoteSocketAddress()
+ ", sessionid = 0x" + Long.toHexString(sessionId)
+ ", negotiated timeout = " + negotiatedSessionTimeout
+ (isRO ? " (READ-ONLY mode)" : ""));
KeeperState eventState = (isRO) ?
KeeperState.ConnectedReadOnly : KeeperState.SyncConnected;
// Queue the connection event
eventThread.queueEvent(new WatchedEvent(
Watcher.Event.EventType.None,
eventState, null));
}
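To summarize the state handling seen so far, here is a reduced sketch of the client states and of the decision onConnected() makes. The `isAlive()` check follows the comment in run() (`state != CLOSED && state != AUTH_FAILED`); the `isConnected()` definition and the trimmed list of states are assumptions made for this illustration, and `afterConnectResponse` is a hypothetical helper, not a ZooKeeper method.

```java
class ClientStateSketch {
    /** Reduced set of client states used in the listings of this article. */
    enum State {
        CONNECTING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED;

        /** As noted in run(): alive unless closed or authentication failed. */
        boolean isAlive() {
            return this != CLOSED && this != AUTH_FAILED;
        }

        /** Assumed definition: connected to either a read/write or a read-only server. */
        boolean isConnected() {
            return this == CONNECTED || this == CONNECTEDREADONLY;
        }
    }

    /** Hypothetical helper that mirrors the state decision in onConnected(). */
    static State afterConnectResponse(int negotiatedSessionTimeout, boolean isRO) {
        if (negotiatedSessionTimeout <= 0) {
            return State.CLOSED; // the server reports the session as expired
        }
        return isRO ? State.CONNECTEDREADONLY : State.CONNECTED;
    }

    public static void main(String[] args) {
        System.out.println(afterConnectResponse(3000, false)); // prints CONNECTED
        System.out.println(afterConnectResponse(3000, true));  // prints CONNECTEDREADONLY
        System.out.println(afterConnectResponse(0, false));    // prints CLOSED
    }
}
```

Only after the state becomes CONNECTED or CONNECTEDREADONLY does the watcher receive the SyncConnected or ConnectedReadOnly event that the connect() example at the top waits for.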
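At this point we have a reasonable understanding of how ZooKeeper establishes a connection. The whole process is shown in the figure below:
The process does not involve many objects, but they are conceptually dense and some of them depend on one another, so I also sketched a rough class diagram centered on the ClientCnxn class.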