Our company runs a shared ZooKeeper cluster that serves multiple businesses ("zk" refers to ZooKeeper below). After a configuration change and restart, one business kept reporting zk-related errors and only recovered after its own process was restarted. Versions in use:

- Server: ZooKeeper 3.4.5
- Client: ZooKeeper 3.4.6, Curator 2.8.0

The errors seen on the server looked like this (the logs below are from an offline reproduction):
```
2016-11-28 23:35:24,440 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:50296
2016-11-28 23:35:24,441 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x158ab91e8910000 at /127.0.0.1:50296
2016-11-28 23:35:24,442 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x158ab91e8910000 with negotiated timeout 15000 for client /127.0.0.1:50296
2016-11-28 23:35:24,454 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x158ab91e8910000 due to java.io.IOException: Len error 1073113
2016-11-28 23:35:24,455 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:50296 which had sessionid 0x158ab91e8910000
```
The conclusion first: upgrading the client to 3.4.8 fixes the problem. The cause: after the server restarts, the SetWatches request the client sends while reconnecting is too large, exceeding the server's jute.maxbuffer (1MB by default); the server reports `Len error` and closes the connection, the client reconnects, and the cycle repeats. The corresponding Jira issue is ZOOKEEPER-706.
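As a server-side stopgap (not the real fix), jute.maxbuffer can be raised above the size of the SetWatches packet. A sketch, assuming a stock zkServer.sh that picks up SERVER_JVMFLAGS from the environment; the 4MB value is illustrative:

```shell
# Illustrative stopgap only: raise the server's jute.maxbuffer to 4MB so the
# oversized SetWatches packet fits. The real fix is upgrading the client.
export SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"
bin/zkServer.sh restart
```

Note that jute.maxbuffer is read on both client and server, so the two sides should be kept consistent if it is changed.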
## Troubleshooting

### The right way

#### 1. Track the error message back to the source

The information we need is in NIOServerCnxn.java:
```
ack 'Len error'
src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java
392:            throw new IOException("Len error " + len);
src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java
540:            throw new IOException("Len error " + len);

vim src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java +540
539        if (len < 0 || len > BinaryInputArchive.maxBuffer) {
540            throw new IOException("Len error " + len);
541        }
```
Keep digging into maxBuffer:

```
ack maxBuffer
src/java/main/org/apache/jute/BinaryInputArchive.java
87:    static public final int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);
122:        // Since this is a rough sanity check, add some padding to maxBuffer to
126:        if (len < 0 || len > maxBuffer + 1024) {
src/java/main/org/apache/zookeeper/ClientCnxn.java
107:     * jute.maxBuffer value. To avoid this we instead split the watch
110:     * with respect to the server's 1MB default for jute.maxBuffer.
```
So maxBuffer is controlled by the jute.maxbuffer Java system property, with a default of 0xfffff = 1048575 bytes. That confirms the packet the client sends is too big. But why would it be so big? If you are lucky enough, the second match of `ack maxBuffer` will catch your eye:
```java
104    /* ZOOKEEPER-706: If a session has a large number of watches set then
105     * attempting to re-establish those watches after a connection loss may
106     * fail due to the SetWatches request exceeding the server's configured
107     * jute.maxBuffer value. To avoid this we instead split the watch
108     * re-establishement across multiple SetWatches calls. This constant
109     * controls the size of each call. It is set to 128kB to be conservative
110     * with respect to the server's 1MB default for jute.maxBuffer.
111     */
112    private static final int SET_WATCHES_MAX_LENGTH = 128 * 1024;
```
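The splitting idea that fixed clients apply can be sketched as follows. This is a simplified illustration, not the actual ClientCnxn code: it greedily packs watched paths into batches so that no single SetWatches call exceeds the 128kB cap (byte sizes are approximated by path length, assuming ASCII paths):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the batching behind SET_WATCHES_MAX_LENGTH:
// pack paths into a batch until adding the next path would push the
// batch past 128kB, then start a new SetWatches call.
public class WatchBatcher {
    static final int SET_WATCHES_MAX_LENGTH = 128 * 1024;

    static List<List<String>> split(List<String> paths) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int size = 0;
        for (String path : paths) {
            int pathSize = path.length(); // approximate bytes, ASCII assumed
            if (!current.isEmpty() && size + pathSize > SET_WATCHES_MAX_LENGTH) {
                batches.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(path);
            size += pathSize;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With roughly 1MB of watched paths this yields several calls, each small enough to pass the server's `Len` check.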
At this point, if you are lucky enough to open Jira ZOOKEEPER-706 and read the description, you will find the error matches exactly. All that remains is to reproduce it and confirm the fix version.
#### 2. Reproduce the problem

The key trigger condition: the client has registered a large number of watches, and the combined length of all watched paths exceeds 1MB. So first create a path, then create 1024 children under it, each with a 1024-byte name suffix.
```java
private void prepareData(CuratorFramework curatorClient) throws Exception {
    int childNum = 1024;
    char[] chars = new char[1024];
    Arrays.fill(chars, 'x');
    String nameSuffix = new String(chars);
    curatorClient.create().forPath(PATH);
    for (int i = 0; i < childNum; i++) {
        String childPath = PATH + "/" + i + "-" + nameSuffix;
        curatorClient.create().forPath(childPath);
    }
}
```
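A back-of-envelope check (assuming one byte per ASCII character and ignoring jute serialization overhead) confirms that the paths generated above are enough to blow past the default limit. The `"/test"` parent here is a placeholder for the PATH constant, which is not shown above:

```java
public class SizeCheck {
    // Estimated total size of the watched paths built by prepareData():
    // parent + "/" + i + "-" + nameLen 'x' characters per child.
    static long watchedPathBytes(String parent, int children, int nameLen) {
        long total = 0;
        for (int i = 0; i < children; i++) {
            total += parent.length() + 1 + String.valueOf(i).length() + 1 + nameLen;
        }
        return total;
    }

    public static void main(String[] args) {
        // "/test" stands in for the PATH constant used in the snippets above.
        long total = watchedPathBytes("/test", 1024, 1024);
        int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff); // 1048575
        System.out.println(total + " bytes vs jute.maxbuffer " + maxBuffer);
    }
}
```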
With the data in place, add the watchers:
```java
private void startWatch(CuratorFramework curatorClient) throws Exception {
    PathChildrenCache watcher = new PathChildrenCache(curatorClient, PATH, true);
    watcher.getListenable().addListener((client, event) -> {
        ChildData data = event.getData();
        if (data == null) {
            log.info("No data in event[" + event + "]");
        } else {
            log.info("Receive event: "
                    + "type=[" + event.getType() + "]"
                    + ", path=[" + data.getPath() + "]"
                    + ", data=[" + new String(data.getData()) + "]"
                    + ", stat=[" + data.getStat().toString().trim() + "]");
        }
    });
    watcher.start(PathChildrenCache.StartMode.BUILD_INITIAL_CACHE);
}
```
All that is left is to restart the ZooKeeper server, and the familiar errors start streaming out.

#### 3. Confirm the fix version

ZOOKEEPER-706 lists the fix version as 3.4.7; checking ZooKeeper's release notes, the release after 3.4.7 is 3.4.8.
#### 4. Verify the fix

The business team prepared two staging machines: node A upgraded its zk client to 3.4.8, node B stayed on 3.4.6. On each machine, simulate a zk server restart with iptables:

```
sudo /sbin/iptables -A OUTPUT -d zkserver -j REJECT
sleep 10
sudo /sbin/iptables -D OUTPUT -d zkserver -j REJECT
```

Node A recovered after reconnecting while node B kept erroring, which confirmed the fix.
### The wrong way

With the right approach above, the whole investigation, reproduction included, might take half an hour. In reality I spent far longer on it ("far longer" being a face-saving way to put it), so here are the wrong turns as they actually happened. In hindsight, most investigations I have been through took detours of one kind or another; what matters is leaning on tools as much as possible to pull yourself back on track quickly.
#### 1. Suspecting the business code

The server is a shared cluster serving multiple businesses, and after the restart only one of them reported errors while the others were fine, so the natural suspicion was a bug in that business's code. In hindsight, the failing business simply had too many watched paths: the problem lay in the data, not the code. Sure enough, after a long look, the business code turned out to be fine.
#### 2. Suspecting the client version

Following the same line of thought: if the code is fine, maybe the version is at fault. I latched onto a rather flimsy clue: Curator 2.8.0 declares a dependency on ZooKeeper 3.4.5, while the business was running 3.4.6. I then tried to reproduce the problem along that path. Predictably, it did not reproduce, and I was stuck.
#### 3. Analyzing packets with tcpdump

A later server restart reproduced the problem in production, and a few staging business machines whose processes had not been restarted preserved the failure in progress. Capturing with tcpdump showed:
```
1:14:33:37.331726 IP client-ip.58029 > server-ip.eforward: Flags [S], seq 844611479, win 14600, options [mss 1460,sackOK,TS val 1820004927 ecr 0], length 0
6:14:33:37.331748 IP server-ip.eforward > client-ip.58029: Flags [S.], seq 30091353, ack 844611480, win 14480, options [mss 1460,sackOK,TS val 3857636803 ecr 1820004927], length 0
11:14:33:37.332833 IP client-ip.58029 > server-ip.eforward: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1820004928 ecr 3857636803], length 0
16:14:33:37.333766 IP client-ip.58029 > server-ip.eforward: Flags [P.], seq 1:50, ack 1, win 14600, options [nop,nop,TS val 1820004929 ecr 3857636803], length 49
24:14:33:37.333789 IP server-ip.eforward > client-ip.58029: Flags [.], ack 50, win 14480, options [nop,nop,TS val 3857636805 ecr 1820004929], length 0
29:14:33:37.334129 IP server-ip.eforward > client-ip.58029: Flags [P.], seq 1:42, ack 50, win 14480, options [nop,nop,TS val 3857636805 ecr 1820004929], length 41
36:14:33:37.335128 IP client-ip.58029 > server-ip.eforward: Flags [.], ack 42, win 14600, options [nop,nop,TS val 1820004930 ecr 3857636805], length 0
41:14:33:37.351116 IP client-ip.58029 > server-ip.eforward: Flags [P.], seq 50:14530, ack 42, win 14600, options [nop,nop,TS val 1820004946 ecr 3857636805], length 14480
951:14:33:37.351138 IP server-ip.eforward > client-ip.58029: Flags [.], ack 14530, win 17376, options [nop,nop,TS val 3857636822 ecr 1820004946], length 0
956:14:33:37.351586 IP server-ip.eforward > client-ip.58029: Flags [F.], seq 42, ack 14530, win 17376, options [nop,nop,TS val 3857636823 ecr 1820004946], length 0
961:14:33:37.351616 IP server-ip.eforward > client-ip.58029: Flags [R.], seq 43, ack 14530, win 17376, options [nop,nop,TS val 3857636823 ecr 1820004946], length 0
966:14:33:37.352159 IP client-ip.58029 > server-ip.eforward: Flags [.], seq 14530:15978, ack 42, win 14600, options [nop,nop,TS val 1820004947 ecr 3857636822], length 1448
1061:14:33:37.352173 IP server-ip.eforward > client-ip.58029: Flags [R], seq 30091395, win 0, length 0
1065:14:33:37.352207 IP client-ip.58029 > server-ip.eforward: Flags [P.], seq 15978:21770, ack 42, win 14600, options [nop,nop,TS val 1820004947 ecr 3857636822], length 5792
1432:14:33:37.352214 IP server-ip.eforward > client-ip.58029: Flags [R], seq 30091395, win 0, length 0
1436:14:33:37.352322 IP client-ip.58029 > server-ip.eforward: Flags [.], seq 21770:23218, ack 42, win 14600, options [nop,nop,TS val 1820004948 ecr 3857636822], length 1448
1531:14:33:37.352329 IP server-ip.eforward > client-ip.58029: Flags [R], seq 30091395, win 0, length 0
1535:14:33:37.352375 IP client-ip.58029 > server-ip.eforward: Flags [.], seq 23218:29010, ack 42, win 14600, options [nop,nop,TS val 1820004948 ecr 3857636822], length 5792
1902:14:33:37.352381 IP server-ip.eforward > client-ip.58029: Flags [R], seq 30091395, win 0, length 0
1906:14:33:37.352420 IP client-ip.58029 > server-ip.eforward: Flags [.], seq 29010:30458, ack 42, win 14600, options [nop,nop,TS val 1820004948 ecr 3857636822], length 1448
2001:14:33:37.352426 IP server-ip.eforward > client-ip.58029: Flags [R], seq 30091395, win 0, length 0
```
Reading by line number:

- Lines 1, 6, and 11 are the TCP three-way handshake;
- Line 16 is the client sending a ConnectRequest; line 29 is the server replying with a ConnectResponse;
- At line 41 the client sends a very large packet, split across multiple TCP segments; after receiving it the server sends a FIN, and every subsequent segment from the client gets RST.
Reading by content, the request sent at line 41 has type 101 (by the zk protocol, every request packet except ConnectRequest carries a type field in its header marking the request type). Per the code in src/java/main/org/apache/zookeeper/ZooDefs.java, type 101 is a setWatches request.
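The framing that makes this readable in a capture can be sketched with a toy decoder. This is an illustration, not the real jute deserializer; the xid value -8 reserved for SetWatches and opcode 101 are taken from the 3.4-branch sources (ClientCnxn.java and ZooDefs.java):

```java
import java.nio.ByteBuffer;

// Toy decoder for zk request framing after the session handshake:
// a 4-byte big-endian length prefix (the "len" the server validates
// against jute.maxbuffer), then the RequestHeader fields xid and type.
public class FrameDecode {
    static int requestType(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int len = buf.getInt();  // payload length following this prefix
        int xid = buf.getInt();  // transaction id
        return buf.getInt();     // opcode; 101 is OpCode.setWatches
    }

    public static void main(String[] args) {
        // Minimal header-only frame: len covers xid + type (8 bytes).
        byte[] frame = ByteBuffer.allocate(12)
                .putInt(8)    // len
                .putInt(-8)   // xid reserved for SetWatches in ClientCnxn
                .putInt(101)  // OpCode.setWatches from ZooDefs.java
                .array();
        System.out.println(requestType(frame));
    }
}
```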
At this point things started to add up: an oversized setWatches request was causing the error, which is where the "right way" described above picked up.
## Open questions

- Q: Why does the problem not occur when the client starts fresh, only when it reconnects?
- A: The start-up logic differs from the reconnect logic. On start-up, the client first calls getChildren2 to fetch the children of the path and then sets watches on them one by one. On reconnect, it re-sets all previously registered watches in a batch.
## Takeaways

- Read the code first whenever possible. Googling first is also a fine habit, but it is easy to sink into a swamp of similar-looking issues, especially when the search keywords are poorly chosen or organized.
- For open-source products you operate, keep an eye on upstream updates.