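Environment
CDH cluster version: 5.16.2
HBase 1.2
Zookeeper 3.4.5
The HBase cluster serves mainly as the storage backend for JanusGraph.
Symptoms
Starting in 2022, regionservers began dropping offline in batches every so often, with the following errors in the logs.
regionserver log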
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:239)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:234)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:215)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:42)
at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.deleteNode(PersistentEphemeralNode.java:315)
at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.close(PersistentEphemeralNode.java:274)
at org.apache.hive.service.server.HiveServer2$DeRegisterWatcher.process(HiveServer2.java:334)
at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:61)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server xxxxx zookeeper connection closed.
INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/xxxx exiting
ERROR org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine:Region server exiting
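Cause
Analysis of the ZooKeeper logs showed that the same session ID had been assigned to different nodes, which invalidated the session on some of them and caused the failures above.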
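A check of this kind can be sketched as a small log scan. This is only an illustration under assumptions: it presumes the server log contains lines shaped like "Established session 0x... with negotiated timeout ... for client /host:port", which should be verified against the real logs first. The class name DuplicateSessionScan and its method are hypothetical helpers, not part of ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups session IDs by the client hosts that were given them.
final class DuplicateSessionScan {
    // ASSUMED log line shape; adjust the pattern to match the actual server log.
    private static final Pattern ESTABLISHED = Pattern.compile(
        "Established session (0x[0-9a-fA-F]+) with negotiated timeout \\d+ for client /?([^:\\s]+)");

    // Returns only the session IDs that were established for more than one client host.
    static Map<String, Set<String>> duplicates(List<String> logLines) {
        Map<String, Set<String>> hostsBySession = new TreeMap<String, Set<String>>();
        for (String line : logLines) {
            Matcher m = ESTABLISHED.matcher(line);
            if (!m.find()) {
                continue; // not a session-establishment line
            }
            Set<String> hosts = hostsBySession.get(m.group(1));
            if (hosts == null) {
                hosts = new TreeSet<String>();
                hostsBySession.put(m.group(1), hosts);
            }
            hosts.add(m.group(2)); // the port is dropped so reconnects from one host are not flagged
        }
        // Keep only sessions seen from at least two distinct hosts.
        List<String> single = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> e : hostsBySession.entrySet()) {
            if (e.getValue().size() < 2) {
                single.add(e.getKey());
            }
        }
        for (String key : single) {
            hostsBySession.remove(key);
        }
        return hostsBySession;
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<String>();
        sample.add("INFO Established session 0xff8000000000001 with negotiated timeout 40000 for client /10.0.0.1:50001");
        sample.add("INFO Established session 0xff8000000000001 with negotiated timeout 40000 for client /10.0.0.2:50002");
        sample.add("INFO Established session 0xff8000000000002 with negotiated timeout 40000 for client /10.0.0.3:50003");
        System.out.println(duplicates(sample));
    }
}
```

The sample lines in main are made up for the demonstration. In practice the input would be the concatenated logs of all ZooKeeper servers in the ensemble, since a duplicate may be handed out by two different servers.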
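The root cause is a bug in how ZooKeeper generates session IDs: when the 40th bit of System.currentTimeMillis() is 1, the sign extension performed by the arithmetic right shift fills the top 8 bits of nextSid with ones, so OR-ing in the server id no longer makes the session ID unique. With a large number of client connections, duplicate IDs can therefore be generated. The recommended fix is to replace the arithmetic right shift (>>) with a logical right shift (>>>). (Reference: ZOOKEEPER-1622)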
Original code, in org.apache.zookeeper.server.SessionTrackerImpl:
public static long initializeNextSession(long id) {
    long nextSid = 0;
    // arithmetic shift: copies the sign bit into the top 8 bits
    nextSid = (System.currentTimeMillis() << 24) >> 8;
    nextSid = nextSid | (id <<56);
    return nextSid;
}
Changed to:
public static long initializeNextSession(long id) {
    LOG.info("initializeNextSession 1622 patch.");
    long nextSid = 0;
    // logical (unsigned) shift: the top 8 bits are zero-filled, leaving room for the server id
    nextSid = (Time.currentElapsedTime() << 24) >>> 8;
    nextSid = nextSid | (id <<56);
    if (nextSid == Long.MIN_VALUE) {
        ++nextSid; // this is an unlikely edge case, but check it just in case
    }
    return nextSid;
}
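The difference between the two shifts can be reproduced in isolation. The sketch below is not ZooKeeper code: the class SessionIdShiftDemo and its two methods are hypothetical, and the clock value is passed in as a parameter (rather than read from System.currentTimeMillis() or Time.currentElapsedTime()) so the result is reproducible. It uses the timestamp 3L << 39 (1649267441664 ms since the epoch, in April 2022), a value at which the bit that ends up as the sign bit after the left shift is 1.

```java
// Hypothetical demo of the shift behaviour described above; not part of ZooKeeper.
final class SessionIdShiftDemo {
    // Same arithmetic as the original method, with the clock value passed in.
    static long arithmeticShift(long nowMillis, long serverId) {
        long nextSid = (nowMillis << 24) >> 8;   // >> copies the sign bit into the top 8 bits
        return nextSid | (serverId << 56);       // OR-ing into bits that are already 1 changes nothing
    }

    // Same arithmetic as the patched method, with the clock value passed in.
    static long logicalShift(long nowMillis, long serverId) {
        long nextSid = (nowMillis << 24) >>> 8;  // >>> zero-fills the top 8 bits
        return nextSid | (serverId << 56);       // the server id now occupies the top 8 bits
    }

    public static void main(String[] args) {
        long t = 3L << 39; // bit 39 (zero-based) is set, so (t << 24) is negative

        System.out.println("arithmetic, server 1: " + Long.toHexString(arithmeticShift(t, 1)));
        System.out.println("arithmetic, server 2: " + Long.toHexString(arithmeticShift(t, 2)));
        System.out.println("logical,    server 1: " + Long.toHexString(logicalShift(t, 1)));
        System.out.println("logical,    server 2: " + Long.toHexString(logicalShift(t, 2)));
    }
}
```

With the arithmetic shift, both servers start from the same seed because the server id is swallowed by the sign-extended bits. With the logical shift, the two seeds differ in their top byte.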
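Fix
Because ZooKeeper is the version bundled with the CDH cluster, upgrading it would have a wide impact. Instead, we downloaded the source, modified this class, compiled it, packaged it separately, and arranged for the patched jar to be loaded first.
Install ant, change into the source directory, and run the ant command to build the package.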
D:\zookeeper-release-3.4.5>ant
ANT_OPTS is set to -Djava.security.manager=allow
Buildfile: D:\zookeeper-release-3.4.5\build.xml
init:
ivy-download:
ivy-taskdef:
ivy-init:
ivy-retrieve:
[ivy:retrieve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:retrieve] :: loading settings :: file = D:\zookeeper-release-3.4.5\ivysettings.xml
[ivy:retrieve] :: resolving dependencies :: org.apache.zookeeper#zookeeper;3.4.5
[ivy:retrieve] confs: [default]
[ivy:retrieve] found org.slf4j#slf4j-api;1.6.1 in maven2
[ivy:retrieve] found org.slf4j#slf4j-log4j12;1.6.1 in maven2
[ivy:retrieve] found log4j#log4j;1.2.15 in maven2
[ivy:retrieve] found jline#jline;0.9.94 in maven2
[ivy:retrieve] found org.jboss.netty#netty;3.2.2.Final in maven2
[ivy:retrieve] :: resolution report :: resolve 169ms :: artifacts dl 23ms
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 0 | 0 | 0 || 5 | 0 |
---------------------------------------------------------------------
[ivy:retrieve] :: retrieving :: org.apache.zookeeper#zookeeper
[ivy:retrieve] confs: [default]
[ivy:retrieve] 0 artifacts copied, 5 already retrieved (0kB/10ms)
clover.setup:
clover.info:
clover:
jute:
compile_jute_uptodate:
compile_jute:
ver-gen:
svn-revision:
[exec]
[exec] D:\zookeeper-release-3.4.5>echo off
[exec] 'svn' is not recognized as an internal or external command,
[exec] operable program or batch file.
[exec] Result: 255
version-info:
[java] Unknown REVISION number, using -1
build-generated:
[javac] Compiling 1 source file to D:\zookeeper-release-3.4.5\build\classes
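[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.5
[javac] warning: [options] source value 1.5 is obsolete and will be removed in a future release
[javac] warning: [options] target value 1.5 is obsolete and will be removed in a future release
[javac] warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
[javac] 4 warnings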
compile:
[javac] Compiling 151 source files to D:\zookeeper-release-3.4.5\build\classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.5
[javac] warning: [options] source value 1.5 is obsolete and will be removed in a future release
[javac] warning: [options] target value 1.5 is obsolete and will be removed in a future release
[javac] warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\JLineZNodeCompletor.java:33: warning: [rawtypes] found raw type: List
[javac] public int complete(String buffer, int cursor, List candidates) {
[javac] ^
[javac] missing type arguments for generic class List<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface List
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\Shell.java:276: warning: [serial] serializable class ExitCodeException has no definition of serialVersionUID
[javac] public static class ExitCodeException extends IOException {
[javac] ^
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\ZooKeeperMain.java:305: warning: [rawtypes] found raw type: Class
[javac] Class consoleC = Class.forName("jline.ConsoleReader");
[javac] ^
[javac] missing type arguments for generic class Class<T>
[javac] where T is a type-variable:
[javac] T extends Object declared in class Class
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\ZooKeeperMain.java:306: warning: [rawtypes] found raw type: Class
[javac] Class completorC =
[javac] ^
[javac] missing type arguments for generic class Class<T>
[javac] where T is a type-variable:
[javac] T extends Object declared in class Class
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\jmx\ManagedUtil.java:62: warning: [rawtypes] found raw type: Enumeration
[javac] Enumeration enumer = r.getCurrentLoggers();
[javac] ^
[javac] missing type arguments for generic class Enumeration<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Enumeration
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\ZooKeeperServer.java:502: warning: [rawtypes] found raw type: ArrayList
[javac] acl == null ? new ArrayList<ACL>() : new ArrayList(acl));
[javac] ^
[javac] missing type arguments for generic class ArrayList<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in class ArrayList
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\quorum\QuorumPeer.java:576: warning: [deprecation] LeaderElection in org.apache.zookeeper.server.quorum has been deprecated
[javac] le = new LeaderElection(this);
[javac] ^
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\quorum\QuorumPeer.java:579: warning: [deprecation] AuthFastLeaderElection in org.apache.zookeeper.server.quorum has been deprecated
[javac] le = new AuthFastLeaderElection(this);
[javac] ^
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\quorum\QuorumPeer.java:582: warning: [deprecation] AuthFastLeaderElection in org.apache.zookeeper.server.quorum has been deprecated
[javac] le = new AuthFastLeaderElection(this, true);
[javac] ^
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\quorum\QuorumPeer.java:603: warning: [deprecation] LeaderElection in org.apache.zookeeper.server.quorum has been deprecated
[javac] electionAlg = new LeaderElection(this);
[javac] ^
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\util\KerberosUtil.java:39: warning: [rawtypes] found raw type: Class
[javac] getInstanceMethod = classRef.getMethod("getInstance", new Class[0]);
[javac] ^
[javac] missing type arguments for generic class Class<T>
[javac] where T is a type-variable:
[javac] T extends Object declared in class Class
[javac] D:\zookeeper-release-3.4.5\src\java\main\org\apache\zookeeper\server\util\KerberosUtil.java:42: warning: [rawtypes] found raw type: Class
[javac] new Class[0]);
[javac] ^
[javac] missing type arguments for generic class Class<T>
[javac] where T is a type-variable:
[javac] T extends Object declared in class Class
[javac] 16 warnings
jar:
[jar] Building jar: D:\zookeeper-release-3.4.5\build\zookeeper-3.4.5.jar
BUILD SUCCESSFUL
Total time: 7 seconds
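Copy the patched jar into /lib/zookeeper/build under the ZooKeeper directory, and confirm from the ZooKeeper startup parameters that jars on this path are loaded first.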
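Besides reading the startup parameters, one way to check which jar actually supplies a class is to ask the JVM for the class's code source. The following is a minimal sketch, not part of ZooKeeper or CDH: the class WhereLoaded and its method locationOf are hypothetical names. It relies on the standard behaviour that classpath entries are searched in order and the first match wins.

```java
import java.security.CodeSource;

// Hypothetical helper: reports the jar or directory a class was loaded from.
final class WhereLoaded {
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // JDK core classes such as java.lang.String have no code source.
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to this class itself.
        String name = args.length > 0 ? args[0] : WhereLoaded.class.getName();
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath, in the same order, as the ZooKeeper server and with the patched class name as the argument, it should print the location of the patched jar if that jar takes precedence.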
The logs confirmed that the change was loaded, and after a long period of running in production the problem is resolved.