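Recently our HBase cluster started throwing a large number of "KeeperErrorCode = ConnectionLoss for /hbase/splitWAL" exceptions, and HBase could not be brought back up after a restart. Careful diagnosis showed that HBase had accumulated a very large number of WAL files (30 TB in total). As a result, the HBase znode in ZooKeeper that stores the WAL file information grew past the default size limit of 4096*1024 bytes and could no longer be served, so the HBase Master failed to start. Raising the ZooKeeper znode size parameter and cleaning up the WAL files finally resolved the problem.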
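To make the size argument concrete, here is a rough back-of-the-envelope sketch of how a long list of child names under one znode can outgrow the 4096*1024 byte limit. It assumes each child name costs a 4-byte length prefix plus its UTF-8 bytes, preceded by a 4-byte child count, and it ignores any reply header. The name length and task counts are made up for illustration and are not measurements from this cluster.

```python
# Rough estimate of the size of a ZooKeeper child-list reply.
# Assumption: 4-byte child count, then per child a 4-byte length prefix
# plus the UTF-8 bytes of the name. Reply header overhead is ignored.

LIMIT_BYTES = 4096 * 1024  # default limit quoted in the text


def children_reply_size(names):
    """Return the estimated serialized size, in bytes, of a child-name list."""
    return 4 + sum(4 + len(name.encode("utf-8")) for name in names)


def exceeds_limit(names, limit=LIMIT_BYTES):
    """Return True if the estimated reply is larger than the limit."""
    return children_reply_size(names) > limit


if __name__ == "__main__":
    # Hypothetical workload: every split task name is 200 bytes long.
    name = "x" * 200
    for count in (10_000, 30_000):
        names = [name] * count
        print(count, children_reply_size(names), exceeds_limit(names))
```

Under these assumptions the limit is crossed at roughly twenty thousand 200-byte names, so a large WAL backlog that produces one split task per file could plausibly reach it.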
Symptoms
The logs report that the region server cannot reach the /hbase/splitWAL znode in ZooKeeper:
2019-12-03 05:48:05,797 ERROR [SplitLogWorker-HDPC238160:60020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2019-12-03 05:48:05,798 WARN [SplitLogWorker-HDPC238160:60020] zookeeper.ZKUtil: regionserver:60020-0x16bdfc9dd27ac74, quorum=HDPC238162:2181,HDPC238160:2181,HDPC238161:2181, baseZNode=/hbase Unable to list children of znode /hbase/splitWAL
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/splitWAL
at org.apache.zookeeper.KeeperException.create(KeeperExcepti