How the failure started:
While running a Kylin job, the job failed with a Direct buffer memory exception:
java.io.IOException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleException(HRegion.java:5607)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5579)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2627)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2613)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2595)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2282)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
After restarting all HBase services everything came up normally, but a minute later Ambari showed that while all region servers were healthy, both the active master and the standby master had gone down, still with the same Direct buffer memory error. We raised the HBase off-heap MaxDirectMemorySize parameter in hbase-env from 4G to 6G and restarted HBase, after which the Direct buffer memory exception disappeared. However, the system meta table remained stuck in the RIT (region in transition) state on some nodes.
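For reference, that setting boils down to the JVM flag -XX:MaxDirectMemorySize. A minimal sketch of the equivalent hbase-env.sh change, assuming the stock HBASE_REGIONSERVER_OPTS/HBASE_MASTER_OPTS variables (the real template on an Ambari-managed cluster is more elaborate and is edited through Ambari > HBase > Configs):
# hbase-env.sh: raise the off-heap (direct buffer) ceiling from 4G to 6G
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=6g"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:MaxDirectMemorySize=6g"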
Symptoms:
1. Region in transition
Ambari showed the HBase master starting normally, but the HBase Master UI carried a red warning: the hbase:meta region was in transition on one of the data nodes, and it stayed that way indefinitely. Our guess was that the HBase metadata files were corrupted and the region had fallen into a permanent RIT state.
2. Errors in the region server log
Tailing the live log on the region server holding the RIT region showed mainly the following errors:
- access denied
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=hbase, access=WRITE, inode="/apps/hbase/data/archive/data/GatXtcYysLcsk/KYLIN_BM5CEA4Y43/473ee8eb537051873792fdb417f866ac/F1":root:hdfs:drwxr-xr-x
- too many open files
2018-03-20 21:23:36,518 WARN [62309924@qtp-666312528-1 - Acceptor0 SelectChannelConnector@0.0.0.0:16030] mortbay.log: EXCEPTION
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at org.mortbay.jetty.nio.SelectChannelConnector$1.acceptChannel(SelectChannelConnector.java:75)
at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:695)
at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
ulimit -a for the hbase and hdfs users showed that the open-files limit was not small; the hbase user's open_files=32000 should have been more than enough (a quick way to double-check the limit the running process actually has is sketched below).
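Worth noting: ulimit -a in an interactive shell does not always match the limit the running daemon inherited when it was started. A rough check against the live RegionServer process (the pgrep pattern is only illustrative):
su - hbase -c "ulimit -n"                    # limit a fresh hbase login shell would get
RS_PID=$(pgrep -f HRegionServer | head -1)   # pid of the running RegionServer
grep "open files" /proc/$RS_PID/limits       # limit the daemon actually started with
ls /proc/$RS_PID/fd | wc -l                  # descriptors currently in use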
3. Running status in the hbase shell failed with: HBase master failed to initialize
4. Ran:
su hbase
hbase hbck -fixMeta
After 35 retries it still failed with:
2018-03-19 19:09:36,553 FATAL [hdmaster3:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Failed to get result within timeout, timeout=60000ms
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:206)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:327)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:302)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:167)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
at org.apache.hadoop.hbase.MetaTableAccessor.fullScanOfMeta(MetaTableAccessor.java:143)
at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:163)
at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:130)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:824)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:214)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1884)
at java.lang.Thread.run(Thread.java:745)
We raised hbase.client.scanner.timeout.period from the default 60 s to 600 s, but after waiting the full 600 s the master still died, so we concluded the root cause was data corruption rather than a too-short timeout.
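For the record, the property is set in hbase-site.xml and is expressed in milliseconds; a sketch of the change (on an Ambari-managed cluster it is edited through Ambari > HBase > Configs rather than by hand, and the conf path below is the usual HDP default):
# in hbase-site.xml:
#   <property>
#     <name>hbase.client.scanner.timeout.period</name>
#     <value>600000</value>    <!-- 600 s, up from the 60000 ms default -->
#   </property>
grep -A2 hbase.client.scanner.timeout.period /etc/hbase/conf/hbase-site.xml   # verify after restart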
Resolution:
1. Stop all HBase services.
2. Run:
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
This repairs the meta metadata offline; hbase hbck -fixMeta only works when the master is healthy and a table's meta entries are missing, and it cannot be used when master initialization fails.
Several problems came up while this command was running (a sketch of the clean-up is given after this list):
1) Regions of the kylin_meta table had duplicate starttime values; the offending meta data was moved aside.
2) The WALs files threw errors and were moved to another location.
3) After the master started successfully most regions came online, but a few regions of non-meta tables were still in RIT; after forcibly deleting them, all regions eventually came online and the table data was recovered.
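A rough sketch of that offline clean-up, with placeholder paths and backup suffixes (the rootdir /apps/hbase/data matches this cluster's layout; adjust to your own hbase.rootdir, and the chown is only needed if the archive directories ended up owned by root, as in the access-denied error above):
sudo -u hdfs hdfs dfs -chown -R hbase:hdfs /apps/hbase/data/archive      # hand root-owned archive dirs back to hbase
su - hbase
hdfs dfs -mv /apps/hbase/data/WALs /apps/hbase/data/WALs.bak-20180320    # sideline the problematic WAL files
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair                # rebuild hbase:meta from the region dirs on HDFS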
3. Delete the hbase znode: open the ZooKeeper CLI and remove the hbase znode (sketched after this list).
4. Start HBase.
5. If problems remain, run:
hbase hbck -repair to fix the remaining inconsistencies.
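A sketch of steps 3-5. The znode path depends on zookeeper.znode.parent (/hbase-unsecure on an unsecured HDP cluster, commonly /hbase elsewhere); hbase zkcli simply opens a ZooKeeper shell already pointed at HBase's quorum:
hbase zkcli                  # ZooKeeper CLI connected to HBase's quorum
rmr /hbase-unsecure          # wipe HBase's ZooKeeper state; it is rebuilt on the next startup
quit
# start HBase again (via Ambari here, or bin/start-hbase.sh on a plain install), then if needed:
su - hbase -c "hbase hbck -repair"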
Summary
The recovery was painful: the production cluster was broken for three days, and we went down many wrong paths while digging through the master and region server logs without finding the right direction. The data was eventually recovered, but Kylin's metadata table was lost and every cube had to be rebuilt, so the advice to everyone is to back up Kylin's metadata regularly. Although all services are back up, we never found the true root cause of the problem; it may be a bug in Kylin itself, or the repeated crashes and restarts may have corrupted the meta table.