Every so often, a long GC pause shows up in our production environment; the GC time frequently reaches 200 seconds or more. The HBase logs look like this:
2022-03-25 16:53:46,892 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 290276ms
GC pool 'ParNew' had collection(s): count=1 time=333ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=289886ms
This is then often followed by log entries such as:
2022-03-25 16:53:47,411 ERROR [regionserver/hbase-slave-002:16021] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.MetaDataEndpointImpl, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.MetaDataRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2022-03-25 16:53:47,405 ERROR [regionserver/hbase-slave-002:16021.logRoller] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-002,16021,1647600308161: IOE in log roller *****
java.io.IOException: cannot get log writer
2022-03-24 16:01:43,008 ERROR [regionserver/hbase-slave-007:16021] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-007:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hbase-slave-007,16021,1647443899207 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:365)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:252)
at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:461)
at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:11087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
2022-02-18 16:37:44,908 ERROR [main-EventThread] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-007: regionserver, quorum=zookeeper-01:2181,zookeeper-02:2181,zookeeper-03:2181, baseZNode=/hbase regionserver:16021-0x7c324ef377379c received expired from ZooKeeper, aborting *****
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.connectionEvent(ZKWatcher.java:520)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.process(ZKWatcher.java:452)
at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
Clearly, the RegionServer's long GC pause left the process unresponsive, so its ephemeral node in ZooKeeper expired and the master started treating that regionserver as dead. When the RegionServer finally came back from the GC pause, it shut itself down for one of the reasons shown above. (If you are curious about the exact sequence, go read the source code.)
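The parameter that decides how long a pause the RegionServer can survive is the ZooKeeper session timeout. One way to check the value in effect is a sketch like the following, assuming hbase-site.xml lives under $HBASE_HOME/conf (adjust the path for your deployment); if nothing is printed, the default applies:
# zookeeper.session.timeout is in milliseconds; a GC pause longer than this expires the ephemeral node
grep -A1 'zookeeper.session.timeout' $HBASE_HOME/conf/hbase-site.xml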
After the GC problem surfaced we tried many fixes: tuning HBase's compression settings, adjusting the CMS occupancy threshold and fragmentation/compaction parameters, switching to a different garbage collector, increasing the ZooKeeper session timeout, and so on. None of them solved the underlying GC problem.
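For context, the CMS knobs we touched are of the following flavor. This is only an illustrative sketch of the kind of settings involved (placed in hbase-env.sh; the values shown are examples, not our production configuration):
# illustrative CMS tuning in hbase-env.sh; values are examples only
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=5"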
Later we asked for help in the Alibaba HBase community group, and the suspicion there was that the problem was caused by too much of the HBase process's memory sitting in swap (what the post originally called "virtual memory").
On Linux, find the RegionServer's process id with:
jps
Then check how much of that process has been swapped out:
cat /proc/[regionserver pid]/status | grep VmSwap
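If you don't want to look up the pid by hand, the two steps can be combined. A sketch, assuming the RegionServer's main class name HRegionServer appears on its command line:
# print swap usage for every RegionServer process on this machine
for pid in $(pgrep -f HRegionServer); do grep VmSwap /proc/$pid/status; done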
It turned out the RegionServer had as much as 3 GB sitting in swap!
After that, we had our ops colleagues gradually turn off swap on the production HBase machines, and the GC situation improved dramatically: long HBase GC pauses have become very rare!
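For anyone who wants to do the same, the change is roughly the following sketch; the exact steps depend on your distro, and as described above we rolled it out one machine at a time:
# turn off all swap devices for the current boot
swapoff -a
# comment out swap entries in /etc/fstab so swap stays off after a reboot
sed -i.bak '/\sswap\s/s/^/#/' /etc/fstab
# softer alternative: keep swap but tell the kernel to avoid it (the HBase reference guide also recommends a low swappiness)
sysctl -w vm.swappiness=0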