Notes on an intermittent long-GC problem with HBase and how it was solved

This post describes a production issue where HBase RegionServers repeatedly hit very long GC pauses, interrupting service. Log analysis confirmed that the RegionServer stops responding during the GC, its ZooKeeper session then expires, and the RegionServer finally aborts itself. After several tuning attempts failed, the RegionServer process turned out to be using a large amount of virtual memory (swap); once swap was disabled on the machines, the GC problem improved dramatically, pointing to memory management as the key to the fix.

In our production environment, a long GC pause showed up every so often, with GC times often running 200 seconds or longer. The HBase log looked like this:

2022-03-25 16:53:46,892 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 290276ms
GC pool 'ParNew' had collection(s): count=1 time=333ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=289886ms
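
The pause itself is only reported by HBase's JvmPauseMonitor; to see what the collector was actually doing during those hundreds of seconds, it helps to have full GC logging switched on. A minimal sketch for hbase-env.sh, assuming JDK 8-style GC flags (the log path is illustrative):

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/gc-regionserver.log"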

That pause warning may then be followed by logs like these:

2022-03-25 16:53:47,411 ERROR [regionserver/hbase-slave-002:16021] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.MetaDataEndpointImpl, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.MetaDataRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2022-03-25 16:53:47,405 ERROR [regionserver/hbase-slave-002:16021.logRoller] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-002,16021,1647600308161: IOE in log roller *****
java.io.IOException: cannot get log writer
2022-03-24 16:01:43,008 ERROR [regionserver/hbase-slave-007:16021] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-007: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hbase-slave-007,16021,1647443899207 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:365)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:252)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:461)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:11087)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)

2022-02-18 16:37:44,908 ERROR [main-EventThread] regionserver.HRegionServer: ***** ABORTING region server hbase-slave-007: regionserver, quorum=zookeeper-01:2181,zookeeper-02:2181,zookeeper-03:2181, baseZNode=/hbase regionserver:16021-0x7c324ef377379c received expired from ZooKeeper, aborting *****
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.hadoop.hbase.zookeeper.ZKWatcher.connectionEvent(ZKWatcher.java:520)
	at org.apache.hadoop.hbase.zookeeper.ZKWatcher.process(ZKWatcher.java:452)
	at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)

The cause is clear: the RegionServer's GC ran for so long that the process was unresponsive the whole time, so its ephemeral node in ZooKeeper expired and the master marked the RegionServer as dead. When the RegionServer finally came back from the GC, it aborted itself for one reason or another. (If you are curious about the exact sequence, go read the source code.)
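
For reference, how long a pause the RegionServer can survive is bounded by zookeeper.session.timeout in hbase-site.xml (90000 ms by default, and the effective value is further constrained by the ZooKeeper server's own session-timeout limits). A pause of 200-plus seconds is far beyond any sensible setting, which is why merely raising the timeout cannot be a real fix; the snippet below is only an illustration of where that threshold lives:

<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value> <!-- milliseconds; the effective value is negotiated with the ZK server -->
</property>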

After the GC problem appeared, we tried many fixes: adjusting HBase's compression settings, changing the CMS collection threshold and its fragmentation/compaction parameters, switching to a different garbage collector, raising the ZooKeeper session timeout, and so on. None of them solved the underlying GC problem.
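
The post doesn't record the exact values that were tried, but for context, a round of CMS tuning in hbase-env.sh typically looks roughly like the sketch below (JDK 8-era flags, values purely illustrative); "switching the garbage collector" usually means replacing these CMS flags with -XX:+UseG1GC:

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=5"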

Later, I asked for help in the Alibaba HBase community group, and the guess there was that the problem was caused by the HBase process using too much virtual memory (swap).

On Linux, find the RegionServer's process id with:

jps

Then check how much swap the process is using:

cat /proc/[regionserver pid]/status | grep VmSwap
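
The two steps can also be combined into one small check; a sketch, assuming jps lists the process as HRegionServer:

pid=$(jps | awk '/HRegionServer/ {print $1}')   # grab the RegionServer pid
grep VmSwap /proc/"$pid"/status                 # VmSwap is reported in kB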

It turned out the RegionServer's swap (VmSwap) usage was as high as 3 GB!

Afterwards, the ops colleagues gradually disabled swap on the production HBase machines, and the GC situation improved dramatically: long HBase GC pauses now occur only very rarely.
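
The post doesn't show the exact commands the ops team ran; a common way to turn swap off on Linux is roughly the following (as root, and only after confirming the machine has enough physical memory to run without swap):

swapoff -a                 # release swap immediately; in-use pages are pulled back into RAM
# then comment out the swap entries in /etc/fstab so swap stays off after a reboot
sysctl -w vm.swappiness=0  # softer alternative: tell the kernel to avoid swapping anonymous pages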
