HBase调优之GC超时

1. HBase GC时间过长

1.1 问题描述

...
2018-03-01 17:32:16,243 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 16100ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=16378ms

1.2 当前配置

  1. 客户端 Java 堆大小(字节): 1G
  2. HBase REST Server 的 Java 堆栈大小(字节): 1G
  3. HBase Thrift Server 的 Java 堆栈大小(字节): 1G
  4. HBase Master 的 Java 堆栈大小(字节): 2G
  5. HBase RegionServer 的 Java 堆栈大小(字节): 8G
  6. 客户端 Java 配置选项:
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true
  1. HBase RegionServer 的 Java 配置选项
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
  1. hbase-env.sh 的 HBase 客户端环境高级配置代码段(安全阀):
HBASE_OFFHEAPSIZE=5G
HBASE_OPTS=-XX:MaxDirectMemorySize=18G
  1. BucketCache IOEngine
hbase.bucketcache.ioengine = offheap
  1. BucketCache 大小
hbase.bucketcache.size = 4G

从当前配置中可以看出,HBase的RegionServer java heap size为8G,使用的垃圾回收机制为(old) CMS + (new) ParNew收集器.

2. 调优过程

2.1 打印GC日志

HBase RegionServer 的 Java 配置选项:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 

主要新增最后两个,打印GC日志和时间戳.结果如下:

6.441: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(6855104K)] 662500K(8235264K), 0.1886994 secs] [Times: user=0.39 sys=0.09, real=0.19 secs] 
6.615: [CMS-concurrent-mark-start]
6.617: [CMS-concurrent-mark: 0.002/0.002 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
6.618: [CMS-concurrent-preclean-start]
6.649: [CMS-concurrent-preclean: 0.032/0.032 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] 
6.650: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 11.757: [CMS-concurrent-abortable-preclean: 1.883/5.108 secs] [Times: user=5.99 sys=0.21, real=5.10 secs] 
11.758: [GC (CMS Final Remark) [YG occupancy: 981477 K (1380160 K)]11.758: [Rescan (parallel) , 0.5508233 secs]12.309: [weak refs processing, 0.0000426 secs]12.309: [class unloading, 0.0197898 secs]12.329: [scrub symbol table, 0.0108038 secs]12.340: [scrub string table, 0.0009889 secs][1 CMS-remark: 0K(6855104K)] 981477K(8235264K), 0.5841155 secs] [Times: user=8.36 sys=1.42, real=0.58 secs] 
12.343: [CMS-concurrent-sweep-start]
12.343: [CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
12.343: [CMS-concurrent-reset-start]
12.407: [CMS-concurrent-reset: 0.064/0.064 secs] [Times: user=0.02 sys=0.05, real=0.06 secs] 
75.744: [GC (Allocation Failure) 75.744: [ParNew: 1226816K->38190K(1380160K), 0.0605172 secs] 1226816K->38190K(8235264K), 0.0606329 secs] [Times: user=0.30 sys=0.17, real=0.06 secs] 
119.831: [GC (Allocation Failure) 119.832: [ParNew: 1265006K->54410K(1380160K), 0.0673169 secs] 1265006K->54410K(8235264K), 0.0675545 secs] [Times: user=0.69 sys=0.19, real=0.07 secs] 
131.722: [GC (Allocation Failure) 131.723: [ParNew: 1281226K->128075K(1380160K), 0.0756202 secs] 1281226K->128075K(8235264K), 0.0764645 secs] [Times: user=0.81 sys=0.33, real=0.08 secs] 
138.073: [GC (Allocation Failure) 138.073: [ParNew: 1354891K->153344K(1380160K), 0.3575232 secs] 1354891K->185778K(8235264K), 0.3578412 secs] [Times: user=1.52 sys=0.25, real=0.36 secs] 
177.291: [GC (Allocation Failure) 177.291: [ParNew: 1380160K->153344K(1380160K), 0.3148596 secs] 1412594K->352587K(8235264K), 0.3154605 secs] [Times: user=1.53 sys=0.47, real=0.31 secs] 
194.529: [GC (Allocation Failure) 194.529: [ParNew: 1380160K->153344K(1380160K), 0.2991742 secs] 1579403K->528500K(8235264K), 0.2998621 secs] [Times: user=3.39 sys=0.43, real=0.30 secs] 
... x13
267.073: [GC (Allocation Failure) 267.073: [ParNew: 1380160K->153344K(1380160K), 0.6624354 secs] 5809968K->5070515K(8235264K), 0.6626810 secs] [Times: user=8.93 sys=0.50, real=0.66 secs] 
267.737: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4917171K(6855104K)] 5070681K(8235264K), 0.0235663 secs] [Times: user=0.30 sys=0.00, real=0.03 secs] 
267.761: [CMS-concurrent-mark-start]
268.241: [CMS-concurrent-mark: 0.480/0.480 secs] [Times: user=3.62 sys=0.20, real=0.48 secs] 
268.241: [CMS-concurrent-preclean-start]
268.311: [CMS-concurrent-preclean: 0.068/0.070 secs] [Times: user=0.26 sys=0.01, real=0.07 secs] 
268.311: [CMS-concurrent-abortable-preclean-start]
268.861: [GC (Allocation Failure) 268.861: [ParNew: 1380160K->153344K(1380160K), 0.7025517 secs] 6297331K->5566823K(8235264K), 0.7027702 secs] [Times: user=9.85 sys=0.48, real=0.70 secs] 
270.261: [CMS-concurrent-abortable-preclean: 1.241/1.950 secs] [Times: user=13.50 sys=0.65, real=1.95 secs] 
270.262: [GC (CMS Final Remark) [YG occupancy: 972096 K (1380160 K)]270.262: [Rescan (parallel) , 0.1256659 secs]270.388: [weak refs processing, 0.0049507 secs]270.393: [class unloading, 0.0205177 secs]270.414: [scrub symbol table, 0.0079385 secs]270.422: [scrub string table, 0.0010636 secs][1 CMS-remark: 5413479K(6855104K)] 6385575K(8235264K), 0.1651917 secs] [Times: user=2.26 sys=0.00, real=0.16 secs] 
270.428: [CMS-concurrent-sweep-start]
270.810: [GC (Allocation Failure) 270.810: [ParNew: 1380160K->153342K(1380160K), 0.6945854 secs] 6655115K->5934466K(8235264K), 0.6948417 secs] [Times: user=9.72 sys=0.29, real=0.69 secs] 
272.801: [GC (Allocation Failure) 272.801: [ParNew: 1380158K->153344K(1380160K), 0.7545580 secs] 6699606K->5990059K(8235264K), 0.7548343 secs] [Times: user=10.43 sys=0.09, real=0.75 secs] 
... x5
285.505: [GC (Allocation Failure) 285.505: [ParNew: 1380160K->153344K(1380160K), 1.1717376 secs] 6402712K->5752751K(8235264K), 1.1720205 secs] [Times: user=10.44 sys=0.56, real=1.17 secs] 
287.525: [CMS-concurrent-sweep: 10.434/17.097 secs] [Times: user=110.38 sys=5.66, real=17.10 secs] 
287.525: [CMS-concurrent-reset-start]
287.559: [CMS-concurrent-reset: 0.034/0.034 secs] [Times: user=0.07 sys=0.00, real=0.04 secs] 
288.097: [GC (Allocation Failure) 288.097: [ParNew: 1380160K->153342K(1380160K), 1.4499834 secs] 6974837K->6325971K(8235264K), 1.4502986 secs] [Times: user=7.49 sys=0.47, real=1.45 secs] 

2.1.1 解析CMS 日志

  1. initial Mark
267.737: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4917171K(6855104K)] 5070681K(8235264K), 0.0235663 secs] [Times: user=0.30 sys=0.00, real=0.03 secs] 

==stop-the-wrold==,开始使用CMS收集器进行老年代的回收.初始标记阶段,这个阶段标记由根可以直接到达的对象,标记期间所有的线程会暂停. 老年代容量为6855104K, CMS GC在old区空间占用达到4917171K时触发.
2. concurrent mark start

267.761: [CMS-concurrent-mark-start]

并发标记开始阶段,第一阶段中被暂停的线程重新开始运行,由前一个阶段标记的对象触发,所有可到达的对象都在本阶段中标记
3. concurrent mark end

268.241: [CMS-concurrent-mark: 0.480/0.480 secs] [Times: user=3.62 sys=0.20, real=0.48 secs] 

并发标记阶段结束,占用0.480秒的cpu时间
4. concurrent preclean start

268.241: [CMS-concurrent-preclean-start]

开始预清理阶段
预清理也是一个并发执行阶段. 本阶段中查找前一阶段执行过程中,从新生代晋升或新分配或被更新的对象. 通过并发的重新扫描这些对象,预处理阶段可以减少下一个stop-the-world重新标记阶段的工作量.
5. concurrent preclean end

268.311: [CMS-concurrent-preclean: 0.068/0.070 secs] [Times: user=0.26 sys=0.01, real=0.07 secs] 

预处理阶段耗时 0.068秒,0.070秒 墙钟时间
6. concurrent abortable preclean start

268.311: [CMS-concurrent-abortable-preclean-start]
  1. ParNew minor GC
268.861: [GC (Allocation Failure) 268.861: [ParNew: 1380160K->153344K(1380160K), 0.7025517 secs] 6297331K->5566823K(8235264K), 0.7027702 secs] [Times: user=9.85 sys=0.48, real=0.70 secs] 

==stop-the-world==,新生代使用ParNew并行收集器. 新生代容量1380160K, GC回收后占用从1380160K到153344K, 耗时0.7025517秒. 整个堆的占用从6297331K降低到5566823K,耗时0.7027702秒,堆的总大小为8235264K.
注意,这里的Allocation Failure是正常现象,表明Eden区已满,无法继续分配新对象,需要进行一次Minor GC, 但是这里把正常现象描述成了failure,带有一定的误导性.
8. concurrent abortable preclean end

270.261: [CMS-concurrent-abortable-preclean: 1.241/1.950 secs] [Times: user=13.50 sys=0.65, real=1.95 secs] 
  1. final remark
270.262: [GC (CMS Final Remark) [YG occupancy: 972096 K (1380160 K)]270.262: [Rescan (parallel) , 0.1256659 secs]270.388: [weak refs processing, 0.0049507 secs]270.393: [class unloading, 0.0205177 secs]270.414: [scrub symbol table, 0.0079385 secs]270.422: [scrub string table, 0.0010636 secs][1 CMS-remark: 5413479K(6855104K)] 6385575K(8235264K), 0.1651917 secs] [Times: user=2.26 sys=0.00, real=0.16 secs] 

==stop-the-world==阶段,从根对象及被其引用的对象开始,重新扫描CMS堆中残留的更新过的对象. 这里重新扫描耗时0.1256659秒, 处理弱引用对象耗时 0.0049507秒,本阶段耗时 0.1651917秒.
10. concurrent sweep start

270.428: [CMS-concurrent-sweep-start]

开始并发清理阶段,在清理阶段,应用线程还在运行
11. concurrent sweep end

287.525: [CMS-concurrent-sweep: 10.434/17.097 secs] [Times: user=110.38 sys=5.66, real=17.10 secs] 

并发清理阶段耗时17.097秒
12. concurrent reset start

287.525: [CMS-concurrent-reset-start]

开始并发重置
13. concurrent reset end

287.559: [CMS-concurrent-reset: 0.034/0.034 secs] [Times: user=0.07 sys=0.00, real=0.04 secs] 

并发重置结束,在本阶段,重新初始化CMS内部数据结构,以备下一轮GC的使用.

2.2 调优JVM GC

2.2.1 第一次尝试- 增大JVM内存

修改后配置如下:
1. HBase RegionServer 的 Java 堆栈大小: 16G
2. HBase RegionServer 的 Java 配置选项

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=8  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 

主要将Heap size内存增大1倍, 并且
1) -XX:CMSInitiatingOccupancyFraction=80: Old区触发比率调整为80%
2) -XX:+UseCMSCompactAtFullCollection: CMS是不会移动内存的, 因此, 非常容易产生碎片,导致内存不够用,从而频繁触发Full GC, 因此,内存的压缩这个时候就会被启用, 增加这个参数是个好习惯,可能会影响性能,但是可以消除碎片
3) -XX:CMSFullGCsBeforeCompaction=8 运行多少次Full GC之后,进行一次碎片整理,调整该参数能够优化性能

结果:

并不是很明显,依然频繁的出现GC Allocation Failure, 导致频繁的 ParNew Young GC. 如上日志选择的不是很明显,其实主要是过多的Young GC导致JVM暂停, CMS old GC只有在initial Mark和final remark时,才会stop-the-world,所以,可以考虑减少Young GC频率

第二次尝试- 增大Young 区域所占比例

  1. -Xms、-Xmx分配用来设置进程堆内存的最小大小和最大大小
  2. -Xmn 用来设置堆内新生代的大小,通过这个值我们也可以得到老生代的大小:-Xmx减去-Xmn
  3. -Xss设置每个线程可使用的内存大小,在相同物理内存下,减小这个值能生成更多的线程。当然操作系统对一个进程内的线程数还是有限制的,不能无限生成,经验值在3000~5000左右
  4. -XX:PermSize、-XX:MaxPermSize 分配用来设置永久代的最小大小和最大大小。Java 8以后移除了方法区,取而代之的是本地元空间Metaspace,大小由-XX:MetaspaceSize和-XX:MaxMetaspaceSize调节
  5. -XX:MaxTenuringThreshold 设置转入老生代的存活次数。如果是0,则直接跳过新生代进入老生代
  6. -XX:NewRatio 设置老生代和新生代的比值,例如该值为3,则表示新生代与老生代比值为1:3

所以,通过-Xmn设置新生代区域大小为3G,较少Young GC频率

结果:

一定程度上缓解了GC超时的问题,因为同时删除了HBase中大量数据,所以,待观察. 如果依然存在问题,考虑使用G1垃圾回收器.

参考:
1. http://ifeve.com/jvm-cms-log/
2. http://blog.jobbole.com/109170/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值