Notes on Troubleshooting Slow HBase Writes

Version: 1.2.6

Symptoms: HBase writes take a long time, the slow-put count on a single machine rises, and flushes on that machine are slow. When memstore usage reaches the high watermark, requests are blocked until usage drops back below it (once usage reaches the low watermark, flushes run continuously until it falls below the low watermark). The heap is currently 24 GB, so the high watermark is 9.6 GB (24 × 0.4) and the low watermark is 7.68 GB (9.6 × 0.8).

Analysis:

1 Slow flushes
Writing to HDFS is not the slow part; the suspicion is that the scan side of the flush takes too long, and may even have deadlocked.
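As a rough illustration of that suspicion, a minimal sketch (not the actual DefaultStoreFlusher code): a flush drains the memstore snapshot through a scanner, so a slow scanner, e.g. one crawling past many deleted cells, stalls the flush even though each HDFS append is fast. InternalScanner is the real HBase interface; appendToTmpHFile is a hypothetical stand-in for the HFile writer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.regionserver.InternalScanner;

class FlushScanSketch {
    // Drain the memstore snapshot through a scanner, as a flush conceptually does.
    void drainSnapshot(InternalScanner scanner) throws IOException {
        List<Cell> batch = new ArrayList<>();
        boolean more;
        do {
            more = scanner.next(batch);   // the suspected slow step
            for (Cell cell : batch) {
                appendToTmpHFile(cell);   // the fast step: appending to the .tmp HFile
            }
            batch.clear();
        } while (more);
    }

    void appendToTmpHFile(Cell cell) { /* hypothetical writer hook */ }
}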

2 Large flush sizes
This is easy to explain: reaching 128 MB does not trigger an immediate flush; it only puts the region on the flush queue. That queue serves both ordinary region flushes (FIFO) and watermark flushes (which prioritize the regions occupying the most memory). If flushing is slow, your region's memstore may already be far past the threshold by the time it gets processed.
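A minimal sketch of those two paths, assuming the behavior described above (the Region interface here is a stand-in for this sketch, not the HBase class):

import java.util.Collection;
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class FlushQueueSketch {
    // Hypothetical minimal region abstraction, just for this sketch.
    interface Region {
        long memstoreSize();
    }

    // Ordinary flushes are FIFO: crossing 128 MB only enqueues the region.
    private final BlockingQueue<Region> flushQueue = new LinkedBlockingQueue<>();

    void requestFlush(Region r) {
        flushQueue.offer(r);   // asynchronous: the flush happens later
    }

    // Under global heap pressure the flusher instead picks the region with the
    // largest memstore, which is why the watermark flush in the log below reports
    // a 9.4 GB total while flushing a single region's 168 MB.
    Region pickUnderPressure(Collection<Region> onlineRegions) {
        return onlineRegions.stream()
                .max(Comparator.comparingLong(Region::memstoreSize))
                .orElseThrow(IllegalStateException::new);
    }
}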

3 Rapidly growing HLog count
Possibly a problem with the WAL deletion mechanism?

Fix: raise the high watermark to 14.4 GB (24 × 0.6) so that requests are blocked as rarely as possible.
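For reference, a sketch of that change expressed through the Configuration API; in practice these properties go into hbase-site.xml followed by a RegionServer restart. The property names are the HBase 1.x ones; verify them against 1.2.6 before applying.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

class WatermarkConfigSketch {
    static Configuration raisedWatermarks() {
        Configuration conf = HBaseConfiguration.create();
        // High watermark as a fraction of the heap: 24 GB * 0.6 = 14.4 GB.
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.6f);
        // Low watermark as a fraction of the high one: 14.4 GB * 0.8 = 11.52 GB.
        conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.8f);
        return conf;
    }
}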

Logs:
A long watermark-triggered flush — flushes all of the region's stores:
Flush of region hubble_app_log_diagnosis,574D9B34-A876-4A4B-A846-B5D0873BA2CC92233704017307906970718,1635282389698.d0b92fc4724e4f230616b26d03cffa17. due to global heap pressure. Total Memstore size=9.4 G, Region memstore size=168.0 M
2022-01-18 08:32:09,640 INFO [MemStoreFlusher.1] regionserver.HRegion: Flushing 2/2 column families, memstore=168.04 MB
2022-01-18 08:33:12,603 INFO [MemStoreFlusher.1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=2549251141, memsize=168.0 M, hasBloomFilter=true, into tmp file hdfs://hubble:8020/hbase_data/data/default/hubble_app_log_diagnosis/d0b92fc4724e4f230616b26d03cffa17/.tmp/81e64cda1e87440dbfbd168bb4f9b318
2022-01-18 08:33:12,610 INFO [MemStoreFlusher.1] regionserver.HStore: Added hdfs://hubble:8020/hbase_data/data/default/hubble_app_log_diagnosis/d0b92fc4724e4f230616b26d03cffa17/cf/81e64cda1e87440dbfbd168bb4f9b318, entries=520165, sequenceid=2549251141, filesize=19.6 M
2022-01-18 08:33:12,611 INFO [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~168.04 MB/176205272, currentsize=4.94 MB/5180784 for region hubble_app_log_diagnosis,574D9B34-A876-4A4B-A846-B5D0873BA2CC92233704017307906970718,1635282389698.d0b92fc4724e4f230616b26d03cffa17. in 62971ms, sequenceid=2549251141, compaction requested=true

Ordinary flush — when a region's total memstore size exceeds 128 MB, a flush of that region is triggered; the default flushPolicy flushes the stores larger than 16 MB. If the region's total memstore size reaches 128 MB × 4, writes are blocked and a RegionTooBusyException is thrown; note this differs from the slow puts above (blocked until memory falls below the high watermark). See the sketch after the log excerpt.
Flushing 1/2 column families, memstore=128.00 MB; cf=128.01 MB
2022-01-20 22:32:39,330 INFO [MemStoreFlusher.1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=2366406713, memsize=128.0 M, hasBloomFilter=true, into tmp file hdfs://hubble:8020/hbase_data/data/default/hubble_app_log_diagnosis/da586b925c3f41515c0d6870e01371f5/.tmp/b932318cba5948b9baa0a92fc9763ed6
2022-01-20 22:32:39,342 INFO [MemStoreFlusher.1] regionserver.HStore: Added hdfs://hubble:8020/hbase_data/data/default/hubble_app_log_diagnosis/da586b925c3f41515c0d6870e01371f5/cf/b932318cba5948b9baa0a92fc9763ed6, entries=396461, sequenceid=2366406713, filesize=14.5 M
2022-01-20 22:32:39,343 INFO [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~128.01 MB/134223640, currentsize=58.96 KB/60376 for region hubble_app_log_diagnosis,4352a67c50b1ebf3e70b712c75d4bb3992233704432624727940749,1624348860256.da586b925c3f41515c0d6870e01371f5. in 547ms, sequenceid=2366406713, compaction requested=true
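A simplified sketch of the per-region checks just described (not the real HRegion code), with the default 1.2.x values; requestFlush is a hypothetical hook into the flush queue:

import org.apache.hadoop.hbase.RegionTooBusyException;

class RegionFlushCheckSketch {
    static final long FLUSH_SIZE = 128L << 20;   // hbase.hregion.memstore.flush.size
    static final int BLOCK_MULTIPLIER = 4;       // hbase.hregion.memstore.block.multiplier

    void checkResources(long regionMemstoreBytes) throws RegionTooBusyException {
        if (regionMemstoreBytes >= FLUSH_SIZE * BLOCK_MULTIPLIER) {
            // 128 MB * 4 = 512 MB: the write is rejected with an exception,
            // unlike the global high watermark, which blocks the caller instead.
            throw new RegionTooBusyException("Above memstore limit");
        }
        if (regionMemstoreBytes >= FLUSH_SIZE) {
            requestFlush();   // only enqueues the region; the flush runs asynchronously
        }
    }

    void requestFlush() { /* hypothetical hook into the flush queue */ }
}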

Related issues:
https://issues.apache.org/jira/browse/HBASE-2087
1 The method MemStoreFlusher#checkStoreFileCount is called from flushRegion. flushRegion is called by the MemStoreFlusher#run thread. If checkStoreFileCount finds too many store files, it'll stick around waiting on a compaction to happen. While it's hanging, MemStoreFlusher#run is held up. No other region can flush. Meantime WALs will be rolling and memory will be accumulating writes.
2 The issue of delaying memstore flush still occurs after backporting HBASE-15871. Reverse scan takes a long time to seek the previous row in a memstore full of deleted cells.
3 Memstore flush doesn't finish because of backwardSeek() in the memstore scanner.
4 In a write-heavy scenario, one of the regions does not get flushed, causing RegionTooBusyException.
