Client API - Best Practices for performance tuning

When using the HBase client API, paying attention to a few details can yield noticeably better performance. Below is a list of things that affect performance:

When reading or writing data from a client using the API there are a handful of optimizations you should consider to gain the best performance. Here is the list of the best practice options:


Disable Auto Flush

Take full advantage of the client write-buffer to improve performance. The caveat is that data still sitting in the buffer, not yet committed to the HBase server, may be lost.

When performing a lot of put operations, make sure that the auto flush feature of HTable is set to false, using the setAutoFlush(false) method. Otherwise, the Put instances will be sent one at a time to the region server. Puts added via HTable.put(Put) and HTable.put(List&lt;Put&gt;) wind up in the same write buffer. If auto flushing is disabled, these operations are not sent until the write-buffer is filled. To explicitly flush the messages, call flushCommits(). Calling close() on the HTable instance will implicitly invoke flushCommits().
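A minimal sketch of this pattern, assuming the older HTable-based client API; the configuration object conf, the table name "mytable", and the column family/qualifier "cf"/"q" are placeholders:

HTable table = new HTable(conf, "mytable");   // "mytable" is a placeholder
table.setAutoFlush(false);                    // buffer Puts on the client side
try {
  for (int i = 0; i < 10000; i++) {
    Put put = new Put(Bytes.toBytes("row-" + i));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
    table.put(put);            // queued in the write buffer, not sent per call
  }
  table.flushCommits();        // explicitly push anything still buffered
} finally {
  table.close();               // close() implicitly calls flushCommits()
}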



Use Scanner Caching

I have not applied this setting myself yet. It mainly matters when an HBase table is used as MapReduce input: raising the cache size improves performance, but it also costs more memory on both the client and the region server.

If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching() set to something greater than the default of 1. Using the default value means that the map task will make a callback to the region server for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to having the cache value be large, because it costs more memory on both the client and the region server, so bigger is not always better.
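For instance, a sketch of the Scan setup for a table-backed MapReduce job, assuming the older HTable/TableMapReduceUtil API; "mytable", MyMapper, the output key/value classes, and job are placeholders:

Scan scan = new Scan();
scan.setCaching(500);          // ship 500 rows per RPC instead of the default of 1
scan.setCacheBlocks(false);    // recommended for MapReduce scans, see "Block Cache Usage" below
TableMapReduceUtil.initTableMapperJob(
    "mytable",                 // placeholder table name
    scan,
    MyMapper.class,            // hypothetical TableMapper subclass
    Text.class,                // map output key class (placeholder)
    IntWritable.class,         // map output value class (placeholder)
    job);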



Limit Scan Scope

As far as possible, have the HBase server return only the data we actually need, so that large numbers of useless columns are not shipped over the network to the client. "Slimming down" the returned data matters.

Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected. If Scan.addFamily() is called then all of the columns in the specified column family will be returned to the client. If only a small number of the available columns are to be processed, then only those should be specified in the input scan because column over-selection is a non-trivial performance penalty over large datasets.
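For example, a sketch that requests only the two columns that will actually be processed ("cf", "col1", and "col2" are placeholders) rather than the whole family:

Scan scan = new Scan();
// Request only the columns that will actually be processed.
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col2"));
// By contrast, scan.addFamily(Bytes.toBytes("cf")) would return every
// column in the family, which is wasteful if only a few are needed.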



Close ResultScanners

A good habit is to close whatever you open once you no longer need it. Remember to close the ResultScanner when you are done with it!

This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanner instances, as returned by HTable.getScanner(), you can cause problems on the region servers.


Always have ResultScanner processing enclosed in a try/finally block, for example:
Scan scan = new Scan();
// configure scan instance
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result result : scanner) {
    // process result...
  }
} finally {
  scanner.close();  // always close the scanner!
}
table.close();


Block Cache Usage

Being a cache, it pays off for frequently accessed data; for data that is read once and never touched again, caching is superfluous.

Scan instances can be set to use the block cache in the region server via the setCacheBlocks() method. For scans used with MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.
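As a sketch of the two cases (the row key "frequently-read-row" is a placeholder):

// Full table scan, e.g. as a MapReduce source: skip the block cache
Scan fullScan = new Scan();
fullScan.setCacheBlocks(false);

// Scan starting at a hot row that will be read again and again:
// leave block caching on (the default) so repeated reads are served from memory
Scan hotScan = new Scan(Bytes.toBytes("frequently-read-row"));
hotScan.setCacheBlocks(true);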



Optimal Loading of Row Keys

This is where Filters shine: yet another way to "slim down" what comes back.

When performing a table scan where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a MUST_PASS_ALL operator to the scanner using setFilter(). The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter instance, as explained in the section called “Dedicated Filters”. Using this filter combination will cause the region server to only load the row key of the first KeyValue (in other words, from the first column) found and return it to the client, resulting in minimized network traffic.
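A sketch of that filter combination, assuming the same HTable-based fragment style as above:

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // only the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());       // strip the value, keep only the key
scan.setFilter(filters);

ResultScanner scanner = table.getScanner(scan);
try {
  for (Result result : scanner) {
    byte[] rowKey = result.getRow();   // only the row key crosses the network
    // process rowKey...
  }
} finally {
  scanner.close();
}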



Turn Off WAL on Puts

The WAL guarantees that when a region server crashes, the memstore data that has not yet been flushed to disk can be recovered. Unless you have a specific reason not to, keep it on!

A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the region server will not write the Put to the Write-Ahead Log, only into the memstore; the consequence, however, is that if there is a region server failure there will be data loss. If writeToWAL(false) is used, do so with extreme caution. You may find in actuality that it makes little difference if your load is well distributed across the cluster.
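A minimal sketch, assuming the older client API where Put exposes setWriteToWAL(boolean) (later client versions express the same thing via a Durability setting); "cf" and "q" are placeholder column names:

Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
put.setWriteToWAL(false);   // skip the Write-Ahead Log: faster writes, but the data
                            // is lost if the region server crashes before a memstore flush
table.put(put);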


In general, it is best to use the WAL for puts, and where loading throughput is a concern to use the bulk loading techniques instead, as explained in the section called “Bulk Import”.


Beyond these, there are surely many other things worth paying attention to; that knowledge has to be accumulated through real-world practice!
