If you need to run MapReduce jobs over data stored in HBase, remember to turn on compression.
I/O speed is very often the performance bottleneck, so focusing on I/O is usually a good place to start when tuning performance.
Data compression is one way to improve I/O performance, because fewer bytes have to be read from and written to disk.
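As a rough illustration of why compression helps, here is a minimal sketch that measures how much smaller repetitive row data becomes once compressed. It uses `zlib` from the Python standard library purely as a stand-in, since LZO is not in the standard library. The sample rows and the `compressed_fraction` helper are made up for the example, and the ratio you get on real HBase data will differ.

```python
import zlib


def compressed_fraction(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(payload, level)) / len(payload)


# Hypothetical sample data: repetitive rows, loosely resembling stored records.
rows = b"".join(
    b"row-%08d|status=OK|region=us-east\n" % i for i in range(10_000)
)

fraction = compressed_fraction(rows)
print(f"{len(rows)} bytes shrink to {fraction:.1%} of the original size")
```

Fewer bytes on disk means fewer bytes for each map task to read, at the cost of some CPU time spent decompressing.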
The table below shows our case: the same HBase data stored with LZO compression and with no compression.
| Compression algorithm | Record count | HDFS space usage (GB) | MapReduce job time |
|---|---|---|---|
| NONE | 400,000 | 190 | 19 min 24 s |
| LZO | 400,000 | 46 | 9 min 34 s |
With LZO the job ran roughly twice as fast (the runtime was cut by about half), and HDFS space usage dropped by about 76%. Impressive.
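For reference, the figures in the table can be turned into a speedup and a space saving with a few lines of arithmetic. This is only a sketch over the two measurements reported above, and the variable names are chosen for the example.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds job time to total seconds."""
    return minutes * 60 + seconds


# Measurements taken from the table above.
none_time = to_seconds(19, 24)  # uncompressed job time
lzo_time = to_seconds(9, 34)    # LZO job time
none_gb, lzo_gb = 190, 46       # HDFS space usage in GB

speedup = none_time / lzo_time
time_saved = 1 - lzo_time / none_time
space_saved = 1 - lzo_gb / none_gb

print(f"speedup: {speedup:.2f}x")              # speedup: 2.03x
print(f"runtime reduction: {time_saved:.1%}")  # runtime reduction: 50.7%
print(f"space reduction: {space_saved:.1%}")   # space reduction: 75.8%
```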
As for the choice of compression algorithm, Snappy is another option, and it appears to be faster than LZO.
See http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/ and http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of