Writing HFiles with MapReduce and bulk-loading them into HBase with doBulkLoad (using HFileOutputFormat2.configureIncrementalLoadMap, the recommended approach)


拉普达男孩


Category: Big Data. Tags: MapReduce, HFile, configureIncrementalLoadMap

Article link: https://blog.csdn.net/ITwangnengjie/article/details/103194518


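The previous post analyzed how to use HFileOutputFormat2.configureIncrementalLoad and what its drawbacks are. It showed that the source of configureIncrementalLoad already sets up the reduce phase for us, including the map and reduce output formats (KeyValue or Put), the number of reducers, and so on. configureIncrementalLoadMap does none of this, so these settings have to be specified by hand when the job is built.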

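When using configureIncrementalLoadMap here, the output format is KeyValue, and the reducer class is org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer, the KeyValue-sorting reducer that ships with HBase. I originally planned to write my own, but while using HFileOutputFormat2.configureIncrementalLoad I noticed that this is exactly the class it sets as the reducer, so I simply reused it. Note that with configureIncrementalLoadMap it is not enough for the rowkeys to be sorted; the KeyValues must be sorted as well (in practice this means the column names within a column family: data must be written in column-name order, otherwise the job fails with this error: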

Error: java.io.IOException: Added a key not lexically larger than previous. Current cell = xxxxxxxxxx_574747147/common:00000002/1572736500000/Put/vlen=16/seqid=0, lastCell = xxxxxxxxxx_574747147/common:000000001/1572736500000/Put/vlen=16/seqid=0
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
	at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:265)
	at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:994)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:199)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
	at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
	at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
	at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
	at com.xxx.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase.FiveMinQuMRT2$ToHFileReducer.reduce(FiveMinQuMRT2.java:220)
	at com.xxx.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase.FiveMinQuMRT2$ToHFileReducer.reduce(FiveMinQuMRT2.java:211)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

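This happens because the KeyValue of the Cell being written sorts before the Cell that was already written.) We also set the number of reducers with job.setNumReduceTasks(200). Pay special attention to one more thing: a few serialization settings have to be added to the configuration by hand: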

conf.set("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization," +
        "org.apache.hadoop.io.serializer.WritableSerialization," +
        "org.apache.hadoop.hbase.mapreduce.KeyValueSerialization," +
        "org.apache.hadoop.hbase.mapreduce.MutationSerialization," +
        "org.apache.hadoop.hbase.mapreduce.ResultSerialization" );

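Without these settings the job fails. We also use an overwrite-on-write strategy, so that each column keeps only the last value of its five-minute window. The time in the rowkey is the timestamp rounded to a multiple of five minutes, while the timestamp of each Cell is the actual time of the data. When the same column under the same rowkey has several values, the newest one is kept.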
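The ordering requirement and the five-minute "newest wins" rule can be sketched without any HBase dependency. The class below is only an illustration and not the code from this post: `Cell` is a hypothetical stand-in for a KeyValue in a single column family, the `device1_` rowkey prefix is invented, and rounding the rowkey time down to the window start is an assumption. It sorts cells by row, then column name, then timestamp with the newest first (the order HBase uses for keys), and keeps the first cell of each column.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CellOrderSketch {
    static final long FIVE_MIN_MS = 5 * 60 * 1000L;

    /** Hypothetical stand-in for a KeyValue in a single column family. */
    static final class Cell {
        final byte[] row;
        final byte[] qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String qualifier, long timestamp, String value) {
            this.row = row.getBytes(StandardCharsets.UTF_8);
            this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Rounds an epoch-millisecond timestamp down to its five-minute boundary. */
    static long bucket(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, FIVE_MIN_MS);
    }

    /** Unsigned lexicographic byte comparison. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Row ascending, then qualifier ascending, then timestamp descending. */
    static final Comparator<Cell> ORDER = (x, y) -> {
        int c = compareBytes(x.row, y.row);
        if (c == 0) {
            c = compareBytes(x.qualifier, y.qualifier);
        }
        if (c == 0) {
            c = Long.compare(y.timestamp, x.timestamp);
        }
        return c;
    };

    /** Sorts the cells and keeps only the newest cell of each (row, qualifier). */
    static List<Cell> sortAndKeepNewest(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(ORDER);
        List<Cell> result = new ArrayList<>();
        Cell previous = null;
        for (Cell cell : sorted) {
            boolean sameColumn = previous != null
                    && compareBytes(previous.row, cell.row) == 0
                    && compareBytes(previous.qualifier, cell.qualifier) == 0;
            if (!sameColumn) {
                result.add(cell); // first cell of a column is the newest one
            }
            previous = cell;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rowkey layout: an id followed by the rounded timestamp.
        String row = "device1_" + bucket(1572736512345L);
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(row, "00000002", 1572736510000L, "a"));
        cells.add(new Cell(row, "00000001", 1572736520000L, "old"));
        cells.add(new Cell(row, "00000001", 1572736590000L, "new"));
        for (Cell cell : sortAndKeepNewest(cells)) {
            System.out.println(new String(cell.row, StandardCharsets.UTF_8) + " "
                    + new String(cell.qualifier, StandardCharsets.UTF_8) + " " + cell.value);
        }
    }
}
```

In a real job the same ordering would have to hold for the KeyValues a reducer emits for one rowkey. Writing them in any other order is what triggers the "Added a key not lexically larger than previous" exception shown earlier.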

Here is my code:

package com.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase;
/**
 */

import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.orc.mapred.OrcStruct;
import org.apache.orc.mapreduce.OrcInputFormat;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.*;
/**
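 * Tested successfully on 2019/11/21.
 *  Six days of data is roughly 1 TB and the job runs for about 2 hours; each server has 40 cores and 12 GB of available memory, and disk usage is negligible.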
 *  
 */
public class FiveMinQuMRT2 {
  
    public static class HiveORCToHFileMapper extends
            Mapper<NullWritable, OrcStruct, ImmutableBytesWritable, KeyValue> {
        public static final byte[] CF = Bytes.toBytes("common");
        String checkSubQuID = "";

        @Override
        public void setup(Context context) throws IOException {
            checkSubQuID
