The previous post analyzed how to use HFileOutputFormat2.configureIncrementalLoad and its drawbacks. As we saw in its source code, configureIncrementalLoad already sets up the reduce side for us, including the map and reduce output value class (KeyValue or Put) and the number of reducers. configureIncrementalLoadMap does none of this, so those settings have to be specified manually when building the job.
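To make "specified manually" concrete, here is a minimal sketch of the driver-side settings involved. It is only a fragment that assumes an already-created Job; the table name is a placeholder, and the exact signature of configureIncrementalLoadMap differs between HBase versions, so treat it as an outline rather than the exact driver used below.
// configureIncrementalLoadMap only wires up HFileOutputFormat2 itself; the rest is on us.
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setReducerClass(KeyValueSortReducer.class);   // HBase's built-in KeyValue-sorting reducer
job.setNumReduceTasks(200);                       // reducer count is also our responsibility
try (Connection connection = ConnectionFactory.createConnection(job.getConfiguration());
     Table table = connection.getTable(TableName.valueOf("my_table"))) {  // placeholder table name
    HFileOutputFormat2.configureIncrementalLoadMap(job, table);
}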
Here we use configureIncrementalLoadMap with KeyValue as the output value class, and set the reducer to org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer, the KeyValue-sorting reducer that ships with HBase. I had originally planned to write my own, but while looking at HFileOutputFormat2.configureIncrementalLoad I noticed that it sets exactly this class as the reducer, so we simply reuse it. Note that with configureIncrementalLoadMap, not only must the rowkeys be sorted, the KeyValues must be sorted as well (in practice this means sorting by column qualifier within the column family, i.e. writing the data in qualifier order); otherwise you hit this error:
Error: java.io.IOException: Added a key not lexically larger than previous. Current cell = xxxxxxxxxx_574747147/common:00000002/1572736500000/Put/vlen=16/seqid=0, lastCell = xxxxxxxxxx_574747147/common:000000001/1572736500000/Put/vlen=16/seqid=0
at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:265)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:994)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:199)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at com.xxx.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase.FiveMinQuMRT2$ToHFileReducer.reduce(FiveMinQuMRT2.java:220)
at com.xxx.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase.FiveMinQuMRT2$ToHFileReducer.reduce(FiveMinQuMRT2.java:211)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
The error occurs because the KeyValue of the cell currently being appended is not lexically larger than the cell appended before it. We also set the number of reducers with job.setNumReduceTasks(200). In particular, note that some serialization settings have to be added to the configuration by hand:
conf.set("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization," +
"org.apache.hadoop.io.serializer.WritableSerialization," +
"org.apache.hadoop.hbase.mapreduce.KeyValueSerialization," +
"org.apache.hadoop.hbase.mapreduce.MutationSerialization," +
"org.apache.hadoop.hbase.mapreduce.ResultSerialization" );
Without them the job fails with an error. We also apply an overwrite-on-write strategy so that each column keeps only the last value of its five-minute window: the data time in the rowkey is a timestamp rounded to a multiple of five minutes, while each Cell's timestamp is the actual event time, so when the same rowkey has multiple values for a cell, only the newest one is kept.
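Before the full listing, here is a minimal sketch of that reduce-side idea. It is not the exact reducer from the code below (the class name is illustrative, and it assumes a single column family as in this job): it deduplicates each qualifier by keeping the newest timestamp and emits the surviving KeyValues in qualifier order, so HFileOutputFormat2 never sees an out-of-order cell. The imports it needs are already part of the listing below.
public static class LatestCellSortReducer
        extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void reduce(ImmutableBytesWritable rowKey, Iterable<KeyValue> cells, Context context)
            throws IOException, InterruptedException {
        // TreeMap keyed by qualifier bytes keeps the columns in lexicographic order.
        TreeMap<byte[], KeyValue> latestByQualifier = new TreeMap<>(Bytes.BYTES_COMPARATOR);
        for (KeyValue kv : cells) {
            byte[] qualifier = CellUtil.cloneQualifier(kv);
            KeyValue seen = latestByQualifier.get(qualifier);
            // Overwrite-on-write: only the cell with the newest timestamp survives.
            if (seen == null || kv.getTimestamp() > seen.getTimestamp()) {
                try {
                    // Copy before keeping a reference, as the stock KeyValueSortReducer also does.
                    latestByQualifier.put(qualifier, kv.clone());
                } catch (CloneNotSupportedException e) {
                    throw new IOException(e);
                }
            }
        }
        // Writing in qualifier order avoids "Added a key not lexically larger than previous".
        for (KeyValue kv : latestByQualifier.values()) {
            context.write(rowKey, kv);
        }
    }
}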
Here is my code:
package com.xxx.xxx.xxx.usepartition.five_min_qu_to_hbase;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.orc.mapred.OrcStruct;
import org.apache.orc.mapreduce.OrcInputFormat;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.*;
/**
 * Passed testing on 2019/11/21.
 * Six days of data is roughly 1 TB; the job ran for about 2 hours on servers with 40 cores and 12 GB of available memory each; extra disk usage was negligible.
*
*/
public class FiveMinQuMRT2 {
public static class HiveORCToHFileMapper extends
Mapper<NullWritable, OrcStruct, ImmutableBytesWritable, KeyValue> {
public static final byte[] CF = Bytes.toBytes("common");
String checkSubQuID = "";
@Override
public void setup(Context context) throws IOException {
checkSubQuID