Generating HFile files with MapReduce is the fastest way to import large amounts of data into HBase.
The process has two parts: generating the HFiles, then loading them into HBase.
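Before either step, the target table must already exist in HBase with the column family the mapper writes to ("f" below). A minimal sketch of creating it through the Java Admin API, assuming the HBase 1.x descriptor classes and a placeholder table name test_table:

Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Admin admin = conn.getAdmin()) {
    // "test_table" is a placeholder; use your own table name
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("test_table"));
    // the column family must match the CF constant used by the mapper ("f")
    desc.addFamily(new HColumnDescriptor("f"));
    admin.createTable(desc);
}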
Part 1: Generating the HFiles
1. The driver: ConvertToHFiles.java
public class ConvertToHFiles extends Configured implements Tool {
    private static final Log LOG = LogFactory.getLog(ConvertToHFiles.class);

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ConvertToHFiles(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        try {
            Configuration conf = HBaseConfiguration.create();
            conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
            conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
            String inputPath = args[0];
            String outputPath = args[1];
            final TableName tableName = TableName.valueOf(args[2]);

            // create the HBase connection
            Connection connection = ConnectionFactory.createConnection(conf);
            Table table = connection.getTable(tableName);

            // create the MapReduce job
            Job job = Job.getInstance(conf, "ConvertToHFiles: Convert File to HFiles");
            job.setInputFormatClass(TextInputFormat.class);
            job.setJarByClass(ConvertToHFiles.class);
            job.setMapperClass(ConvertToHFilesMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);

            // configureIncrementalLoad sets the output format, reducer and
            // total-order partitioner according to the table's region boundaries
            HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(tableName));

            FileInputFormat.setInputPaths(job, inputPath);
            HFileOutputFormat2.setOutputPath(job, new Path(outputPath));

            if (!job.waitForCompletion(true)) {
                LOG.error("Failure");
            } else {
                LOG.info("Success");
                return 0;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 1;
    }
}
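The driver takes three arguments: the HDFS input path, the HFile output path and the target table name. It is launched in the usual way with hadoop jar; for example (jar name, package and paths below are placeholders):

hadoop jar bulkload.jar ConvertToHFiles /user/test/input /user/test/out test_table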
2. The mapper: ConvertToHFilesMapper.java
public class ConvertToHFilesMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Cell> {
    public static final byte[] CF = Bytes.toBytes("f");
    public static final ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
    static ArrayList<byte[]> qualifiers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        context.getCounter("Convert", "mapper").increment(1);
        // column qualifiers; this example uses three columns
        byte[] name = Bytes.toBytes("name");
        byte[] xxx = Bytes.toBytes("xxx");
        byte[] score = Bytes.toBytes("score");
        qualifiers.add(name);
        qualifiers.add(xxx);
        qualifiers.add(score);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // input fields are comma-separated
        String[] line = value.toString().split(",");
        // the MD5 hash of the first field becomes the row key
        byte[] rowKeyBytes = DigestUtils.md5Hex(line[0]).getBytes();
        rowKey.set(rowKeyBytes);
        context.getCounter("Convert", line[2]).increment(1);
        // emit one KeyValue per remaining field
        for (int i = 0; i < line.length - 1; i++) {
            KeyValue kv = new KeyValue(rowKeyBytes, CF, qualifiers.get(i), Bytes.toBytes(line[i + 1]));
            context.write(rowKey, kv);
        }
    }
}
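For reference, the mapper assumes comma-separated input lines with four fields, such as the made-up record below: the first field is MD5-hashed into the row key, and the remaining fields fill the name, xxx and score qualifiers of family f.

1001,Alice,something,95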
After the job finishes, the output directory (out here) contains a _SUCCESS marker and one folder per column family; the HFile files are inside those folders.
The HFile files under the column-family folder look like this:
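For example, with an illustrative output path of /user/test/out and column family f, the layout is roughly:

/user/test/out/_SUCCESS
/user/test/out/f/<generated HFile name>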
Part 2: Loading the generated HFiles into HBase
public class HFile2HBase {
    public static void main(String[] args) {
        String table_name = args[0];
        String output_dir = args[1];

        // HBase client configuration
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "192.168.x.xx");
        conf.set("hbase.metrics.showTableName", "false");
        Path dir = new Path(output_dir);

        // bulk load the generated HFiles into HBase
        try {
            Connection conn = ConnectionFactory.createConnection(conf);
            // get the table
            Table table = conn.getTable(TableName.valueOf(table_name));
            // get the region locator
            RegionLocator regionLocator = conn.getRegionLocator(TableName.valueOf(table_name));
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // run the bulk load
            loader.doBulkLoad(dir, conn.getAdmin(), table, regionLocator);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
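As an aside, the same load can usually be done without writing any code, using the bulk-load tool that ships with HBase (the class and package below are the HBase 1.x form; they differ in other versions, and the paths/table name are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/test/out test_table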
And that's it; go check the data in HBase.
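For a quick check you can scan the table, either from the HBase shell (scan 'test_table') or with a short Java sketch like this one (the table name is a placeholder):

try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = conn.getTable(TableName.valueOf("test_table"));
     ResultScanner scanner = table.getScanner(new Scan())) {
    for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()) + " -> "
                + Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("name"))));
    }
}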
One thing I ran into along the way: several blog posts claim that HFile bulk loading is only suitable for the initial import, i.e. when the table is empty (or is emptied before each load). However, I bulk-loaded a second, different batch of HFiles into the same table and it also succeeded, with the new data added alongside the old. I am not sure why those posts say otherwise; perhaps the behavior changed in newer HBase versions, and as far as I can tell LoadIncrementalHFiles is meant to load into existing tables whether or not they already hold data, but this still needs to be confirmed.