Importing Data into HBase with MapReduce

maclaren001

Article link: https://blog.csdn.net/maclaren001/article/details/44115015

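In day-to-day development you may need to write a MapReduce job that reads data from HDFS and imports it into HBase. There are two common ways to do this, described in turn below.

Method 1:

Set the job's OutputFormatClass to TableOutputFormat, build Put objects in the mapper, and declare Put as the OutputValueClass.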

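		// ConfSource is a project-specific helper (not shown) that returns an HBase-aware Configuration
		Configuration conf = ConfSource.getHBaseConf();
		// Job.getInstance replaces the deprecated Job constructor
		Job j = Job.getInstance(conf, "Import table " + tbName + " into hbase table:bigtable from " + path);
		j.setJarByClass(Sync2HBaseJob.class);
		j.setMapperClass(Sync2HBaseMapper.class);
		// Write each Put emitted by the mapper straight into the target table
		j.setOutputFormatClass(TableOutputFormat.class);
		j.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "bigtable");
		j.setOutputKeyClass(ImmutableBytesWritable.class);
		j.setOutputValueClass(Put.class);
		// Map-only job: no reduce phase
		j.setNumReduceTasks(0);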
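The source does not show Sync2HBaseMapper, so the following is only a rough sketch of the parsing step such a mapper might perform before building a Put. The tab-separated input layout, the `LineToCells` class, and the qualifiers `name` and `age` are all assumptions made for illustration. It uses only the JDK; the HBase calls a real mapper would make are described in comments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineToCells {
    // Hypothetical column qualifiers; the source does not show the real schema.
    public static final String[] QUALIFIERS = {"name", "age"};

    /** A row key plus its qualifier -> value pairs, parsed from one input line. */
    public static final class Row {
        public final String rowKey;
        public final Map<String, String> cells = new LinkedHashMap<>();

        Row(String rowKey) {
            this.rowKey = rowKey;
        }
    }

    /**
     * Parses "rowKey<TAB>value1<TAB>value2" (an assumed layout).
     * Returns null for malformed lines so the mapper can skip them.
     */
    public static Row parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != QUALIFIERS.length + 1 || fields[0].isEmpty()) {
            return null;
        }
        Row row = new Row(fields[0]);
        for (int i = 0; i < QUALIFIERS.length; i++) {
            row.cells.put(QUALIFIERS[i], fields[i + 1]);
        }
        return row;
    }

    public static void main(String[] args) {
        Row row = parse("user001\tAlice\t30");
        // In a real mapper, the row key would become new Put(rowKeyBytes), each cell
        // would be added with put.addColumn(family, qualifier, value) (put.add in
        // older HBase releases), and the mapper would emit
        // (ImmutableBytesWritable(rowKeyBytes), put) through context.write.
        System.out.println(row.rowKey + " " + row.cells);
    }
}
```

Returning null for malformed lines keeps the decision to skip or count bad records in the mapper rather than in the parser.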

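However, this approach performs poorly when the data volume is large and the cluster has many nodes. Task output is written to HBase continuously while the job runs (by the map tasks here, since the job above sets zero reducers), which generates heavy network traffic, and transaction control is also difficult. This method is therefore only suitable for small data volumes.

Method 2: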

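Use the HFileOutputFormat2 class to generate HFiles. HFile is the file format in which HBase stores KeyValue data, a binary file format on Hadoop. A StoreFile is just a lightweight wrapper around an HFile, so underneath every StoreFile is an HFile. The generated HFiles are written to a specified HDFS directory, and the completebulkload command then loads them into HBase quickly.

Compared with the MapReduce run, the time taken by completebulkload is almost negligible. On a machine with 16 cores and 128 GB of memory, the MapReduce job took 20 minutes for a 600 MB data source, while loading the result into HBase with completebulkload took only a few seconds. Note, however, that after completebulkload runs, the HFiles are automatically removed from the HDFS output directory, so back them up beforehand if you need to keep them.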
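As a rough sketch of the backup and load steps, the helper below assembles the two command lines and can run them through ProcessBuilder. The command shapes `hadoop fs -cp <src> <dst>` and `hadoop jar <hbase-server jar> completebulkload <hfile dir> <table>` follow the standard Hadoop and HBase tools. The class name `BulkLoadCommands`, the paths, the jar name, and the table name are placeholders, and the exact jar and classpath setup depend on the installed HBase version.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BulkLoadCommands {
    /** Copies the HFile directory elsewhere on HDFS before loading. */
    public static List<String> backupCommand(String hfileDir, String backupDir) {
        return Arrays.asList("hadoop", "fs", "-cp", hfileDir, backupDir);
    }

    /** Invokes the completebulkload tool shipped in the HBase server jar. */
    public static List<String> bulkLoadCommand(String hbaseServerJar, String hfileDir, String table) {
        return Arrays.asList("hadoop", "jar", hbaseServerJar, "completebulkload", hfileDir, table);
    }

    /** Runs a command, forwarding its output, and returns its exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // All paths, the jar name, and the table name below are placeholders.
        List<String> backup = backupCommand("/tmp/hfiles", "/tmp/hfiles_backup");
        List<String> load = bulkLoadCommand("hbase-server-VERSION.jar", "/tmp/hfiles", "bigtable");

        // Only print here; call run(backup) and then run(load) on a node
        // where Hadoop and HBase are installed and configured.
        System.out.println(String.join(" ", backup));
        System.out.println(String.join(" ", load));
    }
}
```

Running the backup first and checking its exit code before starting the load avoids losing the only copy of the HFiles if the copy fails.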

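		// Job that writes HFiles for a later bulk load instead of writing to HBase directly
		Configuration conf = ConfSource.getHBaseConf();
		Job job = Job.getInstance(conf, "Import into hbase table "
				+ confClz.getHbaseTable() + " from "
				+ confClz.getDownloadPath());
		job.setJarByClass(Sync2HBaseJobViaHFile.class);
		FileInputFormat.setInputPaths(job, confClz.getDownloadPath());
		job.setMapperClass(Sync2HBaseMapper.class);
		// Set the map output classes before configureIncrementalLoad,
		// which picks the sort reducer based on the map output value class
		job.setMapOutputKeyClass(ImmutableBytesWritable.class);
		job.setMapOutputValueClass(Put.class);
		job.setReducerClass(PutSortReducer.class);
		// The HFiles are written under this HDFS directory
		Path outputDir = new Path(confClz.getHfilePath());
		FileOutputFormat.setOutputPath(job, outputDir);
		// Sets the output format, partitioner and reducer count from the table's regions
		HTable table = new HTable(conf, confClz.getHbaseTable());
		HFileOutputFormat2.configureIncrementalLoad(job, table);
		TableMapReduceUtil.addDependencyJars(job);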
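The job above goes through PutSortReducer because an HFile stores its cells in sorted order, and HBase compares row keys as unsigned bytes in lexicographic order. The JDK-only sketch below illustrates that ordering on a few made-up row keys. The class name `RowKeyOrder` and the sample keys are hypothetical, and the comparator is a stand-in for illustration rather than HBase's own implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RowKeyOrder {
    /** Compares two byte arrays as unsigned bytes, lexicographically. */
    public static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat the byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        // When one key is a prefix of the other, the shorter key sorts first
        return Integer.compare(a.length, b.length);
    };

    /** Sorts string row keys by their UTF-8 bytes using UNSIGNED_LEX. */
    public static List<String> sortKeys(List<String> keys) {
        List<byte[]> raw = new ArrayList<>();
        for (String key : keys) {
            raw.add(key.getBytes(StandardCharsets.UTF_8));
        }
        raw.sort(UNSIGNED_LEX);
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : raw) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // prints [row1, row10, row2]
        System.out.println(sortKeys(Arrays.asList("row10", "row2", "row1")));
    }
}
```

Because the comparison is byte-wise rather than numeric, `row10` sorts before `row2`; zero-padding numeric parts of a row key is a common way to get numeric order.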
