MapReduce on HBase

最新推荐文章于 2024-05-27 16:07:15 发布

Wang_AI

最新推荐文章于 2024-05-27 16:07:15 发布

阅读量1.1k

点赞数

本文链接：https://blog.csdn.net/Xw_Classmate/article/details/50916674

版权

HBase 专栏收录该内容

7 篇文章 0 订阅

订阅专栏

引言

HBase跟Hadoop的无缝集成使得使用MapReduce对HBase的数据进行分布式计算非常方便，本文将以前面的blog示例，介绍HBase下MapReduce开发要点。很好理解本文前提是你对Hadoop MapReduce有一定的了解。

HBase MapReduce核心类介绍

首先一起来回顾下MapReduce的基本编程模型，

可以看到最基本的是通过Mapper和Reducer来处理KV对，Mapper的输出经Shuffle及Sort后变为Reducer的输入。除了Mapper和Reducer外，另外两个重要的概念是InputFormat和OutputFormat，定义了Map-Reduce的输入和输出相关的东西。HBase通过对这些类的扩展（继承）来方便MapReduce任务来读写HTable中的数据。

实例分析
我们还是以最初的blog例子来进行示例分析，业务需求是这样：找到具有相同兴趣的人，我们简单定义为如果author之间article的tag相同，则认为两者有相同兴趣，将分析结果保存到HBase。除了上面介绍的blog表外，我们新增一张表tag_friend，RowKey为tag，Value为authors,大概就下面这样。

我们省略了一些跟分析无关的Column数据，上面的数据按前面描述的业务需求经过MapReduce分析，应该得到下面的结果

实际的运算过程分析如

代码实现

有了上面的分析，代码实现就比较简单了。只需以下几步定义Mapper类继承TableMapper，map的输入输出KV跟上面的分析一致。

附上源代码：

package hbase.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class FindFriend {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		conf = HBaseConfiguration.create(conf);
		Job job = Job.getInstance(conf, "HBase_FindFriend");
		job.setJarByClass(FindFriend.class);

		Scan scan = new Scan();
		scan.setCaching(100);// 设置缓存
		scan.addColumn(Bytes.toBytes("author"), Bytes.toBytes("nickname"));
		scan.addColumn(Bytes.toBytes("article"), Bytes.toBytes("tags"));

		TableMapReduceUtil
				.initTableMapperJob("blog", scan, MyMapper.class,
						ImmutableBytesWritable.class,
						ImmutableBytesWritable.class, job);
		TableMapReduceUtil.initTableReducerJob("tag_friend", MyReducer.class,
				job);
		
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

	// TableMapper的KEYIN固定为ImmutableBytesWritable类型，VALUEIN固定为Result类型
	// 自定义TableMapper的KEYOUT为ImmutableBytesWritable类型，VALUEOUT为ImmutableBytesWritable类型
	public static class MyMapper extends
			TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
		ImmutableBytesWritable key = null;
		ImmutableBytesWritable val = null;
		String[] tags = null;

		@Override
		protected void map(ImmutableBytesWritable key, Result value,
				Context context) throws IOException, InterruptedException {
			for (Cell cell : value.listCells()) {
				if ("author".equals(Bytes.toString(cell.getFamilyArray()))
						&& "nickname".equals(Bytes.toString(cell
								.getQualifierArray()))) {
					val = new ImmutableBytesWritable(cell.getValueArray());
				}
				if ("article".equals(Bytes.toString(cell.getFamilyArray()))
						&& "tags".equals(Bytes.toString(cell
								.getQualifierArray()))) {
					tags = Bytes.toString(cell.getValueArray()).split(",");
				}
			}

			for (String tag : tags) {
				key = new ImmutableBytesWritable(Bytes.toBytes(tag
						.toLowerCase()));
				context.write(key, val);
			}

		}
	}

	// 自定义KEYIN、VALUEIN和KEYOUT都为ImmutableBytesWritable类型
	// TableReducer的VALUEOUT固定为Mutation类型
	// (hadoop,<yedu,heyun>)
	public static class MyReducer
			extends
			TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {
		@Override
		protected void reduce(ImmutableBytesWritable key,
				Iterable<ImmutableBytesWritable> values, Context context)
				throws IOException, InterruptedException {
			String friends = "";
			for (ImmutableBytesWritable val : values) {
				friends += (friends.length() > 0 ? "," : "")
						+ Bytes.toString(val.get());
			}

			Put put = new Put(key.get());
			put.add(Bytes.toBytes("person"), Bytes.toBytes("nicknames"),
					Bytes.toBytes(friends));

			context.write(key, put);
		}
	}

}

Wang_AI

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
MapReduce on HBase

引言HBase跟Hadoop的无缝集成使得使用MapReduce对HBase的数据进行分布式计算非常方便，本文将以前面的blog示例，介绍HBase下MapReduce开发要点。很好理解本文前提是你对Hadoop MapReduce有一定的了解。HBase MapReduce核心类介绍首先一起来回顾下MapReduce的基本编程模型，可以看到最基本的是通过Mappe
复制链接

扫一扫