如何用MapReduce程序操作hbase

最新推荐文章于 2022-06-09 10:59:03 发布

liuyuan185442111

最新推荐文章于 2022-06-09 10:59:03 发布

阅读量4k

点赞数

分类专栏： Old Hadoop 文章标签： hbase

本文链接：https://blog.csdn.net/liuyuan185442111/article/details/45306193

版权

Old Hadoop 专栏收录该内容

20 篇文章 0 订阅

订阅专栏

先看一个标准的hbase作为数据读取源和输出目标的样例：

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "job name ");
job.setJarByClass(test.class);
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob(inputTable, scan, mapper.class, Writable.class, Writable.class, job);
TableMapReduceUtil.initTableReducerJob(outputTable, reducer.class, job);
job.waitForCompletion(true);

和普通的mr程序不同的是，不再用job.setMapperClass()和job.setReducerClass()来设置mapper和reducer，而用TableMapReduceUtil的initTableMapperJob和initTableReducerJob方法来实现。此处的TableMapReduceUtil是hadoop.hbase.mapreduce包中的，而不是hadoop.hbase.mapred包中的。
数据输入源是hbase的inputTable表，执行mapper.class进行map过程，输出的key/value类型是 ImmutableBytesWritable和Put类型，最后一个参数是作业对象。需要指出的是需要声明一个扫描读入对象scan，进行表扫描读取数据用，其中scan可以配置参数。
数据输出目标是hbase的outputTable表，输出执行的reduce过程是reducer.class类，操作的作业目标是job。与map比缺少输出类型的标注，因为他们不是必要的，看过源代码就知道mapreduce的TableRecordWriter中write(key,value) 方法中，key值是没有用到的，value只能是Put或者Delete两种类型，write方法会自行判断并不用用户指明。

mapper类从hbase读取数据，所以输入的

public class mapper extends TableMapper<KEYOUT, VALUEOUT> {
    public void map(Writable key, Writable value, Context context)
    throws IOException, InterruptedException {
        //mapper逻辑
        context.write(key, value);
    }
}

mapper继承的是TableMapper类，后边跟的两个泛型参数指定mapper输出的数据类型，该类型必须继承自Writable类，例如可能用到的put和delete就可以。需要注意的是要和initTableMapperJob 方法指定的数据类型一致。该过程会自动从指定hbase表内一行一行读取数据进行处理。

reducer类将数据写入hbase，所以输出的

public class reducer extends TableReducer<KEYIN, VALUEIN, KEYOUT> {
    public void reduce(Text key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException {
        //reducer逻辑
        context.write(null, put or delete);
    }
}

reducer继承的是TableReducer类，后边指定三个泛型参数，前两个必须对应map过程的输出key/value类型，第三个是 The type of the output key，write的时候可以把key写成IntWritable什么的都行，它是不必要的。这样reducer输出的数据会自动插入outputTable指定的表内。

TableMapper和TableReducer的本质就是为了简化一下书写代码，因为传入的4个泛型参数里都会有固定的参数类型，所以是Mapper和Reducer的简化版本，本质他们没有任何区别。源码如下：

public abstract class TableMapper<KEYOUT, VALUEOUT>
extends Mapper<ImmutableBytesWritable, Result, KEYOUT, VALUEOUT> {
}

public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
extends Reducer<KEYIN, VALUEIN, KEYOUT, Writable> {
}

封装了一层确实方便多了，但也多了很多局限性，就不能在map里写hbase吗？
我他么试了一下午，约5个小时，就想在map里读hdfs写hbase，莫名其妙的各种问题，逻辑上应该没有错，跟着别人的文章做的。最后还是通过IdentityTableReducer这个类实现了，what's a fucking afternoon!

官方对IdentityTableReducer的说明是：Convenience class that simply writes all values (which must be Put or Delete instances) passed to it out to the configured HBase table.
这是一个工具类，将map输出的value（只能是Put或Delete）pass给HBase。看例子：

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;

public class WordCount
{
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, Put>
    {
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException
        {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens())
            {
                word.set(itr.nextToken());
                Put putrow = new Put(word.toString().getBytes());
                putrow.add("info".getBytes(), "name".getBytes(),"iamvalue".getBytes());
                context.write(word, putrow);
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "hdfs to hbase");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Put.class);//important
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TableMapReduceUtil.initTableReducerJob("test", IdentityTableReducer.class, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

无论是什么方法吧，总算可以运行了！
MapReduce和HBase结合，似乎是这样一种框架：map读HBase，reduce写HBase。使用IdentityTableReducer就是处于这样一种框架之内。

运行操作HBase的MapReduce程序的第2种方式：HADOOP_CLASSPATH的设置

在我《HBase操作》一文中提到了运行操作HBase的MapReduce程序的两种方式，现在说明下另一种方式。
打开hadoop/etc/hadoop/hadoop-env.sh，在设置HADOOP_CLASSPATH的后面添加下面的语句，即将hbase的jar包导入：

for f in /home/laxe/apple/hbase/lib/*.jar; do
    if [ "$HADOOP_CLASSPATH" ]; then
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
    else
        export HADOOP_CLASSPATH=$f
    fi
done

然后就可以用最初的运行MapReduce的方式来运行了。

参考文献

hbase MapReduce程序样例入门
http://www.cnblogs.com/end/archive/2012/12/12/2814819.html
HBase之读取MapReduce数据写入HBase
http://www.linuxidc.com/Linux/2015-03/114670p6.htm
HBase之读取HBase数据写入HDFS
http://www.linuxidc.com/Linux/2015-03/114670p7.htm
HBase入门基础教程
http://www.linuxidc.com/Linux/2015-03/114670.htm
HBase 5种写入数据方式
http://www.tuicool.com/articles/aiQVvm

liuyuan185442111

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
如何用MapReduce程序操作hbase

先看一个标准的hbase作为数据读取源和输出目标的样例：Configuration conf = HBaseConfiguration.create();Job job = new Job(conf, "job name ");job.setJarByClass(test.class);Scan scan = new Scan();TableMapReduceUtil.initTableMa
复制链接

扫一扫

专栏目录