Integrating HBase with MapReduce

Latest recommended article published on 2024-01-21 03:36:23

To_Drill

Tags: HBase MapReduce HDFS RDBMS Big Data

Article link: https://blog.csdn.net/zzf1510711060/article/details/84729116

I. If the following error appears when calling the HBase API from a MapReduce job

Error: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration

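Solution 1:

Add the HBase jars when submitting the job to YARN. Note that -libjars takes a comma-separated list of jars, must come before the job's own arguments, and only takes effect when the driver parses generic options (for example via ToolRunner):

export libpath=/path to hbase jar

hadoop jar myjarname.jar <main-class> -libjars ${libpath} arguments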

Solution 2:

Add hbase-site.xml to $HADOOP_HOME/conf and copy all the jars in HBase's lib directory into Hadoop's lib directory.

Solution 3:

Put the HBase jars on the Hadoop classpath by adding the following to hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/soft/hbase/lib/*

Unfortunately, I tried every one of the approaches above and they all failed!

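There is an official way to integrate HBase with MapReduce. If you write HBase operations directly inside a Mapper or Reducer class, the error above is thrown over and over, and I have not found a fix yet. My guess at the cause is that when the ApplicationMaster runs the map and reduce tasks in containers, the HBase jars are not shipped along with them. The reason given in the official documentation, however, is that by default MapReduce jobs deployed to a MapReduce cluster have no access to the HBase configuration under $HBASE_CONF_DIR or to the HBase classes.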
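When chasing this kind of ClassNotFoundException, it can help to check from inside the JVM whether the class is visible at all. Below is a minimal sketch in plain Java. ClasspathCheck and isLoadable are hypothetical names of mine, not part of HBase or Hadoop, and the check only covers the class loader of the JVM it runs in (the client here, not the task containers unless you call it from a task).

```java
// Hypothetical diagnostic helper, not part of HBase or Hadoop.
final class ClasspathCheck {
    private ClasspathCheck() {}

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String hbaseClass = "org.apache.hadoop.hbase.HBaseConfiguration";
        System.out.println(hbaseClass + " loadable: " + isLoadable(hbaseClass));
        // Printing the effective classpath shows which jars the JVM was started with
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Calling isLoadable from a Mapper's setup method and logging the result would show whether the task container, rather than the client, is the side that is missing the jars.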

II. The official integration approach

Notes:

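1. HBase can serve as a data source or a data sink for MapReduce.

2. When writing a MapReduce job that reads from or writes to HBase, it is advisable to subclass TableMapper and/or TableReducer.

3. mapreduce.job.maps should be set to a number greater than the number of regions in the HBase table.

4. HBase sorts the data when it is inserted, so there is no need to sort again in the Reduce step.

5. If the job logic needs no Reduce step, the number of reducers can be set to 0.

6. Many of the dependency jars of HBase and Hive conflict with Spark's jars (same name, different version).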
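Note 4 is easier to picture with a toy model. HBase keeps rows ordered by comparing row keys byte by byte, so the order in which Puts arrive does not matter. The sketch below imitates that with a TreeMap in plain Java; RowKeyOrderSketch and its methods are names I made up for illustration, and this is of course not how HBase stores data internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy model only: imitates row-key ordering, not HBase's storage engine.
final class RowKeyOrderSketch {
    private RowKeyOrderSketch() {}

    /** Unsigned, byte-by-byte lexicographic comparison of two row keys. */
    static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // the shorter key sorts first on a shared prefix
    }

    /** Inserts the keys in the given order and returns them in stored order. */
    static List<String> insertAndList(String... rowKeys) {
        Comparator<byte[]> byteOrder = RowKeyOrderSketch::compareRowKeys;
        TreeMap<byte[], String> table = new TreeMap<>(byteOrder);
        for (String key : rowKeys) {
            table.put(key.getBytes(StandardCharsets.UTF_8), key);
        }
        return new ArrayList<>(table.values());
    }

    public static void main(String[] args) {
        // Arrival order is row10, row2, row1; stored order follows the key bytes
        System.out.println(insertAndList("row10", "row2", "row1"));
    }
}
```

The comparison is on bytes, not numbers, which is why "row10" lands before "row2".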

Official example 1: copy the data in one HBase table to another table

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleReadWrite");  // replaces the deprecated new Job(...)
job.setJarByClass(MyReadWriteJob.class);

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which is bad for MR jobs
scan.setCacheBlocks(false);  // must be false for MR jobs
// other Scan attributes can be set here as well

TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance that controls column family and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key class
  null,             // mapper output value class
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

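What TableMapReduceUtil does here is set the output format class to TableOutputFormat, set several parameters on the configuration (for example TableOutputFormat.OUTPUT_TABLE), and set the reducer output key to ImmutableBytesWritable and the reducer output value to Writable. All of these could be set by hand on the conf; TableMapReduceUtil just makes it easier.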

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example simply copies the row from the source table
    context.write(row, resultToPut(row, value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    // rawCells() and Put.add(Cell) replace the deprecated raw()/KeyValue calls
    for (Cell cell : result.rawCells()) {
      put.add(cell);
    }
    return put;
  }
}

Writing to multiple tables: MultiTableOutputFormat

Official example 2: count the values of one column in a table and write the result to another table

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);     

Scan scan = new Scan();
scan.setCaching(500);        
scan.setCacheBlocks(false);  


TableMapReduceUtil.initTableMapperJob(
  sourceTable,        
  scan,               
  MyMapper.class,     
  Text.class,         
  IntWritable.class,  
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,        
  MyTableReducer.class,    
  job);
job.setNumReduceTasks(1);   // use a single reduce task

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

Mapper:

public static class MyMapper extends TableMapper<Text, IntWritable>  {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    text.set(val);     
    context.write(text, ONE);
  }
}

Reducer:

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable>  {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.addColumn(CF, COUNT, Bytes.toBytes(i));  // addColumn replaces the deprecated Put.add(family, qualifier, value)

    context.write(null, put);
  }
}
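To see what the Mapper and Reducer above compute together, here is a rough plain-Java imitation of the three phases (map, shuffle, reduce) on an in-memory list. ColumnCountSketch and countValues are hypothetical names, and the input list stands in for the values of cf:attr1 read from the source table; no Hadoop or HBase classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy imitation of the job's data flow, not Hadoop code.
final class ColumnCountSketch {
    private ColumnCountSketch() {}

    static Map<String, Integer> countValues(List<String> columnValues) {
        // "map" phase: emit (value, 1) for every row;
        // "shuffle" phase: group the emitted ones by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String value : columnValues) {
            grouped.computeIfAbsent(value, k -> new ArrayList<>()).add(1);
        }
        // "reduce" phase: sum the ones per key, as MyTableReducer does
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countValues(Arrays.asList("a", "b", "a")));
    }
}
```

In the real job each entry of the final map becomes one Put: the key is the row key and the sum goes into cf:count.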

Official example 3: count the values of one column in a table and write the result to HDFS

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     

Scan scan = new Scan();
scan.setCaching(500);        
scan.setCacheBlocks(false);  

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        
  scan,               
  MyMapper.class,     
  Text.class,         
  IntWritable.class,  
  job);
job.setReducerClass(MyReducer.class);    
job.setNumReduceTasks(1);    
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust the directory as needed
boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

The Mapper here is the same as in example 2. The Reducer, on the other hand, is a "generic" Reducer rather than one that extends TableReducer and emits Puts.

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>  {

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}

Official example 4: write the count results to an RDBMS

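You can open the connection in the setup lifecycle method and close it in cleanup. Keep in mind that the more reducers the job runs, the more connections get opened.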

public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable>  {

  private Connection c = null;

  public void setup(Context context) {
    // open the database connection
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // in this example the key is a Text
  }

  public void cleanup(Context context) {
    // close the database connection
  }

}
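To make the point about connection counts concrete, here is a toy model in plain Java. FakeConnection and FakeReducer are made-up stand-ins, not Hadoop or JDBC classes; the sketch only mimics the setup/cleanup lifecycle and assumes all reduce tasks run at the same time, to show that each task holds its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: FakeConnection and FakeReducer are hypothetical stand-ins.
final class ReducerLifecycleSketch {
    private ReducerLifecycleSketch() {}

    /** Stand-in for a database connection that tracks how many are open. */
    static final class FakeConnection implements AutoCloseable {
        private final AtomicInteger open;

        FakeConnection(AtomicInteger open) {
            this.open = open;
            open.incrementAndGet();
        }

        @Override
        public void close() {
            open.decrementAndGet();
        }
    }

    /** Stand-in for one reduce task with the setup/cleanup lifecycle. */
    static final class FakeReducer {
        private final AtomicInteger open;
        private FakeConnection c;

        FakeReducer(AtomicInteger open) {
            this.open = open;
        }

        void setup() {                 // called once per task, before any reduce()
            c = new FakeConnection(open);
        }

        void cleanup() {               // called once per task, after the last reduce()
            if (c != null) {
                c.close();
                c = null;
            }
        }
    }

    /** Returns {connections open while the tasks run, connections open after cleanup}. */
    static int[] simulate(int numReducers) {
        AtomicInteger open = new AtomicInteger();
        List<FakeReducer> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            FakeReducer reducer = new FakeReducer(open);
            reducer.setup();
            reducers.add(reducer);
        }
        int whileRunning = open.get();
        for (FakeReducer reducer : reducers) {
            reducer.cleanup();
        }
        return new int[] {whileRunning, open.get()};
    }

    public static void main(String[] args) {
        int[] result = simulate(4);
        System.out.println("open while running: " + result[0] + ", open after cleanup: " + result[1]);
    }
}
```

If the database caps the number of connections, the number of reduce tasks is the knob to watch.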

Official example 5: access other HBase tables from an MR job

By creating a Table instance in the Mapper's setup method, a MapReduce job can access other HBase tables, for example as lookup tables.

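public class MyMapper extends TableMapper<Text, LongWritable> {
  private Connection connection;
  private Table myOtherTable;

  @Override
  public void setup(Context context) throws IOException {
    // create a connection to the cluster here and keep it,
    // or reuse the connection of an already existing table
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
  }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // process the Result...
    // use 'myOtherTable' for lookups
  }

  @Override
  public void cleanup(Context context) throws IOException {
    // release the table and the connection when the task finishes
    myOtherTable.close();
    connection.close();
  }
}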

Example 6: write a file on HDFS into an HBase table (the Mapper does not extend TableMapper)

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "master,slave01,slave02");
        Job job = Job.getInstance(config, CountryToHBase.class.getName());
        job.setJarByClass(MyMapper.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Put.class);
        // map-only job, so set the number of reduce tasks to 0
        job.setNumReduceTasks(0);
        // input path: the HDFS path is hard-coded here instead of taken from the arguments
        FileInputFormat.setInputPaths(job, new Path("/output/weather/country_weather/part-r-00000"));
        // create the target table with one column family, unless it already exists;
        // try-with-resources closes the admin and the connection afterwards
        TableName tb_name = TableName.valueOf("country_weather");
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {
            if (!admin.tableExists(tb_name)) {
                HTableDescriptor desc = new HTableDescriptor(tb_name);
                desc.addFamily(new HColumnDescriptor("field"));
                admin.createTable(desc);
            }
        }
        // driver setup: write to the table through TableOutputFormat, no reducer class
        TableMapReduceUtil.initTableReducerJob("country_weather", null, job);
        TableMapReduceUtil.addDependencyJars(job);
        // report success or failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    private static class MyMapper extends Mapper<LongWritable, Text, NullWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] splits = value.toString().split(" ");
            Put put = new Put( key.toString().getBytes());
            put.addColumn("field".getBytes(),"province".getBytes(),splits[0].getBytes());
            put.addColumn("field".getBytes(),"adcode".getBytes(),splits[1].getBytes());
            put.addColumn("field".getBytes(),"longitude".getBytes(),splits[2].getBytes());
            put.addColumn("field".getBytes(),"latitude".getBytes(),splits[3].getBytes());
            put.addColumn("field".getBytes(),"year".getBytes(),splits[4].getBytes());
            put.addColumn("field".getBytes(),"month".getBytes(),splits[5].getBytes());
            put.addColumn("field".getBytes(),"tem".getBytes(),splits[6].getBytes());
            put.addColumn("field".getBytes(),"rhu".getBytes(),splits[7].getBytes());
            put.addColumn("field".getBytes(),"ssd".getBytes(),splits[8].getBytes());
            put.addColumn("field".getBytes(),"pre".getBytes(),splits[9].getBytes());
            context.write(NullWritable.get(), put);
        }
    }
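One thing to watch in the Mapper above: it indexes splits[0] through splits[9] directly, so a line with fewer than ten space-separated fields throws ArrayIndexOutOfBoundsException and fails the task. A possible guard, sketched in plain Java so it can be tried without a cluster, is shown below. WeatherLineSketch and parse are hypothetical names, and the sample line in main is invented for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: validates a line before any Put is built from it.
final class WeatherLineSketch {
    private WeatherLineSketch() {}

    // Column qualifiers in the order the Mapper reads them from splits[0..9]
    static final String[] QUALIFIERS = {
        "province", "adcode", "longitude", "latitude", "year",
        "month", "tem", "rhu", "ssd", "pre"
    };

    /** Returns qualifier -> value, or an empty map if the line has too few fields. */
    static Map<String, String> parse(String line) {
        String[] splits = line.split(" ");
        if (splits.length < QUALIFIERS.length) {
            return Collections.emptyMap();
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < QUALIFIERS.length; i++) {
            columns.put(QUALIFIERS[i], splits[i]);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up sample line with ten fields
        System.out.println(parse("Hubei 420000 114.3 30.6 2018 12 5.1 70 3.2 12.0"));
        // A malformed line yields an empty map instead of an exception
        System.out.println(parse("too short"));
    }
}
```

Inside the Mapper, an empty result could simply be skipped (or counted with a job counter) before any Put is created.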

An alternative API to MapReduce: Cascading

Under the hood it still uses MapReduce, but it lets you write the MapReduce code in a simplified way.

The example below writes data into HBase; data can be read from HBase in the same way.

// read data from the default filesystem
// emits two fields: "offset" and "line"
Tap source = new Hfs( new TextLine(), inputFileLhs );

// store data in an HBase cluster
// accepts fields "num", "lower", and "upper"
// automatically scopes incoming fields to their proper family name, "left" or "right"
Fields keyFields = new Fields( "num" );
String[] familyNames = {"left", "right"};
Fields[] valueFields = new Fields[] {new Fields( "lower" ), new Fields( "upper" ) };
Tap hBaseTap = new HBaseTap( "multitable", new HBaseScheme( keyFields, familyNames, valueFields ), SinkMode.REPLACE );

// a simple pipe assembly to parse the input into fields
// a real application would likely chain multiple Pipes together for more complex processing
Pipe parsePipe = new Each( "insert", new Fields( "line" ), new RegexSplitter( new Fields( "num", "lower", "upper" ), " " ) );

// "plan" a cluster-executable Flow
// this connects the source Tap and hBaseTap (the sink Tap) to the parsePipe
Flow parseFlow = new FlowConnector( properties ).connect( source, hBaseTap, parsePipe );

// start the flow, and block until it completes
parseFlow.complete();

// open an iterator on the HBase table the data was written into
TupleEntryIterator iterator = parseFlow.openSink();

while(iterator.hasNext())
  {
  // print out each tuple from HBase
  System.out.println( "iterator.next() = " + iterator.next() );
  }

iterator.close();

Reference: http://hbase.apache.org/1.2/book.html#mapreduce.htable.access
