HBase MapReduce Examples
HBase MapReduce Read Example
The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows…
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
    tableName,        // input HBase table name
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper
    null,             // mapper output key
    null,             // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
…and the mapper instance would extend TableMapper…
public static class MyMapper extends TableMapper<Text, Text> {

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
        // process data for the row from the Result instance.
    }
}
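As a rough illustration only (the column family cf and qualifier attr1 below are hypothetical, not part of the original example), the body of such a read-only mapper might inspect each Result and update a job counter while still emitting nothing, which is consistent with the NullOutputFormat set above:

public static class MyRowCountingMapper extends TableMapper<Text, Text> {

    // hypothetical column, used purely for illustration
    private static final byte[] CF    = Bytes.toBytes("cf");
    private static final byte[] ATTR1 = Bytes.toBytes("attr1");

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
        // read one cell from the Result and count rows that contain it;
        // nothing is written to the context, so the mapper produces no output
        byte[] cellValue = value.getValue(CF, ATTR1);
        if (cellValue != null) {
            context.getCounter("ExampleRead", "ROWS_WITH_ATTR1").increment(1);
        }
    }
}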
HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example simply copies data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
    sourceTable,      // input table
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper class
    null,             // mapper output key
    null,             // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    targetTable,      // output table
    null,             // reducer class
    job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
An explanation is needed of what TableMapReduceUtil is doing here, especially with the reducer. TableOutputFormat is being used as the outputFormat class, several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), the reducer output key is set to ImmutableBytesWritable, and the reducer value to Writable. These could all be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
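For reference, the following is a minimal sketch of roughly what initTableReducerJob configures on your behalf. It is not a complete replacement for the utility (it omits dependency jars, output-cluster settings, and so on), and it reuses targetTable from the example above:

// roughly what TableMapReduceUtil.initTableReducerJob sets up behind the scenes
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, targetTable);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Writable.class);
// a non-null reducer class would also be registered here via job.setReducerClass(...)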
The following is the example mapper, which will create a Put matching the input Result and emit it.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
        // this example is just copying the data from the source table...
        context.write(row, resultToPut(row, value));
    }

    private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
        Put put = new Put(key.get());
        for (KeyValue kv : result.raw()) {
            put.add(kv);
        }
        return put;
    }
}
There isn't actually a reducer step here, so TableOutputFormat takes care of sending the Put to the target table. This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves.
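As one hedged sketch of that alternative (class and table names below are illustrative, and it assumes the HBase 1.0+ client API), a mapper could open its own connection in setup() and write each Put through a BufferedMutator instead of emitting it; such a job would use NullOutputFormat and zero reducers rather than initTableReducerJob:

public static class MyDirectWriteMapper extends TableMapper<NullWritable, NullWritable> {

    private Connection connection;
    private BufferedMutator mutator;

    @Override
    protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        // "targetTable" is the destination table, as in the example above
        mutator = connection.getBufferedMutator(TableName.valueOf("targetTable"));
    }

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
        Put put = new Put(row.get());
        for (Cell cell : value.rawCells()) {
            put.add(cell);
        }
        mutator.mutate(put);   // buffered write straight to the target table
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        mutator.close();       // flushes any buffered Puts
        connection.close();
    }
}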