HBase MapReduce Examples
HBase MapReduce Read Example
The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows…
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
    tableName,        // input HBase table name
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper
    null,             // mapper output key
    null,             // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
…and the mapper instance would extend TableMapper…
public static class MyMapper extends TableMapper<Text, Text> {

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
        // process data for the row from the Result instance.
    }
}
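As a rough illustration only (the column family cf and qualifier attr1 below are hypothetical, not part of the original example), the body of such a read-only mapper might inspect each Result and update a job counter while still emitting nothing, which is consistent with the NullOutputFormat set above:

public static class MyRowCountingMapper extends TableMapper<Text, Text> {

    // hypothetical column, used purely for illustration
    private static final byte[] CF    = Bytes.toBytes("cf");
    private static final byte[] ATTR1 = Bytes.toBytes("attr1");

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
        // read one cell from the Result and count rows that contain it;
        // nothing is written to the context, so the mapper produces no output
        byte[] cellValue = value.getValue(CF, ATTR1);
        if (cellValue != null) {
            context.getCounter("ExampleRead", "ROWS_WITH_ATTR1").increment(1);
        }
    }
}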
HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example simply copies data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
    sourceTable,      // input table
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper class
    null,             // mapper output key
    null,             // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    targetTable,      // output table
    null,             // reducer class
    job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
An explanation is needed of what TableMapReduceUtil is doing here, especially with the reducer. TableOutputFormat is being used as the outputFormat class, several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), the reducer output key is set to ImmutableBytesWritable, and the reducer value to Writable. These could all be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
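For reference, the following is a minimal sketch of roughly what initTableReducerJob configures on your behalf. It is not a complete replacement for the utility (it omits dependency jars, output-cluster settings, and so on), and it reuses targetTable from the example above:

// roughly what TableMapReduceUtil.initTableReducerJob sets up behind the scenes
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, targetTable);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Writable.class);
// a non-null reducer class would also be registered here via job.setReducerClass(...)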
The following is the example mapper, which will create a Put matching the input Result and emit it.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
        // this example is just copying the data from the source table...
        context.write(row, resultToPut(row, value));
    }

    private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
        Put put = new Put(key.get());
        for (KeyValue kv : result.raw()) {
            put.add(kv);
        }
        return put;
    }
}
There isn't actually a reducer step here, so TableOutputFormat takes care of sending the Put to the target table. This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves.
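As one hedged sketch of that alternative (class and table names below are illustrative, and it assumes the HBase 1.0+ client API), a mapper could open its own connection in setup() and write each Put through a BufferedMutator instead of emitting it; such a job would use NullOutputFormat and zero reducers rather than initTableReducerJob:

public static class MyDirectWriteMapper extends TableMapper<NullWritable, NullWritable> {

    private Connection connection;
    private BufferedMutator mutator;

    @Override
    protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        // "targetTable" is the destination table, as in the example above
        mutator = connection.getBufferedMutator(TableName.valueOf("targetTable"));
    }

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
        Put put = new Put(row.get());
        for (Cell cell : value.rawCells()) {
            put.add(cell);
        }
        mutator.mutate(put);   // buffered write straight to the target table
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        mutator.close();       // flushes any buffered Puts
        connection.close();
    }
}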