7.2. HBase MapReduce Examples
The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is emitted from the Mapper. The job would be defined as follows...
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs ...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
...and the mapper instance would extend TableMapper...
public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}
The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
The following is the example mapper, which creates a Put matching the input Result and emits it. Note: this is what the CopyTable utility does.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row,value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}
There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.
This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves.
TODO: example for MultiTableOutputFormat.
The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,            // output table
  MyTableReducer.class,   // reducer class
  job);
job.setNumReduceTasks(1);   // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
In this example mapper, a column with a String value is chosen as the value to summarize upon. This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter.
public static class MyMapper extends TableMapper<Text, IntWritable> {

  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    text.set(val);     // we can only emit Writables...

    context.write(text, ONE);
  }
}
In the reducer, the "ones" are counted (just like any other MR example that does this), and then a Put is emitted.
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  public static final byte[] CF = "cf".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COUNT, Bytes.toBytes(i));

    context.write(null, put);
  }
}
This is very similar to the summary example above, with the exception that it uses HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
job.setReducerClass(MyReducer.class);    // reducer class
job.setNumReduceTasks(1);    // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of one extending TableReducer and emitting Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}
It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of keys with the amounts to be incremented for each map task, and make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and the number of unique keys.
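The per-task buffering described above can be sketched in plain Java. This is only a minimal sketch under stated assumptions, not HBase API: `CountBuffer` and its `flush` callback are hypothetical names. In a real job the buffer would presumably be a field of the mapper, `increment` would be called from `map`, and `flush` would be called from `cleanup` with a callback that issues one `incrementColumnValue` call per key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper: accumulates per-key counts in memory for one map task. */
final class CountBuffer {
    private final Map<String, Long> counts = new HashMap<>();

    /** Called once per input row instead of issuing a remote increment. */
    void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Called once at the end of the task: hands each distinct key and its total to the sink. */
    void flush(BiConsumer<String, Long> sink) {
        counts.forEach(sink);
        counts.clear();
    }

    /** Number of distinct keys currently buffered. */
    int distinctKeys() {
        return counts.size();
    }
}
```

The trade-off is memory: the buffer grows with the number of distinct keys seen by one task, so with very many unique keys it may be better to flush whenever `distinctKeys()` passes some threshold.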
In the end, the summary results are in HBase.
Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.
It is critical to understand that the number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer: specifically, whether it is designed to run as a singleton (one reducer) or as multiple reducers. Neither is right or wrong; it depends on your use case. Recognize that the more reducers are assigned to the job, the more simultaneous connections to the RDBMS will be created. This will scale, but only to a point.
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection c = null;

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }
}
In the end, the summary results are written to your RDBMS table(s).
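First, a Scan can be configured with attributes such as startRow, stopRow, and filter. That gives two approaches:
1. Set a filter on the scan, run the mapper, then reduce everything into a single result.
2. Skip the filter and have the mapper do the filtering work instead.
I tested both. The former is more efficient than the latter when the scan covers relatively few records, but once the number of scanned records grows it tends to time out without returning, and the job exits. To implement the latter I had to learn how to pass parameters to mapper tasks, and I took a few detours along the way.
A final thought: the latter is still not very efficient, and even when the former is usable it is not efficient either, because the default TableMapper assigns the scan of one region to one mapper. Each of my regions is over 2 GB, and the data I query spans only seven or eight regions. So I wondered whether mappers could be split on something other than region boundaries. If that cannot be changed, the only option is to have MR operate directly on HBase's underlying HDFS files, which remains to be investigated.
Here is the code (for confidentiality I changed the table name, column names, and column family names; if I missed any, please pretend you didn't see them. Also: this is mainly meant as a reference for the approach, namely using MR to query massive amounts of HBase data, and for how to pass parameters to the mapper):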
package mapreduce.hbase;

import java.io.IOException;

import mapreduce.HDFS_File;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Searches HBase with MapReduce. Scan conditions such as the start key and
 * end key are supplied, plus filters to drop records that do not match.
 * Example RowKey in LicenseTable: 201101010000000095\xE5\xAE\x81WDTLBZ
 *
 * @author Wallace
 *
 */
@SuppressWarnings("unused")
public class MRSearchAuto {
    private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);

    private static String TABLE_NAME = "tablename";
    private static byte[] FAMILY_NAME = Bytes.toBytes("cfname");
    private static byte[][] QUALIFIER_NAME = { Bytes.toBytes("col1"),
            Bytes.toBytes("col2"), Bytes.toBytes("col3") };

    public static class SearchMapper extends
            TableMapper<ImmutableBytesWritable, Text> {
        private int numOfFilter = 0;

        private Text word = new Text();
        String[] strConditionStrings = new String[]{"","",""}/* { "新C87310", "10", "2" } */;

        /*
         * private void init(Configuration conf) throws IOException,
         * InterruptedException { strConditionStrings[0] =
         * conf.get("search.license").trim(); strConditionStrings[1] =
         * conf.get("search.carColor").trim(); strConditionStrings[2] =
         * conf.get("search.direction").trim(); LOG.info("license: " +
         * strConditionStrings[0]); }
         */
        // Called by the framework once per task, before any map() call:
        // read the parameters the driver stored in the job configuration.
        protected void setup(Context context) throws IOException,
                InterruptedException {
            strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
            strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
            strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
        }

        protected void map(ImmutableBytesWritable key, Result value,
                Context context) throws InterruptedException, IOException {
            String string = "";
            String tempString;

            /**/
            for (int i = 0; i < 1; i++) {
                // do the filter's work here, inside the map
                tempString = Text.decode(value.getValue(FAMILY_NAME,
                        QUALIFIER_NAME[i]));
                if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
                    LOG.info("新C87310. conf: " + strConditionStrings[0]);
                    if (tempString.equals(strConditionStrings[i])) {
                        string = string + tempString + " ";
                    } else {
                        return;
                    }
                }
                else {
                    return;
                }
            }

            word.set(string);
            context.write(null, word);
        }
    }

    public void searchHBase(int numOfDays) throws IOException,
            InterruptedException, ClassNotFoundException {
        long startTime;
        long endTime;

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node2,node3,node4");
        conf.set("fs.default.name", "hdfs://node1");
        conf.set("mapred.job.tracker", "node1:54311");
        /*
         * pass parameters to the map
         */
        conf.set("search.license", "新C87310");
        conf.set("search.color", "10");
        conf.set("search.direction", "2");

        Job job = new Job(conf, "MRSearchHBase");
        System.out.println("search.license: " + conf.get("search.license"));
        job.setNumReduceTasks(0);
        job.setJarByClass(MRSearchAuto.class);
        Scan scan = new Scan();
        scan.addFamily(FAMILY_NAME);
        byte[] startRow = Bytes.toBytes("2011010100000");
        byte[] stopRow;
        switch (numOfDays) {
        case 1:
            stopRow = Bytes.toBytes("2011010200000");
            break;
        case 10:
            stopRow = Bytes.toBytes("2011011100000");
            break;
        case 30:
            stopRow = Bytes.toBytes("2011020100000");
            break;
        case 365:
            stopRow = Bytes.toBytes("2012010100000");
            break;
        default:
            stopRow = Bytes.toBytes("2011010101000");
        }
        // set the start and stop keys
        scan.setStartRow(startRow);
        scan.setStopRow(stopRow);

        TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan,
                SearchMapper.class, ImmutableBytesWritable.class, Text.class,
                job);
        Path outPath = new Path("searchresult");
        HDFS_File file = new HDFS_File();
        file.DelFile(conf, outPath.getName(), true); // delete the output first if it already exists
        FileOutputFormat.setOutputPath(job, outPath);// where the results are written

        startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        endTime = System.currentTimeMillis();
        System.out.println("Time used: " + (endTime - startTime));
        System.out.println("startRow:" + Text.decode(startRow));
        System.out.println("stopRow: " + Text.decode(stopRow));
    }

    public static void main(String args[]) throws IOException,
            InterruptedException, ClassNotFoundException {
        MRSearchAuto mrSearchAuto = new MRSearchAuto();
        int numOfDays = 1;
        if (args.length == 1)
            numOfDays = Integer.valueOf(args[0]);
        System.out.println("Num of days: " + numOfDays);
        mrSearchAuto.searchHBase(numOfDays);
    }
}
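The hard-coded switch over numOfDays in the listing could also be computed from the date. The following is a minimal sketch, assuming the row key really does begin with a yyyyMMdd date followed by the "00000" suffix seen in the listing; `RowKeyRange` and its methods are hypothetical names, not part of the original code. Note that it is not a drop-in replacement: for 30 days it yields 2011013100000 where the listing hard-codes 2011020100000, and it has no equivalent of the listing's default case.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: derives scan boundaries from a start date and a day count. */
final class RowKeyRange {
    // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd, e.g. 20110101.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;
    // Assumption: the same suffix the listing appends to its date prefixes.
    private static final String SUFFIX = "00000";

    /** First row key of the range. */
    static String startRow(LocalDate firstDay) {
        return firstDay.format(DAY) + SUFFIX;
    }

    /** Row key at which the scan stops: the start of the day numOfDays after firstDay. */
    static String stopRow(LocalDate firstDay, int numOfDays) {
        if (numOfDays < 1) {
            throw new IllegalArgumentException("numOfDays must be at least 1");
        }
        return firstDay.plusDays(numOfDays).format(DAY) + SUFFIX;
    }
}
```

The resulting strings would then go through Bytes.toBytes, as in the listing, before being handed to the Scan.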
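At first, I set the parameters with conf.set in the driver, then read them in the mapper's init(Configuration) method and assigned them to the mapper's fields.
When the parameters were passed to the map this way, the results were wrong:
for (int i = 0; i < 1; i++) {
    // do the filter's work here, inside the map
    tempString = Text.decode(value.getValue(FAMILY_NAME,
            QUALIFIER_NAME[i]));
    if (tempString.equals(/*strConditionStrings[i]*/"新C87310"))
        string = string + tempString + " ";
    else {
        return;
    }
}
If the mapper's init below is used to fetch the parameters from conf, and they are then used in the map function above, the results are wrong.
Hard-coding the value versus passing the very same value in as a parameter produced 1 output record and 0 output records, respectively.
private void init(Configuration conf) throws IOException,
        InterruptedException {
    strConditionStrings[0] = conf.get("search.licenseNumber").trim();
    strConditionStrings[1] = conf.get("search.carColor").trim();
    strConditionStrings[2] = conf.get("search.direction").trim();
}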
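I added a logger:
private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);
In init():
LOG.info("license: " + strConditionStrings[0]);
In map():
if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
    LOG.info("新C87310. conf: " + strConditionStrings[0]);
Then I opened the job in the web UI at namenode:50030, tracked down which machine had run that map task, and read its log:
mapreduce.hbase.TestMRHBase: 新C87310. conf: null
I had also logged the value right after conf.set, where it was correct, yet inside map it was null, and the log statement in the mapper's init function never printed anything.
So the problem must be:
The mapper's init() function is never executed!
As a result, the code in init() that reads the parameters from conf and assigns them to the mapper's fields never ran, and neither did its log statement.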
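Why init() is never reached can be illustrated with a small model of the mapper lifecycle. Hadoop's Mapper.run calls setup once, then map for every record, then cleanup; nothing in that sequence knows about a method named init. The sketch below is a simplified stand-in written in plain Java, not the Hadoop classes themselves: `MiniMapper` and `LifecycleDemo` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the framework's mapper base class (not the Hadoop class). */
abstract class MiniMapper {
    protected void setup() {}

    protected abstract void map(String record);

    protected void cleanup() {}

    /** The framework drives every task through this fixed sequence. */
    final void run(List<String> records) {
        setup();
        for (String record : records) {
            map(record);
        }
        cleanup();
    }
}

/** Records which lifecycle methods were actually invoked. */
final class LifecycleDemo extends MiniMapper {
    final List<String> calls = new ArrayList<>();

    // A custom method like this is never invoked by run().
    void init() {
        calls.add("init");
    }

    @Override
    protected void setup() {
        calls.add("setup");
    }

    @Override
    protected void map(String record) {
        calls.add("map:" + record);
    }

    @Override
    protected void cleanup() {
        calls.add("cleanup");
    }
}
```

Running the demo over two records leaves setup, the two map calls, and cleanup in `calls`, and no init entry, which matches the missing log line observed above.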
OK! Now, how to fix it?
Fetch the parameters in setup instead:
protected void setup(Context context) throws IOException,
        InterruptedException {
    // strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
    // strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
    // strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
}
This failed with an error:
12/01/12 11:21:56 INFO mapred.JobClient: map 0% reduce 0%
12/01/12 11:22:03 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_0, Status : FAILED
java.lang.NullPointerException
at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201100941_0071_m_000000_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201201100941_0071_m_000000_0: log4j:WARN Please initialize the log4j system properly.
12/01/12 11:22:09 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_1, Status : FAILED
java.lang.NullPointerException
at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
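Then I commented out the body of setup and the error disappeared, so the problem had to be related to context. To confirm, I assigned the values directly inside setup without using context, and got results. Good!
So it was indeed the context: given the NullPointerException, one of the calls such as context.getConfiguration().get("search.license") must have been returning null.
Then I suddenly remembered: I had changed the property names in the get calls but not in the set calls, so they no longer matched. As a result, context.getConfiguration().get("search.color") and the item below it both returned null, and calling trim() on null threw the exception.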
conf.set("search.license", "新C87310");
conf.set("search.color", "10");
conf.set("search.direction", "2");
After this change, the problem was solved.
Passing parameters to the map now works.
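The bare NullPointerException made the mismatched key hard to spot. One way to get a clearer failure is to check for null before trimming and to name the missing key in the error. This is a minimal sketch, not part of the original code: `JobParams.require` is a hypothetical helper, and the lookup is passed in as a function so the sketch stays plain Java. In a mapper it could presumably be called as `JobParams.require(k -> context.getConfiguration().get(k), "search.color")`.

```java
import java.util.function.Function;

/** Hypothetical helper for reading required job parameters. */
final class JobParams {
    /**
     * Looks up a required parameter and trims it.
     * Throws an exception naming the key if the lookup returns null,
     * instead of failing later with an anonymous NullPointerException.
     */
    static String require(Function<String, String> lookup, String key) {
        String value = lookup.apply(key);
        if (value == null) {
            throw new IllegalStateException("Missing job parameter: " + key);
        }
        return value.trim();
    }
}
```

With this in place, a key that was set under one name and read under another fails with a message that contains the name being read, which points straight at the mismatch.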