The number of map tasks in a MapReduce job is determined by the InputFormat's getSplits method. For FileInputFormat, splits are computed from the input files (one or more splits per file, depending on block size), so the input files drive the map count. The maximum number of tasks that can run at the same time is governed by the YARN configuration, roughly CPU vcores per node × number of nodes (e.g., 10 nodes × 8 vcores allows about 80 concurrent containers). Together, these two factors determine the number of containers running at once (Running Containers).
For a MapReduce program built with TableMapReduceUtil that reads its input from HBase, the map count is controlled by TableInputFormat: by default, one map per region involved in the scan (there are also tuning knobs such as hbase.mapreduce.inputtable.shufflemaps and hbase.mapreduce.input.autobalance, which default to false).
If you want to customize the number of maps for a job that operates on HBase, a convenient approach is to subclass TableInputFormat and override its getSplits method. The example below controls the map count through a configuration parameter, hbase.mapreduce.icare.mergeStep, merging consecutive regions into splits that are roughly balanced by region size:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class HBaseInputFormat extends TableInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Start from the default one-split-per-region result.
        List<InputSplit> inputSplits = super.getSplits(context);
        List<InputSplit> newInputSplits = new ArrayList<InputSplit>();
        super.initialize(context);
        TableName tName = super.getTable().getName();
        // Roughly how many regions to merge into one map; defaults to 13.
        int mergeStep = context.getConfiguration().getInt(
                "hbase.mapreduce.icare.mergeStep", 13);

        // Compute the average region size to derive a per-split size target.
        long totalRegionSize = 0;
        for (int i = 0; i < inputSplits.size(); i++) {
            TableSplit ts = (TableSplit) inputSplits.get(i);
            totalRegionSize += ts.getLength();
        }
        long averageRegionSize = totalRegionSize / inputSplits.size();
        long splitTotalSize = averageRegionSize * mergeStep;

        // Greedily merge consecutive regions until a split reaches the target.
        int index = 0;
        while (index < inputSplits.size()) {
            TableSplit ts = (TableSplit) inputSplits.get(index);
            long totalSize = ts.getLength();
            byte[] splitStartKey = ts.getStartRow();
            byte[] splitEndKey = ts.getEndRow();
            index++;
            for (; index < inputSplits.size(); index++) {
                TableSplit nextRegion = (TableSplit) inputSplits.get(index);
                long nextRegionSize = nextRegion.getLength();
                if (totalSize + nextRegionSize < splitTotalSize) {
                    totalSize += nextRegionSize;
                    splitEndKey = nextRegion.getEndRow();
                } else {
                    break;
                }
            }
            // Note: the merged split keeps only the first region's location,
            // so data locality beyond the first region is lost.
            TableSplit tsNew = new TableSplit(tName, splitStartKey,
                    splitEndKey, ts.getRegionLocation(), totalSize);
            newInputSplits.add(tsNew);
        }
        super.closeTable();
        return newInputSplits;
    }
}
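To see how mergeStep translates region count into map count, the merge loop above can be isolated as a standalone sketch with no HBase dependency. The class and method names (MergeStepDemo, mergeBySize) are illustrative, not part of the original code; the merge logic mirrors the getSplits override, operating directly on per-region byte sizes:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeStepDemo {

    // Returns the total size of each merged split, using the same greedy
    // rule as the getSplits override: absorb following regions while the
    // running size stays strictly below averageRegionSize * mergeStep.
    static List<Long> mergeBySize(long[] regionSizes, int mergeStep) {
        long total = 0;
        for (long s : regionSizes) {
            total += s;
        }
        long averageRegionSize = total / regionSizes.length;
        long splitTotalSize = averageRegionSize * mergeStep;

        List<Long> merged = new ArrayList<Long>();
        int index = 0;
        while (index < regionSizes.length) {
            long size = regionSizes[index++];
            while (index < regionSizes.length
                    && size + regionSizes[index] < splitTotalSize) {
                size += regionSizes[index++];
            }
            merged.add(size);
        }
        return merged;
    }

    public static void main(String[] args) {
        // 8 equally sized regions, mergeStep = 4.
        long[] sizes = {100, 100, 100, 100, 100, 100, 100, 100};
        List<Long> splits = mergeBySize(sizes, 4);
        System.out.println(splits.size() + " splits: " + splits);
        // → 3 splits: [300, 300, 200]
    }
}
```

Note that because the comparison is strict (`<`), equally sized regions pack at most mergeStep − 1 regions per split, which is why mergeStep = 4 above yields splits of three regions, not four.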