2021SC@SDUSC
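Overview
This post continues the analysis of Apache Pig, a lightweight scripting language for working with Hadoop, and looks at the code of the MultiQueryOptimizer class under the executionengine package.
The class extends MROpPlanVisitor and implements an optimizer that merges all or some of the splittee MapReduceOpers into the splitter MapReduceOper. It merges these MapReduceOpers by replacing each POLoad/POStore pair with a POSplit operator.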
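The isDiamondMROper method
An MR operator is removed as part of the diamond query optimization only when it is a trivial one. As the code shows, it must be map-only, and its plan must have either two operators (a load followed by a store) or three operators (where the operator between the load and the store must be a foreach, introduced by a cast operation).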
private boolean isDiamondMROper(MapReduceOper mr) {
    boolean rtn = false;
    if (isMapOnly(mr)) {
        PhysicalPlan pl = mr.mapPlan;
        if (pl.size() == 2 || pl.size() == 3) {
            PhysicalOperator root = pl.getRoots().get(0);
            PhysicalOperator leaf = pl.getLeaves().get(0);
            if (root instanceof POLoad && leaf instanceof POStore) {
                if (pl.size() == 3) {
                    PhysicalOperator mid = pl.getSuccessors(root).get(0);
                    if (mid instanceof POForEach) {
                        rtn = true;
                    }
                } else {
                    rtn = true;
                }
            }
        }
    }
    return rtn;
}
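The shape test above can be tried out in isolation with a minimal sketch. It assumes a linear, map-only plan represented as a plain list of operator kinds. The OpKind enum and the DiamondSketch class are hypothetical stand-ins for illustration and are not part of Pig.

```java
import java.util.Arrays;
import java.util.List;

final class DiamondSketch {
    // Hypothetical, simplified stand-in for Pig's physical operator types.
    enum OpKind { LOAD, FOREACH, FILTER, STORE }

    // Mirrors the shape test in isDiamondMROper for a linear plan:
    // LOAD -> STORE, or LOAD -> FOREACH -> STORE.
    static boolean isTrivialPlan(List<OpKind> plan) {
        if (plan.size() != 2 && plan.size() != 3) {
            return false;
        }
        boolean loadToStore = plan.get(0) == OpKind.LOAD
                && plan.get(plan.size() - 1) == OpKind.STORE;
        if (!loadToStore) {
            return false;
        }
        // With three operators, the middle one must be a FOREACH.
        return plan.size() == 2 || plan.get(1) == OpKind.FOREACH;
    }

    public static void main(String[] args) {
        System.out.println(isTrivialPlan(
                Arrays.asList(OpKind.LOAD, OpKind.STORE)));
        System.out.println(isTrivialPlan(
                Arrays.asList(OpKind.LOAD, OpKind.FILTER, OpKind.STORE)));
    }
}
```

In this sketch a plan with a filter between the load and the store is rejected, which matches the original method accepting only a foreach as the middle operator.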
The isSplitteeMergeable method
private boolean isSplitteeMergeable(MapReduceOper splittee) {
    if (splittee.isGlobalSort() || splittee.isLimitAfterSort()) {
        log.info("Cannot merge this splittee: " +
                "it is global sort or limit after sort");
        return false;
    }
    PhysicalOperator leaf = splittee.mapPlan.getLeaves().get(0);
    if (!(leaf instanceof POLocalRearrange) &&
            !(leaf instanceof POSplit)) {
        log.info("Cannot merge this splittee: " +
                "its map plan doesn't end with LR or Split operator: "
                + leaf.getClass().getName());
        return false;
    }
    if (splittee.needsDistinctCombiner()) {
        log.info("Cannot merge this splittee: " +
                "it has distinct combiner.");
        return false;
    }
    return true;
}
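if (splittee.isGlobalSort() || splittee.isLimitAfterSort())
Checks the sort type: the splittee must not be a global sort or a limit after sort, because these use a different partitioner.
if (!(leaf instanceof POLocalRearrange) && !(leaf instanceof POSplit))
Checks the leaf of the map plan: only splittees whose map plan ends with a local rearrange or a split are merged.
if (splittee.needsDistinctCombiner())
The splittee must not need a distinct combiner, because that case uses a different combiner.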
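The three checks can be summarized in a small sketch that returns the reason for rejecting a splittee instead of logging it. The Splittee class, the LeafKind enum, and their fields are hypothetical simplifications of the properties that the real method reads from a MapReduceOper.

```java
final class SplitteeSketch {
    // Hypothetical stand-in for the type of the map plan's leaf operator.
    enum LeafKind { LOCAL_REARRANGE, SPLIT, STORE }

    // Hypothetical summary of the properties consulted by the checks.
    static final class Splittee {
        final boolean globalSort;
        final boolean limitAfterSort;
        final LeafKind mapLeaf;
        final boolean distinctCombiner;

        Splittee(boolean globalSort, boolean limitAfterSort,
                 LeafKind mapLeaf, boolean distinctCombiner) {
            this.globalSort = globalSort;
            this.limitAfterSort = limitAfterSort;
            this.mapLeaf = mapLeaf;
            this.distinctCombiner = distinctCombiner;
        }
    }

    // Returns null when the splittee can be merged, otherwise the reason
    // it was rejected. The checks run in the same order as the original.
    static String rejectionReason(Splittee s) {
        if (s.globalSort || s.limitAfterSort) {
            return "global sort or limit after sort";
        }
        if (s.mapLeaf != LeafKind.LOCAL_REARRANGE && s.mapLeaf != LeafKind.SPLIT) {
            return "map plan doesn't end with LR or Split";
        }
        if (s.distinctCombiner) {
            return "distinct combiner";
        }
        return null;
    }

    static boolean isMergeable(Splittee s) {
        return rejectionReason(s) == null;
    }

    public static void main(String[] args) {
        Splittee ok = new Splittee(false, false, LeafKind.LOCAL_REARRANGE, false);
        Splittee sorted = new Splittee(true, false, LeafKind.LOCAL_REARRANGE, false);
        System.out.println(isMergeable(ok));
        System.out.println(rejectionReason(sorted));
    }
}
```

Returning the reason makes the order of the checks visible: a splittee that fails more than one condition is reported only for the first one, just as the original method returns at the first failed check.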
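The mergeMapReduceSplittees method
The splitter has a non-empty reducer, so the MR splittees cannot be merged into the splitter itself. Instead, multiple splittees (if there are any) are merged into a new MR operator, which is then connected to the splitter.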
private int mergeMapReduceSplittees(List<MapReduceOper> mapReducers,
        MapReduceOper splitter) throws VisitorException {
    List<MapReduceOper> mergeList = getMergeList(splitter, mapReducers);
    if (mergeList.size() <= 1) {
        return 0;
    }
    MapReduceOper mrOper = getMROper();
    MapReduceOper splittee = mergeList.get(0);
    PhysicalPlan pl = splittee.mapPlan;
    POLoad load = (POLoad)pl.getRoots().get(0);
    mrOper.mapPlan.add(load);
    try {
        mrOper.mapPlan.addAsLeaf(getStore());
    } catch (PlanException e) {
        int errCode = 2137;
        String msg = "Internal Error. Unable to add store to the plan as leaf for optimization.";
        throw new OptimizerException(msg, errCode, PigException.BUG, e);
    }
    try {
        getPlan().add(mrOper);
        getPlan().connect(splitter, mrOper);
    } catch (PlanException e) {
        int errCode = 2133;
        String msg = "Internal Error. Unable to connect splitter with successors for optimization.";
        throw new OptimizerException(msg, errCode, PigException.BUG, e);
    }
    mergeAllMapReduceSplittees(mergeList, mrOper, getSplit());
    return (mergeList.size() - 1);
}
try {
    mrOper.mapPlan.addAsLeaf(getStore());
} catch (PlanException e) {
    int errCode = 2137;
    String msg = "Internal Error. Unable to add store to the plan as leaf for optimization.";
    throw new OptimizerException(msg, errCode, PigException.BUG, e);
}
Adds a dummy store operator, which will later be replaced by the split operator.
try {
    getPlan().add(mrOper);
    getPlan().connect(splitter, mrOper);
} catch (PlanException e) {
    int errCode = 2133;
    String msg = "Internal Error. Unable to connect splitter with successors for optimization.";
    throw new OptimizerException(msg, errCode, PigException.BUG, e);
}
Connects the new MR operator to the splitter.
mergeAllMapReduceSplittees(mergeList, mrOper, getSplit());
Merges the splittees into the new MR operator.
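The return value, mergeList.size() - 1, is easiest to read as the number of operators the plan shrinks by. The sketch below illustrates that bookkeeping on a plan modeled as a list of operator names. It assumes that the merged splittees are removed from the plan once their contents move into the new operator; that is an assumption about mergeAllMapReduceSplittees, whose body is not shown here. The MergeCountSketch class and the "merged" name are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeCountSketch {
    // Replaces the mergeable splittees with one hypothetical merged
    // operator and returns how many operators the plan shrank by.
    static int mergeSplittees(List<String> planOps, List<String> mergeList) {
        if (mergeList.size() <= 1) {
            return 0; // zero or one splittee: nothing to merge
        }
        // Assumption: the merged splittees leave the plan.
        planOps.removeAll(mergeList);
        // One new operator takes their place after the splitter.
        planOps.add("merged");
        return mergeList.size() - 1;
    }

    public static void main(String[] args) {
        List<String> plan = new ArrayList<>(
                Arrays.asList("splitter", "a", "b", "c"));
        int saved = mergeSplittees(plan, Arrays.asList("a", "b", "c"));
        System.out.println(saved + " fewer operators, plan = " + plan);
    }
}
```

Three splittees collapse into one new operator, so the plan ends up with two operators fewer. With a single mergeable splittee the sketch leaves the plan untouched, matching the early return of 0 in the original method.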
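Summary
This concludes the semester's code analysis of Apache Pig. Overall, the analysis gave me a much clearer understanding of the back end of Pig as a tool for analyzing data sets.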