During query execution, Simba performs two main SparkPlan optimizations: 1) index-based predicate pushdown, and 2) dynamic calculation of the partition count for some of its partitioners. Readers fuzzy on these concepts can refer to the earlier posts in this series.
Index-Based Predicate Pushdown
To reduce the amount of data fed into join computation, Simba implements predicate pushdown tailored to spatial joins. This is realized mainly by three classes: org.apache.spark.sql.simba.SimbaSessionState, org.apache.spark.sql.simba.SimbaOptimizer, and org.apache.spark.sql.simba.util.PredicateUtil. By extending SessionState and subclassing SparkOptimizer, Simba adds optimization rules over the logicalPlan that implement the pushdown.
First, when a logicalPlan is converted into a physicalPlan, Simba pushes down the predicates that involve indexed columns:
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
  case PhysicalOperation(projectList, filters, indexed: IndexedRelation) =>
    val predicatesCanBeIndexed = selectFilter(filters)
    // if all predicates can be answered by the index, drop the parent Filter
    val parentFilter =
      if (predicatesCanBeIndexed.toString // TODO ugly hack
        .compareTo(Seq(filters.reduceLeftOption(And).getOrElse(true)).toString) == 0) Seq[Expression]()
      else filters
    pruneFilterProjectionForIndex(
      projectList,
      parentFilter,
      identity[Seq[Expression]],
      IndexedRelationScan(_, predicatesCanBeIndexed, indexed)) :: Nil
  case _ => Nil
}
(Figure: Simba predicate pushdown, illustrated)
The Optimizer class then matches filters that sit above a Spatial Join and splits their conditions into conjunctive predicates. Predicates that reference only the left or only the right side are pushed down onto the corresponding child, yielding newLeft and newRight; the Spatial Join is then performed on newLeft and newRight with the remaining commonFilters merged into the join condition:
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  // push the where condition down into join filter
  case f @ Filter(filterCondition, SpatialJoin(left, right, joinType, joinCondition)) =>
    val (leftFilterConditions, rightFilterConditions, commonFilterCondition) =
      split(splitConjunctivePredicates(filterCondition), left, right)
    val newLeft = leftFilterConditions.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
    val newRight = rightFilterConditions.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
    val newJoinCond = (commonFilterCondition ++ joinCondition).reduceLeftOption(And)
    SpatialJoin(newLeft, newRight, joinType, newJoinCond)

  // push down the join filter into sub query scanning if applicable
  case f @ SpatialJoin(left, right, joinType, joinCondition) =>
    val (leftJoinConditions, rightJoinConditions, commonJoinCondition) =
      split(joinCondition.map(splitConjunctivePredicates).getOrElse(Nil), left, right)
    val newLeft = leftJoinConditions.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
    val newRight = rightJoinConditions.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
    val newJoinCond = commonJoinCondition.reduceLeftOption(And)
    SpatialJoin(newLeft, newRight, joinType, newJoinCond)
}
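The split step above is the heart of the rule: a conjunction is partitioned into predicates that reference only the left child, only the right child, or both sides. A minimal self-contained sketch of that partitioning (the Pred type and column-set representation here are hypothetical simplifications, not Simba's expression API):

```scala
// Hypothetical model: a predicate is just the set of columns it references.
case class Pred(refs: Set[String])

// Partition conjunctive predicates into left-only, right-only, and cross-side groups.
def split(preds: Seq[Pred], leftCols: Set[String], rightCols: Set[String])
    : (Seq[Pred], Seq[Pred], Seq[Pred]) = {
  val (leftOnly, rest)    = preds.partition(_.refs.subsetOf(leftCols))
  val (rightOnly, common) = rest.partition(_.refs.subsetOf(rightCols))
  (leftOnly, rightOnly, common)
}

// e.g. a.x < 1 AND b.y > 2 AND dist(a.p, b.p) < 3
val (l, r, c) = split(
  Seq(Pred(Set("a.x")), Pred(Set("b.y")), Pred(Set("a.p", "b.p"))),
  leftCols = Set("a.x", "a.p"), rightCols = Set("b.y", "b.p"))
// l gets the a.x predicate, r the b.y predicate,
// and the cross-side distance predicate stays in c for the join condition
```

Only the cross-side group has to survive as the join condition; everything else is evaluated before the join, which is exactly what shrinks the join's input.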
Dynamic Partition-Count Calculation
Partitioning data is a very frequent operation in Simba's spatial join computation, and the choice of partition count matters a great deal, so Simba optimizes how some of its partitioners pick this number. The calculation comes in two flavors: static and dynamic. When the dataset is fixed, computing a partition count is usually straightforward; it is harder for intermediate results whose size and shape cannot be known in advance. Simba's dynamic partition-count logic is as follows:
val (data_bounds, total_size) = {
  rdd.aggregate[(Bounds, Long)]((null, 0))((bound, data) => {
    val new_bound = if (bound._1 == null) {
      Bounds(data._1.coord, data._1.coord)
    } else {
      Bounds(bound._1.min.zip(data._1.coord).map(x => Math.min(x._1, x._2)),
        bound._1.max.zip(data._1.coord).map(x => Math.max(x._1, x._2)))
    }
    (new_bound, bound._2 + SizeEstimator.estimate(data._1))
  }, (left, right) => {
    val new_bound = {
      if (left._1 == null) right._1
      else if (right._1 == null) left._1
      else {
        Bounds(left._1.min.zip(right._1.min).map(x => Math.min(x._1, x._2)),
          left._1.max.zip(right._1.max).map(x => Math.max(x._1, x._2)))
      }
    }
    (new_bound, left._2 + right._2)
  })
}

val seed = System.currentTimeMillis()
// TODO a better sample strategy is needed
val sampled = if (total_size * sample_rate <= 0.05 * transfer_threshold) {
  rdd.mapPartitions(part => part.map(_._1)).collect()
} else if (total_size * sample_rate <= transfer_threshold) {
  rdd.sample(withReplacement = false, sample_rate, seed).map(_._1).collect()
} else {
  rdd.sample(withReplacement = false, transfer_threshold.toDouble / total_size, seed)
    .map(_._1).collect()
}
It first computes the global bounds of the data distribution and the total in-memory size, then picks a sampling strategy suited to that size: if the whole dataset is small enough it is collected outright, otherwise a sample rate is chosen so that as much data as possible is drawn without exceeding the transfer threshold. Finally, the sampled data and the global bounds are fed to the STR algorithm to determine the partitioning.
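The bounds aggregation in the first half of that snippet is simply an elementwise min/max fold over point coordinates. A self-contained sketch under that reading (this Bounds case class is a simplified stand-in for Simba's, and mergePoint models only the per-record half of the aggregate):

```scala
// Simplified stand-in for Simba's Bounds: per-dimension min/max of points seen so far.
case class Bounds(min: Array[Double], max: Array[Double])

// Fold one point into the running bounds; null plays the role of Simba's empty accumulator.
def mergePoint(b: Bounds, coord: Array[Double]): Bounds =
  if (b == null) Bounds(coord.clone(), coord.clone())
  else Bounds(b.min.zip(coord).map { case (a, c) => math.min(a, c) },
              b.max.zip(coord).map { case (a, c) => math.max(a, c) })

val pts = Seq(Array(1.0, 5.0), Array(3.0, 2.0), Array(-1.0, 4.0))
val bounds = pts.foldLeft(null: Bounds)(mergePoint)
// bounds.min is [-1.0, 2.0], bounds.max is [3.0, 5.0]
```

The combiner side of the aggregate does the same merge on two Bounds values, with the null checks handling partitions that happened to be empty.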