During query execution, Simba performs two main SparkPlan optimizations: 1) index-based predicate pushdown, and 2) dynamic calculation of the partition count for some of its partitioners. Readers fuzzy on these concepts can refer to the earlier posts in this series.
Index-Based Predicate Pushdown
To reduce the amount of data fed into join computation, Simba implements predicate pushdown tailored to spatial joins. This is realized mainly by three classes: org.apache.spark.sql.simba.SimbaSessionState, org.apache.spark.sql.simba.SimbaOptimizer, and org.apache.spark.sql.simba.util.PredicateUtil. By extending SessionState and subclassing SparkOptimizer, Simba adds optimization rules over the logicalPlan that implement the pushdown.
First, when a logicalPlan is converted into a physicalPlan, Simba pushes down the predicates that involve indexed columns:
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
  case PhysicalOperation(projectList, filters, indexed: IndexedRelation) =>
    val predicatesCanBeIndexed = selectFilter(filters)
    // if all predicates can be answered by the index, drop the parent Filter
    val parentFilter =
      if (predicatesCanBeIndexed.toString // TODO ugly hack
        .compareTo(Seq(filters.reduceLeftOption(And).getOrElse(true)).toString) == 0) Seq[Expression]()
      else filters
    pruneFilterProjectionForIndex(
      projectList,
      parentFilter,
      identity[Seq[Expression]],
      IndexedRelationScan(_, predicatesCanBeIndexed, indexed)) :: Nil
  case _ => Nil
}
(Figure: Simba predicate pushdown, illustrated)
The Optimizer class then matches filters that sit above a Spatial Join and splits their conditions into conjunctive predicates. Predicates that reference only the left or only the right side are pushed down onto the corresponding child, yielding newLeft and newRight; the Spatial Join is then performed on newLeft and newRight with the remaining commonFilters merged into the join condition:
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  // push the where condition down into join filter
  case f @ Filter(filterCondition, SpatialJoin(left, right, joinType, joinCondition)) =>
    val (leftFilterConditions, rightFilterConditions, commonFilterCondition) =
      split(splitConjunctivePredicates(filterCondition), left, right)
    val newLeft = leftFilterConditions.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
    val newRight = rightFilterConditions.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
    val newJoinCond = (commonFilterCondition ++ joinCondition).reduceLeftOption(And)
    SpatialJoin(newLeft, newRight, joinType, newJoinCond)

  // push down the join filter into sub query scanning if applicable
  case f @ SpatialJoin(left, right, joinType, joinCondition) =>
    val (leftJoinConditions, rightJoinConditions, commonJoinCondition) =
      split(joinCondition.map(splitConjunctivePredicates).getOrElse(Nil), left, right)
    val newLeft = leftJoinConditions.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
    val newRight = rightJoinConditions.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
    val newJoinCond = commonJoinCondition.reduceLeftOption(And)
    SpatialJoin(newLeft, newRight, joinType, newJoinCond)
}
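The split step above is the heart of the rule: a conjunction is partitioned into predicates that reference only the left child, only the right child, or both sides. A minimal self-contained sketch of that partitioning (the Pred type and column-set representation here are hypothetical simplifications, not Simba's expression API):

```scala
// Hypothetical model: a predicate is just the set of columns it references.
case class Pred(refs: Set[String])

// Partition conjunctive predicates into left-only, right-only, and cross-side groups.
def split(preds: Seq[Pred], leftCols: Set[String], rightCols: Set[String])
    : (Seq[Pred], Seq[Pred], Seq[Pred]) = {
  val (leftOnly, rest)    = preds.partition(_.refs.subsetOf(leftCols))
  val (rightOnly, common) = rest.partition(_.refs.subsetOf(rightCols))
  (leftOnly, rightOnly, common)
}

// e.g. a.x < 1 AND b.y > 2 AND dist(a.p, b.p) < 3
val (l, r, c) = split(
  Seq(Pred(Set("a.x")), Pred(Set("b.y")), Pred(Set("a.p", "b.p"))),
  leftCols = Set("a.x", "a.p"), rightCols = Set("b.y", "b.p"))
// l gets the a.x predicate, r the b.y predicate,
// and the cross-side distance predicate stays in c for the join condition
```

Only the cross-side group has to survive as the join condition; everything else is evaluated before the join, which is exactly what shrinks the join's input.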
Dynamic Partition-Count Calculation
Partitioning data is a very frequent operation in Simba's spatial join computation, and the choice of partition count matters a great deal, so Simba optimizes how some of its partitioners pick this number. The calculation comes in two flavors: static and dynamic. When the dataset is fixed, computing a partition count is usually straightforward; it is harder for intermediate results whose size and shape cannot be known in advance. Simba's dynamic partition-count logic is as follows:
val (data_bounds, total_size) = {
  rdd.aggregate[(Bounds, Long)]((null, 0))((bound, data) => {
    val new_bound = if (bound._1 == null) {
      Bounds(data._1.coord, data._1.coord)
    } else {
      Bounds(bound._1.min.zip(data._1.coord).map(x => Math.min(x._1, x._2)),
        bound._1.max.zip(data._1.coord).map(x => Math.max(x._1, x._2)))
    }
    (new_bound, bound._2 + SizeEstimator.estimate(data._1))
  }, (left, right) => {
    val new_bound = {
      if (left._1 == null) right._1
      else if (right._1 == null) left._1
      else {
        Bounds(left._1.min.zip(right._1.min).map(x => Math.min(x._1, x._2)),
          left._1.max.zip(right._1.max).map(x => Math.max(x._1, x._2)))
      }
    }
    (new_bound, left._2 + right._2)
  })
}

val seed = System.currentTimeMillis()
// TODO a better sample strategy is needed
val sampled = if (total_size * sample_rate <= 0.05 * transfer_threshold) {
  rdd.mapPartitions(part => part.map(_._1)).collect()
} else if (total_size * sample_rate <= transfer_threshold) {
  rdd.sample(withReplacement = false, sample_rate, seed).map(_._1).collect()
} else {
  rdd.sample(withReplacement = false, transfer_threshold.toDouble / total_size, seed)
    .map(_._1).collect()
}
It first computes the global bounds of the data distribution and the total in-memory size, then picks a sampling strategy suited to that size: if the whole dataset is small enough it is collected outright, otherwise a sample rate is chosen so that as much data as possible is drawn without exceeding the transfer threshold. Finally, the sampled data and the global bounds are fed to the STR algorithm to determine the partitioning.
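The bounds aggregation in the first half of that snippet is simply an elementwise min/max fold over point coordinates. A self-contained sketch under that reading (this Bounds case class is a simplified stand-in for Simba's, and mergePoint models only the per-record half of the aggregate):

```scala
// Simplified stand-in for Simba's Bounds: per-dimension min/max of points seen so far.
case class Bounds(min: Array[Double], max: Array[Double])

// Fold one point into the running bounds; null plays the role of Simba's empty accumulator.
def mergePoint(b: Bounds, coord: Array[Double]): Bounds =
  if (b == null) Bounds(coord.clone(), coord.clone())
  else Bounds(b.min.zip(coord).map { case (a, c) => math.min(a, c) },
              b.max.zip(coord).map { case (a, c) => math.max(a, c) })

val pts = Seq(Array(1.0, 5.0), Array(3.0, 2.0), Array(-1.0, 4.0))
val bounds = pts.foldLeft(null: Bounds)(mergePoint)
// bounds.min is [-1.0, 2.0], bounds.max is [3.0, 5.0]
```

The combiner side of the aggregate does the same merge on two Bounds values, with the null checks handling partitions that happened to be empty.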