Spark SQL Catalyst Source Code Analysis: Physical Plan


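This article takes a close look at the Physical Plan stage of Spark SQL's Catalyst optimizer, covering SparkPlanner, the various join strategies, aggregation, and basic operators. Through source code analysis, it shows the steps that come before Spark SQL runs a Spark job.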


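/** Spark SQL source code analysis series */

The previous articles covered the Spark SQL execution flow in the spark sql package, along with the SqlParser, Analyzer, and Optimizer in the Catalyst package. This article covers the last plan in Catalyst, the Physical Plan. The physical plan is the final planning stage, and it is what Spark SQL produces immediately before it executes a Spark job.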

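As shown in the figure:

1. SparkPlanner

Picking up where the previous article left off: the Optimizer takes the Analyzed Logical Plan as input and produces the Optimized Logical Plan, and SparkPlanner then converts that Optimized Logical Plan into physical plans.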

    lazy val optimizedPlan = optimizer(analyzed)
    // TODO: Don't just pick the first one...
    lazy val sparkPlan = planner(optimizedPlan).next()

SparkPlanner's apply method returns an Iterator[PhysicalPlan].
SparkPlanner extends SparkStrategies, which in turn extends QueryPlanner.
SparkStrategies contains a set of concrete Strategies. Each of them extends the Strategy class defined in QueryPlanner, whose contract is to take a Logical Plan and return a sequence of Physical Plans.

  @transient
  protected[sql] val planner = new SparkPlanner

  protected[sql] class SparkPlanner extends SparkStrategies {
    val sparkContext: SparkContext = self.sparkContext

    val sqlContext: SQLContext = self

    def numPartitions = self.numShufflePartitions // number of shuffle partitions

    val strategies: Seq[Strategy] = // the collection of strategies
      CommandStrategy(self) ::
      TakeOrdered ::
      PartialAggregation ::
      LeftSemiJoin ::
      HashJoin ::
      InMemoryScans ::
      ParquetOperations ::
      BasicOperators ::
      CartesianProduct ::
      BroadcastNestedLoopJoin :: Nil

    // ... (remaining members omitted)
  }
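Because the planner currently takes only the first plan from the iterator (see the TODO above), the position of a strategy in this list decides which physical plan wins when several strategies match. The following is a minimal sketch of that effect, not Spark code: `NamedStrategy`, `hashJoin`, `cartesian`, and `firstPlan` are hypothetical names, and plans are represented as plain strings.

```scala
object StrategyOrderDemo {
  // Hypothetical stand-in for a Strategy: a name plus a function that
  // returns candidate plans (Nil when the strategy does not apply).
  final case class NamedStrategy(name: String, plan: String => Seq[String])

  // Applies only when the toy "query" contains an equality condition.
  val hashJoin: NamedStrategy =
    NamedStrategy("HashJoin", q => if (q.contains("=")) Seq(s"HashJoin($q)") else Nil)

  // Applies to every query, so it acts as a fallback.
  val cartesian: NamedStrategy =
    NamedStrategy("CartesianProduct", q => Seq(s"CartesianProduct($q)"))

  // Same shape as planner(plan).next(): flatten the candidates of all
  // strategies through a view and keep only the first one.
  def firstPlan(strategies: Seq[NamedStrategy], query: String): String =
    strategies.view.flatMap(s => s.plan(query)).iterator.next()
}
```

With `hashJoin :: cartesian :: Nil`, an equi-join query is planned by the hash join stand-in and the fallback is only reached when the first strategy returns Nil. Swapping the two entries makes the fallback win every time, which is one reason the more specific strategies are listed first.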

QueryPlanner is the base class of SparkPlanner. It defines the key building blocks: Strategy, planLater, and apply.

abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] {
  /** A list of execution strategies that can be used by the planner */
  def strategies: Seq[Strategy]

  /**
   * Given a [[plans.logical.LogicalPlan LogicalPlan]], returns a list of `PhysicalPlan`s that can
   * be used for execution. If this strategy does not apply to the given logical operation then an
   * empty list should be returned.
   */
  abstract protected class Strategy extends Logging {
    def apply(plan: LogicalPlan): Seq[PhysicalPlan] // takes a logical plan and returns Seq[PhysicalPlan]
  }

  /**
   * Returns a placeholder for a physical plan that executes `plan`. This placeholder will be
   * filled in automatically by the QueryPlanner using the other execution strategies that are
   * available.
   */
  // Returns a placeholder; QueryPlanner fills it in automatically by applying the other strategies.
  protected def planLater(plan: LogicalPlan) = apply(plan).next()

  def apply(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    // Obviously a lot to do here still...
    // Apply every Strategy to the plan (_(plan)) and flatten the results into a single
    // iterator over all the Physical Plans that the strategies produce.
    val iter = strategies.view.flatMap(_(plan)).toIterator
    assert(iter.hasNext, s"No plan for $plan")
    iter // return all candidate physical plans
  }
}
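To make the interplay between Strategy, planLater, and apply concrete, here is a small self-contained sketch of the same pattern. It is a simplification under stated assumptions rather than the Catalyst implementation: `MiniPlanner`, `ToyPlanner`, and the toy plan nodes (`Table`, `Where`, `Scan`, `FilterExec`) are hypothetical and exist only for this example.

```scala
object PlannerDemo {
  // Toy logical plan nodes (hypothetical, not Catalyst classes).
  sealed trait Logical
  final case class Table(name: String) extends Logical
  final case class Where(cond: String, child: Logical) extends Logical

  // Toy physical plan nodes (hypothetical).
  sealed trait Physical
  final case class Scan(name: String) extends Physical
  final case class FilterExec(cond: String, child: Physical) extends Physical

  abstract class MiniPlanner {
    // Same contract as in the article: return Nil when the strategy does not apply.
    trait Strategy { def apply(plan: Logical): Seq[Physical] }

    def strategies: Seq[Strategy]

    // Plans a child subtree by running the whole planner on it and
    // taking the first candidate.
    protected def planLater(plan: Logical): Physical = apply(plan).next()

    def apply(plan: Logical): Iterator[Physical] = {
      // Flatten the candidates of every strategy into one iterator.
      val iter = strategies.view.flatMap(s => s(plan)).iterator
      assert(iter.hasNext, s"No plan for $plan")
      iter
    }
  }

  object ToyPlanner extends MiniPlanner {
    object Scans extends Strategy {
      def apply(plan: Logical): Seq[Physical] = plan match {
        case Table(n) => Scan(n) :: Nil
        case _        => Nil
      }
    }
    object Filters extends Strategy {
      def apply(plan: Logical): Seq[Physical] = plan match {
        // The child is handed back to the planner instead of being planned here.
        case Where(c, child) => FilterExec(c, planLater(child)) :: Nil
        case _               => Nil
      }
    }
    val strategies: Seq[Strategy] = Scans :: Filters :: Nil
  }
}
```

Calling `PlannerDemo.ToyPlanner(Where("a > 1", Table("t"))).next()` first matches `Filters`, which delegates the `Table` child to `planLater`, which in turn is answered by `Scans`. Each strategy therefore only needs to know about the node types it handles.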

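Inheritance hierarchy:

2. Spark Plan

SparkPlan is the abstract class for the final physical execution plan in Catalyst, the one that results after all Strategies have been applied. Its only purpose is to execute the Spark job.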

 lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

prepareForExecution is in fact a RuleExecutor[SparkPlan], so the rules it runs are rules that operate on a SparkPlan, that is, Rule[SparkPlan].

 @transient
  protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
    val batches =
      Batch("Add exchange", Once, AddExchange(self)) :: // add shuffle (Exchange) operators where necessary
      Batch("Prepare Expressions", Once, new BindReferences[SparkPlan]) :: Nil // bind references
  }
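The batch mechanism can be illustrated with a stripped-down executor. This is only a sketch under simplifying assumptions: `Rule`, `Batch`, and `OnceRuleExecutor` below are hypothetical miniature versions written for this example, a plan is just a list of operator names, and every batch is run exactly once. The two toy rules stand in for AddExchange and BindReferences without reproducing their real logic.

```scala
object RuleExecutorDemo {
  // A rule is a plan-to-plan transformation (simplified, hypothetical).
  abstract class Rule[T] { def apply(plan: T): T }

  // A batch is a named group of rules.
  final case class Batch[T](name: String, rules: Rule[T]*)

  // Runs every batch a single time, in order, threading the plan through each rule.
  abstract class OnceRuleExecutor[T] {
    def batches: Seq[Batch[T]]
    def apply(plan: T): T =
      batches.foldLeft(plan) { (p, batch) =>
        batch.rules.foldLeft(p)((cur, rule) => rule(cur))
      }
  }

  // Toy plan: a flat list of operator names, executed left to right.
  type Plan = List[String]

  // Toy shuffle insertion: put an "Exchange" in front of every "Aggregate".
  object AddExchange extends Rule[Plan] {
    def apply(plan: Plan): Plan = plan.flatMap {
      case "Aggregate" => List("Exchange", "Aggregate")
      case op          => List(op)
    }
  }

  // Toy stand-in for reference binding: tag every operator as bound.
  object MarkBound extends Rule[Plan] {
    def apply(plan: Plan): Plan = plan.map(op => s"${op}[bound]")
  }

  val prepare: OnceRuleExecutor[Plan] = new OnceRuleExecutor[Plan] {
    val batches: Seq[Batch[Plan]] =
      Batch[Plan]("Add exchange", AddExchange) ::
      Batch[Plan]("Prepare Expressions", MarkBound) :: Nil
  }
}
```

Running `RuleExecutorDemo.prepare(List("Scan", "Aggregate"))` first inserts the exchange and then tags all three operators. The batch order matters here: the second batch sees the operators added by the first.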

SparkPlan extends QueryPlan[SparkPlan]. It defines outputPartitioning, requiredChildDistribution, and the execute method that starts Spark SQL execution.

abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging {
  self: Product =>

  // TODO: Move to `DistributedPlan`
  /** Specifies how data is partitioned across different nodes in the cluster. */
  def outputPartitioning: Partitioning = UnknownPartitioning(0) // TODO: WRONG WIDTH!
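How outputPartitioning and requiredChildDistribution could work together with the "Add exchange" batch can be sketched as follows. This is a toy model built on assumptions, not the Spark implementation: the types `Unspecified`, `Clustered`, `Unknown`, `HashPartitioned`, and the helper `ensure` are hypothetical, and the matching rule (keys must be identical) is deliberately simpler than anything a real planner would use.

```scala
object ExchangeDemo {
  // What a parent operator requires from its child's data layout (toy model).
  sealed trait Distribution
  case object Unspecified extends Distribution
  final case class Clustered(keys: Seq[String]) extends Distribution

  // How an operator's output is actually laid out (toy model).
  sealed trait Partitioning { def satisfies(required: Distribution): Boolean }
  case object Unknown extends Partitioning {
    def satisfies(required: Distribution): Boolean = required == Unspecified
  }
  final case class HashPartitioned(keys: Seq[String]) extends Partitioning {
    def satisfies(required: Distribution): Boolean = required match {
      case Unspecified  => true
      case Clustered(k) => k == keys // simplification: keys must match exactly
    }
  }

  sealed trait Plan { def outputPartitioning: Partitioning }
  final case class Scan(table: String) extends Plan {
    val outputPartitioning: Partitioning = Unknown
  }
  final case class Exchange(keys: Seq[String], child: Plan) extends Plan {
    val outputPartitioning: Partitioning = HashPartitioned(keys)
  }

  // Wrap the child in an Exchange only when its layout does not already
  // satisfy what the parent requires.
  def ensure(required: Distribution, child: Plan): Plan = required match {
    case Clustered(keys) if !child.outputPartitioning.satisfies(required) =>
      Exchange(keys, child)
    case _ => child
  }
}
```

In this model a `Scan` with unknown partitioning gets an `Exchange` when its parent needs data clustered by a key, while a child that is already hash partitioned on that key is left alone, which is the "only if necessary" behavior described for the "Add exchange" batch.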
