Spark SQL Physical Execution

Echo Lee.

Article link: https://blog.csdn.net/lisenyeahyeah/article/details/89331669

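Physical execution is the final step of Spark SQL query processing: the logical plan is converted into a physical plan (SparkPlan), which then runs directly on Spark Core and produces an RDD.
The whole Spark SQL pipeline is laid out clearly in QueryExecution, as the following code shows: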

// QueryExecution
// Run the optimizer
  lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
// Choose a physical plan
  lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
// For now, simply take the first of the candidate physical plans
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }
// Prepare the plan for execution
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
// Execute the physical plan
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
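
Because every stage above is a lazy val, nothing is computed until a later stage is requested, and each stage runs at most once. The following is a minimal sketch of that behavior in plain Scala with no Spark dependency. MiniQueryExecution, stage, and trace are hypothetical names used only for illustration, and plans are modelled as strings.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for QueryExecution (hypothetical, not Spark code).
class MiniQueryExecution(logical: String) {
  // Records the order in which stages finish evaluating.
  val trace: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Evaluates the by-name result first (forcing upstream stages), then logs the stage name.
  private def stage(name: String)(result: => String): String = {
    val value = result
    trace += name
    value
  }

  lazy val optimizedPlan: String = stage("optimize")(s"Optimized($logical)")
  lazy val sparkPlan: String = stage("plan")(s"Physical($optimizedPlan)")
  lazy val executedPlan: String = stage("prepare")(s"Prepared($sparkPlan)")
  lazy val toRdd: String = stage("execute")(s"RDD($executedPlan)")
}
```

Requesting toRdd on a fresh instance forces the four stages in pipeline order, and a second request does no further work. In a real session the corresponding stages can be inspected through df.queryExecution (optimizedPlan, sparkPlan, executedPlan).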

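Partitioning and Distribution

In Spark, partitioning has always been a major factor in performance, especially for joins and aggregations. For example, in a hash-based join of plan1 and plan2, both inputs must be hash-distributed on the same join keys. A broadcast join instead requires at least one side to have a broadcast distribution.
A Distribution describes how data is distributed; a Partitioning describes the partitioning operation.
Each Distribution is associated with Partitionings and defines how data is laid out across the nodes of the cluster.
[Figure: Distribution and Partitioning class hierarchy (image not available)]
There are six kinds of Distribution: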

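Distribution type	Description
UnspecifiedDistribution	No distribution specified
AllTuples	A single partition, e.g. as required by the GlobalLimit operator
BroadcastDistribution	Broadcast distribution: the data is broadcast to every node. The constructor parameter mode is the broadcast mode (BroadcastMode). For example, in a broadcast join the requiredChildDistribution contains BroadcastDistribution(mode)
ClusteredDistribution	The constructor parameter clustering is a Seq[Expression] that acts like a hash function: after clustering, rows with the same value are placed in the same partition. For example, the requiredChildDistribution of an aggregation operator such as SortAggregateExec is ClusteredDistribution(exprs)
HashClusteredDistribution	The constructor parameter expressions is a Seq[Expression] that acts like a hash function: rows with the same value are placed in the same partition. For example, the requiredChildDistribution of a sort-merge join is [HashClusteredDistribution(leftKeys), HashClusteredDistribution(rightKeys)]
OrderedDistribution	The constructor parameter ordering is a Seq[SortOrder]; the data is sorted by the values computed from ordering. For a global Sort operator, requiredChildDistribution is [OrderedDistribution(sortOrder)]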

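Partitioning represents the data partitioning operation shown in the figure above. Its important members and methods are:

trait Partitioning {
  // Number of partitions in the RDD output by this SparkPlan
  val numPartitions: Int
  // Whether this partitioning provides the required distribution. When it returns false,
  // the data usually has to be repartitioned (reorganized).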
  final def satisfies(required: Distribution): Boolean = {
    required.requiredNumPartitions.forall(_ == numPartitions) && satisfies0(required)
  }
  protected def satisfies0(required: Distribution): Boolean = required match {
    case UnspecifiedDistribution => true
    case AllTuples => numPartitions == 1
    case _ => false
  }
}
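
To make the satisfies check concrete, here is a simplified sketch in plain Scala. It is not the Spark implementation: clustering keys are plain column names instead of expressions, the requiredNumPartitions check is left out, and the HashPartitioning rule (its keys must all appear among the required clustering keys) is my reading of how Spark treats that case, stated here as an assumption.

```scala
object PartitioningSketch {
  // Simplified distributions; keys are column names rather than Expression objects.
  sealed trait Distribution
  case object UnspecifiedDistribution extends Distribution
  case object AllTuples extends Distribution
  final case class ClusteredDistribution(clustering: Seq[String]) extends Distribution

  sealed trait Partitioning {
    def numPartitions: Int
    // Base rule, mirroring satisfies0 in the excerpt above.
    def satisfies(required: Distribution): Boolean = required match {
      case UnspecifiedDistribution => true
      case AllTuples               => numPartitions == 1
      case _                       => false
    }
  }

  final case class UnknownPartitioning(numPartitions: Int) extends Partitioning

  final case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning {
    // Assumption: hashing on a subset of the clustering keys already co-locates equal rows.
    override def satisfies(required: Distribution): Boolean =
      super.satisfies(required) || (required match {
        case ClusteredDistribution(clustering) => keys.forall(clustering.contains)
        case _                                 => false
      })
  }
}
```

With this model, HashPartitioning(Seq("a"), 8) satisfies ClusteredDistribution(Seq("a", "b")), while UnknownPartitioning(8) does not, so a planner would have to repartition in the second case.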

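There are seven kinds of Partitioning:

Partitioning type	Description
UnknownPartitioning	Partitioning scheme is unknown (no partitioning applied)
RoundRobinPartitioning	Distributes rows round-robin across the numPartitions partitions
HashPartitioning	Hash-based partitioning
RangePartitioning	Range-based partitioning
PartitioningCollection	A collection of partitionings, used to describe the output of a physical operator
BroadcastPartitioning	Broadcast partitioning
DataSourcePartitioning	Partitioning reported by a V2 DataSource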

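Physical Plan (SparkPlan)

SparkPlan nodes correspond roughly one-to-one with LogicalPlan nodes. Like LogicalPlan, SparkPlan extends QueryPlan[PlanType <: QueryPlan[PlanType]]; physical operators are conventionally named XxxExec. The important methods of SparkPlan are:

Method	Description
outputPartitioning	Defines how the output data of the SparkPlan is partitioned
requiredChildDistribution	Defines the distribution the SparkPlan requires of each child
outputOrdering	Defines how the output data of the SparkPlan is ordered
requiredChildOrdering	Defines the ordering the SparkPlan requires of each child
doExecute	Executes the plan and produces an RDD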

abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
...
// How the output data of this SparkPlan is partitioned
def outputPartitioning: Partitioning = UnknownPartitioning(0)
// The distribution this SparkPlan requires of each child
def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)
// How the output data of this SparkPlan is ordered
def outputOrdering: Seq[SortOrder] = Nil
// The ordering this SparkPlan requires of each child
def requiredChildOrdering: Seq[Seq[SortOrder]] = Seq.fill(children.size)(Nil)
// Executes the plan
protected def doExecute(): RDD[InternalRow]
}

The following sections describe a few common SparkPlan operators:

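Projection (ProjectExec)

The doExecute method is simple: it calls the child's execute method and, within each partition, applies an UnsafeProjection built from projectList and the child's output attributes, returning an RDD[InternalRow]. The output ordering and partitioning are taken directly from the child.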

case class ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan)
  extends UnaryExecNode with CodegenSupport {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
  override def inputRDDs(): Seq[RDD[InternalRow]] = {
    child.asInstanceOf[CodegenSupport].inputRDDs()
  }
  protected override def doExecute(): RDD[InternalRow] = {
    child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
      val project = UnsafeProjection.create(projectList, child.output,
        subexpressionEliminationEnabled)
      project.initialize(index)
      iter.map(project)
    }
  }
  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
  override def outputPartitioning: Partitioning = child.outputPartitioning
}

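Filter (FilterExec)

The doExecute method of FilterExec is equally simple: it calls the child's execute method, builds a predicate from condition and the child's output attributes, evaluates it on each row, and returns the matching rows as an RDD[InternalRow].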

case class FilterExec(condition: Expression, child: SparkPlan)
  extends UnaryExecNode with CodegenSupport with PredicateHelper {
  protected override def doExecute(): RDD[InternalRow] = {
    val numOutputRows = longMetric("numOutputRows")
    child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
      val predicate = newPredicate(condition, child.output)
      predicate.initialize(0)
      iter.filter { row =>
        val r = predicate.eval(row)
        if (r) numOutputRows += 1
        r
      }
    }
  }
  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
  override def outputPartitioning: Partitioning = child.outputPartitioning
}
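
Both ProjectExec and FilterExec work row by row inside each partition, which is why they can pass the child's partitioning and ordering through unchanged. The sketch below illustrates this for filtering using plain Scala collections rather than RDDs. FilterSketch, Partitions, and filterPartitions are hypothetical names, and the counter only imitates the numOutputRows metric.

```scala
object FilterSketch {
  // A "dataset" is modelled as a sequence of partitions, each a sequence of rows.
  type Partitions[A] = Seq[Seq[A]]

  // Applies the predicate inside each partition and counts the rows that pass,
  // similar to numOutputRows in the excerpt above. Partition boundaries are untouched.
  def filterPartitions[A](input: Partitions[A])(predicate: A => Boolean): (Partitions[A], Long) = {
    var numOutputRows = 0L
    val output = input.map { partition =>
      partition.filter { row =>
        val keep = predicate(row)
        if (keep) numOutputRows += 1
        keep
      }
    }
    (output, numOutputRows)
  }
}
```

Filtering three partitions still yields three partitions, with rows in their original relative order; only the row count per partition changes.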

File Scan (FileSourceScanExec)

1. When FileSourceStrategy matches LogicalRelation(fsRelation: HadoopFsRelation, _, table, _), it generates a FileSourceScanExec.

//FileSourceStrategy
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters,
      l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
      ...
      val scan =
        FileSourceScanExec(
          fsRelation,
          outputAttributes,
          outputSchema,
          partitionKeyFilters.toSeq,
          bucketSet,
          dataFilters,
          table.map(_.identifier))
     ...
  }

2. Next, look at its doExecute method to see how inputRDD comes into play:

//FileSourceScanExec
  protected override def doExecute(): RDD[InternalRow] = {
    if (supportsBatch) {
      WholeStageCodegenExec(this)(codegenStageId = 0).execute()
    } else {
      val numOutputRows = longMetric("numOutputRows")
      if (needsUnsafeRowConversion) {
        inputRDD.mapPartitionsWithIndexInternal { (index, iter) =>
          val proj = UnsafeProjection.create(schema)
          proj.initialize(index)
          iter.map( r => {
            numOutputRows += 1
            proj(r)
          })
        }
      } else {
        inputRDD.map { r =>
          numOutputRows += 1
          r
        }
      }
    }
  }

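3. If the table is bucketed (that is, the data is split into buckets by the configured columns) and bucketing is enabled, inputRDD is built with createBucketedReadRDD; otherwise createNonBucketedReadRDD is used.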

//FileSourceScanExec
  private lazy val inputRDD: RDD[InternalRow] = {
    val readFile: (PartitionedFile) => Iterator[InternalRow] =
      relation.fileFormat.buildReaderWithPartitionValues(
        sparkSession = relation.sparkSession,
        dataSchema = relation.dataSchema,
        partitionSchema = relation.partitionSchema,
        requiredSchema = requiredSchema,
        filters = pushedDownFilters,
        options = relation.options,
        hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
    relation.bucketSpec match {
      case Some(bucketing) if relation.sparkSession.sessionState.conf.bucketingEnabled =>
        createBucketedReadRDD(bucketing, readFile, selectedPartitions, relation)
      case _ =>
        createNonBucketedReadRDD(readFile, selectedPartitions, relation)
    }
  }

4. Taking createBucketedReadRDD as an example, it ultimately builds and returns a FileScanRDD with one FilePartition per bucket.

//FileSourceScanExec
  private def createBucketedReadRDD(
      bucketSpec: BucketSpec,
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
   ...
    val filePartitions = Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
      FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Nil))
    }
    new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)
  }
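
The Seq.tabulate call above creates exactly one partition per bucket, including buckets that have no files. Here is a minimal sketch of that grouping step in plain Scala. It assumes the bucket id of each file is already known and passes it in as a pair; the real code derives it from the file name, which this sketch does not attempt to reproduce. BucketSketch and bucketedPartitions are hypothetical names.

```scala
object BucketSketch {
  final case class FilePartition(index: Int, files: Seq[String])

  // files: (bucketId, fileName) pairs; the bucket id is assumed to be known already.
  def bucketedPartitions(numBuckets: Int, files: Seq[(Int, String)]): Seq[FilePartition] = {
    // Group file names by bucket id, keeping the input order inside each group.
    val filesByBucket: Map[Int, Seq[String]] =
      files.groupBy(_._1).map { case (id, group) => id -> group.map(_._2) }
    // One partition per bucket id; buckets without files become empty partitions.
    Seq.tabulate(numBuckets) { bucketId =>
      FilePartition(bucketId, filesByBucket.getOrElse(bucketId, Nil))
    }
  }
}
```

Because partition i always holds bucket i, two tables bucketed the same way line up partition by partition, which is what allows a join on the bucket columns to avoid a shuffle.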

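5. Now look at the compute method of FileScanRDD. The current file is read through readFunction, which is the readFile function defined when inputRDD was constructed, that is, the function returned by FileFormat.buildReaderWithPartitionValues. In the Parquet implementation, buildReaderWithPartitionValues constructs a Parquet reader, which is consumed through RecordReaderIterator. The result is an Iterator[InternalRow].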

  override def compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow] = {
    val iterator = new Iterator[Object] with AutoCloseable {
    ...
      private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
      private[this] var currentFile: PartitionedFile = null
      private[this] var currentIterator: Iterator[Object] = null

      def hasNext: Boolean = {
        context.killTaskIfInterrupted()
        (currentIterator != null && currentIterator.hasNext) || nextIterator()
      }
      def next(): Object = {
        val nextElement = currentIterator.next()
        nextElement
      }

      private def readCurrentFile(): Iterator[InternalRow] = {
        try {
          readFunction(currentFile)
        } catch {
        }
      }
      
      private def nextIterator(): Boolean = {
        if (files.hasNext) {
          currentFile = files.next()
          InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)

          if (ignoreMissingFiles || ignoreCorruptFiles) {
            currentIterator = new NextIterator[Object] {
              private lazy val internalIter = readCurrentFile()
              override def getNext(): AnyRef = {
                try {
                  if (internalIter.hasNext) {
                    internalIter.next()
                  } else {
                    finished = true
                    null
                  }
                } catch {
                }
              }
              override def close(): Unit = {}
            }
          } else {
            currentIterator = readCurrentFile()
          }
          try {
            hasNext
          } catch {
          }
        } else {
          currentFile = null
          InputFileBlockHolder.unset()
          false
        }
      }
      override def close(): Unit = {
        incTaskInputMetricsBytesRead()
        InputFileBlockHolder.unset()
      }
    }
    context.addTaskCompletionListener[Unit](_ => iterator.close())
    iterator.asInstanceOf[Iterator[InternalRow]] // This is an erasure hack.
  }
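
Stripped of metrics and error handling, compute is an iterator that chains per-file iterators and opens the next file only when the current one is exhausted. The following sketch shows that core idea in plain Scala. FileIteratorSketch and scan are hypothetical names, files are identified by plain strings, and the sketch uses a loop where the original calls hasNext recursively.

```scala
object FileIteratorSketch {
  // Chains per-file iterators lazily: readFunction is called for a file only
  // after every row of the previous file has been consumed.
  def scan[A](files: Seq[String])(readFunction: String => Iterator[A]): Iterator[A] =
    new Iterator[A] {
      private val remaining = files.iterator
      private var current: Iterator[A] = Iterator.empty

      // Advances to the next file that has rows; returns false when no files are left.
      private def nextIterator(): Boolean = {
        var found = false
        while (!found && remaining.hasNext) {
          current = readFunction(remaining.next())
          found = current.hasNext
        }
        found
      }

      def hasNext: Boolean = current.hasNext || nextIterator()

      def next(): A = {
        if (!hasNext) throw new NoSuchElementException("no more rows")
        current.next()
      }
    }
}
```

The laziness matters for a scan: at most one file per task is open at a time, and files that are never reached (for example under a limit) are never opened.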

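Execution Strategies (SparkStrategy)

SparkStrategy is the bridge from logical plans to physical plans. Its inheritance hierarchy is as follows:
[Figure: SparkStrategy inheritance hierarchy (image not available)]
QueryPlanner applies every SparkStrategy to a LogicalPlan and outputs multiple candidate physical plans (PhysicalPlan). Its subclass SparkPlanner defines the list of strategies shown in the figure above; what each one does is described below.

The key method is plan:
1. Apply the strategies to the LogicalPlan to produce a set of candidate physical plans.
2. If a candidate contains SparkPlan nodes of type PlanLater, take the LogicalPlan held by each placeholder, call plan() on it recursively, and replace the PlanLater with the resulting physical plan of that child. (PlanLater does no work of its own; it only serves as a placeholder.)
3. Prune the physical plans (currently the plans passed in are returned as they are, with no pruning).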

//QueryPlanner
  def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    val candidates = strategies.iterator.flatMap(_(plan))
    val plans = candidates.flatMap { candidate =>
      val placeholders = collectPlaceholders(candidate)
      if (placeholders.isEmpty) {
        Iterator(candidate)
      } else {
        placeholders.iterator.foldLeft(Iterator(candidate)) {
          case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
            val childPlans = this.plan(logicalPlan)
            candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
              childPlans.map { childPlan =>
                candidateWithPlaceholders.transformUp {
                  case p if p.eq(placeholder) => childPlan
                }
              }
            }
        }
      }
    }
    val pruned = prunePlans(plans)
    assert(pruned.hasNext, s"No plan for $plan")
    pruned
  }
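
The interplay between strategies and PlanLater placeholders can be sketched with a toy planner in plain Scala. Everything here is hypothetical and heavily simplified: there are only two logical operators, each strategy is a plain function, and placeholders are resolved with the first plan of the child rather than with every combination as in the real plan method.

```scala
object PlannerSketch {
  sealed trait Logical
  final case class Scan(table: String) extends Logical
  final case class Filter(cond: String, child: Logical) extends Logical

  sealed trait Physical
  final case class ScanExec(table: String) extends Physical
  final case class FilterExec(cond: String, child: Physical) extends Physical
  // Placeholder for a child that has not been planned yet.
  final case class PlanLater(plan: Logical) extends Physical

  type Strategy = Logical => Seq[Physical]

  // Each strategy handles the nodes it knows and returns Nil for everything else.
  val scans: Strategy = {
    case Scan(t) => Seq(ScanExec(t))
    case _       => Nil
  }
  val basicOperators: Strategy = {
    case Filter(c, child) => Seq(FilterExec(c, PlanLater(child)))
    case _                => Nil
  }
  val strategies: Seq[Strategy] = Seq(scans, basicOperators)

  // Replaces each PlanLater with the first physical plan of its logical child.
  private def resolve(p: Physical): Physical = p match {
    case PlanLater(logical)   => plan(logical).next()
    case FilterExec(c, child) => FilterExec(c, resolve(child))
    case leaf                 => leaf
  }

  def plan(logical: Logical): Iterator[Physical] = {
    val candidates = strategies.iterator.flatMap(s => s(logical))
    val plans = candidates.map(resolve)
    assert(plans.hasNext, s"No plan for $logical")
    plans
  }
}
```

A strategy therefore only has to translate the node it matches and can leave its children to whichever strategy recognizes them.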

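As the figure above shows, SparkStrategy has several commonly used implementations:

Strategy	Description
DataSourceV2Strategy	Strategy for V2 data sources
DataSourceStrategy	Strategy for V1 data sources
FileSourceStrategy	Strategy for scanning file-based data sources
Aggregation	Aggregation strategy
Window	Window function strategy
JoinSelection	Join selection strategy
InMemoryScans	Strategy for scanning in-memory (cached) data
BasicOperators	Strategy that generates basic operators

Basic Operator Strategy (BasicOperators)

This strategy defines the mapping from logical to physical plans for basic operators such as Filter, Project, and Union.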

  object BasicOperators extends Strategy {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    ...
      case r: RunnableCommand => ExecutedCommandExec(r) :: Nil
      case plan @ ApplicationLeafNode(output) =>
        plan.buildLeafExecNode() :: Nil
      case plan @ ApplicationUnaryNode(child, output) =>
        plan.buildUnaryExecNode() :: Nil
      case plan @ ApplicationBinaryNode(left, right, output) =>
        plan.buildBinaryExecNode() :: Nil
      case logical.Sort(sortExprs, global, child) =>
        execution.SortExec(sortExprs, global, planLater(child)) :: Nil
      case logical.Project(projectList, child) =>
        execution.ProjectExec(projectList, planLater(child)) :: Nil
      case logical.Filter(condition, child) =>
        execution.FilterExec(condition, planLater(child)) :: Nil
      case logical.LocalLimit(IntegerLiteral(limit), child) =>
        execution.LocalLimitExec(limit, planLater(child)) :: Nil
      case logical.GlobalLimit(IntegerLiteral(limit), child) =>
        execution.GlobalLimitExec(limit, planLater(child)) :: Nil
      case logical.Union(unionChildren) =>
        execution.UnionExec(unionChildren.map(planLater)) :: Nil
    ...
    }
  }

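File Source Scan Strategy (FileSourceStrategy)

When a PhysicalOperation over a LogicalRelation node is matched, the strategy builds a FileSourceScanExec and then adds a filter (FilterExec) and a projection (ProjectExec) on top of it where needed.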

//FileSourceStrategy
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters,
      l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
      ...
      val scan =
        FileSourceScanExec(
          fsRelation,
          outputAttributes,
          outputSchema,
          partitionKeyFilters.toSeq,
          bucketSet,
          dataFilters,
          table.map(_.identifier))
      val afterScanFilter = afterScanFilters.toSeq.reduceOption(expressions.And)
      val withFilter = afterScanFilter.map(execution.FilterExec(_, scan)).getOrElse(scan)
      val withProjections = if (projects == withFilter.output) {
        withFilter
      } else {
        execution.ProjectExec(projects, withFilter)
      }
      withProjections :: Nil
    case _ => Nil
  }

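Preparing for Execution

The physical plan produced by the strategies cannot be executed directly. It first goes through prepareForExecution, which yields the final plan to execute. preparations defines five rules:

Rule	Description
PlanSubqueries	Handles physical planning of subqueries
EnsureRequirements	Ensures partitioning and ordering requirements are met
CollapseCodegenStages	Whole-stage code generation
ReuseExchange	Reuses Exchange nodes
ReuseSubquery	Reuses subqueries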

// QueryExecution
  protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
    preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
  }

  protected def preparations: Seq[Rule[SparkPlan]] = Seq(
    PlanSubqueries(sparkSession), // handles physical planning of subqueries
    EnsureRequirements(sparkSession.sessionState.conf), // ensures partitioning and ordering are correct
    CollapseCodegenStages(sparkSession.sessionState.conf), // whole-stage code generation
    ReuseExchange(sparkSession.sessionState.conf), // reuses Exchange nodes
    ReuseSubquery(sparkSession.sessionState.conf)) // reuses subqueries
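
The foldLeft in prepareForExecution applies the rules strictly in list order, each rule receiving the output of the previous one. The sketch below shows why that order matters, using a toy plan modelled as a flat list of operator names. The two rules are hypothetical stand-ins loosely inspired by EnsureRequirements and ReuseExchange, not their actual logic.

```scala
object PreparationSketch {
  type Plan = List[String] // a plan is modelled as a flat list of operator names
  type Rule = Plan => Plan

  // Stand-in for EnsureRequirements: puts an Exchange below every HashAggregate.
  val addExchange: Rule = plan => plan.flatMap {
    case "HashAggregate" => List("HashAggregate", "Exchange")
    case op              => List(op)
  }

  // Stand-in for ReuseExchange: every Exchange after the first becomes a reuse.
  val reuseExchange: Rule = plan => {
    var seen = false
    plan.map {
      case "Exchange" if seen => "ReusedExchange"
      case "Exchange" =>
        seen = true
        "Exchange"
      case op => op
    }
  }

  // Same shape as prepareForExecution: thread the plan through the rules in order.
  def prepareForExecution(plan: Plan, rules: Seq[Rule]): Plan =
    rules.foldLeft(plan) { case (current, rule) => rule(current) }
}
```

Running reuseExchange before addExchange finds nothing to reuse, because the Exchange nodes do not exist yet. This is consistent with ReuseExchange appearing after EnsureRequirements in the preparations list.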

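The rest of this section focuses on the EnsureRequirements rule.

EnsureRequirements Rule

EnsureRequirements makes sure that partitioning and ordering are correct: if the partitioning or ordering of the input data does not meet what the current node needs, EnsureRequirements adds shuffle or sort operations to the physical plan. Start with its apply method:
1. If the node is a ShuffleExchangeExec whose target is a HashPartitioning, and the child's output partitioning is a semantically equal HashPartitioning, the shuffle is redundant and the node is replaced by its child; otherwise the ShuffleExchangeExec is kept unchanged. Every other operator goes through ensureDistributionAndOrdering.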

//EnsureRequirements
  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    // TODO: remove this after we create a physical operator for `RepartitionByExpression`.
    case operator @ ShuffleExchangeExec(upper: HashPartitioning, child, _) =>
      child.outputPartitioning match {
        case lower: HashPartitioning if upper.semanticEquals(lower) => child
        case _ => operator
      }
    case operator: SparkPlan =>
      ensureDistributionAndOrdering(reorderJoinPredicates(operator))
  }

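2. Get the distribution and ordering that the current node requires of its children. If a child's output partitioning satisfies the required distribution, the child is kept as is. Otherwise, if a BroadcastDistribution is required, a BroadcastExchangeExec(mode, child) node is added; in all other cases a ShuffleExchangeExec(distribution.createPartitioning(numPartitions), child) node is added (numPartitions can be configured through spark.sql.shuffle.partitions, which defaults to 200).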

//EnsureRequirements.ensureDistributionAndOrdering
    val requiredChildDistributions: Seq[Distribution] = operator.requiredChildDistribution
    val requiredChildOrderings: Seq[Seq[SortOrder]] = operator.requiredChildOrdering
    var children: Seq[SparkPlan] = operator.children
    children = children.zip(requiredChildDistributions).map {
      case (child, distribution) if child.outputPartitioning.satisfies(distribution) =>
        child
      case (child, BroadcastDistribution(mode)) =>
        BroadcastExchangeExec(mode, child)
      case (child, distribution) =>
        val numPartitions = distribution.requiredNumPartitions
          .getOrElse(defaultNumPreShufflePartitions)
        ShuffleExchangeExec(distribution.createPartitioning(numPartitions), child)
    }
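
The three cases in this step (already satisfied, broadcast required, shuffle required) can be sketched on a toy plan model in plain Scala. All types here are hypothetical simplifications: a child's partitioning is just an optional list of key names, and "satisfies" means an exact key match.

```scala
object EnsureSketch {
  sealed trait Distribution
  case object Unspecified extends Distribution
  case object Broadcast extends Distribution
  final case class Clustered(keys: Seq[String]) extends Distribution

  sealed trait Plan { def partitionedBy: Option[Seq[String]] } // None = unknown partitioning
  final case class Scan(table: String, partitionedBy: Option[Seq[String]] = None) extends Plan
  final case class ShuffleExchange(keys: Seq[String], numPartitions: Int, child: Plan) extends Plan {
    def partitionedBy: Option[Seq[String]] = Some(keys)
  }
  final case class BroadcastExchange(child: Plan) extends Plan {
    def partitionedBy: Option[Seq[String]] = None
  }

  // Simplified check: a clustered requirement is met only by an exact key match.
  def satisfies(child: Plan, required: Distribution): Boolean = required match {
    case Unspecified     => true
    case Broadcast       => child.isInstanceOf[BroadcastExchange]
    case Clustered(keys) => child.partitionedBy.contains(keys)
  }

  // Keeps children that already satisfy their requirement and wraps the rest in an exchange.
  def ensure(children: Seq[Plan], required: Seq[Distribution], defaultPartitions: Int = 200): Seq[Plan] =
    children.zip(required).map {
      case (child, dist) if satisfies(child, dist) => child
      case (child, Broadcast)                      => BroadcastExchange(child)
      case (child, Clustered(keys))                => ShuffleExchange(keys, defaultPartitions, child)
      case (child, Unspecified)                    => child // unreachable: always satisfied
    }
}
```

For a join on id where only the left side is already partitioned by id, ensure leaves the left child alone and wraps the right child in a ShuffleExchange with the default 200 partitions.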

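3. When a node has two or more children whose partition counts differ, ShuffleExchangeExec nodes are likewise created so that all of them end up with the same number of partitions.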

//EnsureRequirements.ensureDistributionAndOrdering
val childrenIndexes = requiredChildDistributions.zipWithIndex.filter {
      case (UnspecifiedDistribution, _) => false
      case (_: BroadcastDistribution, _) => false
      case _ => true
    }.map(_._2)
    val childrenNumPartitions =
      childrenIndexes.map(children(_).outputPartitioning.numPartitions).toSet
    if (childrenNumPartitions.size > 1) {
      val requiredNumPartitions = {
        val numPartitionsSet = childrenIndexes.flatMap {
          index => requiredChildDistributions(index).requiredNumPartitions
        }.toSet
        numPartitionsSet.headOption
      }
      val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
      children = children.zip(requiredChildDistributions).zipWithIndex.map {
        case ((child, distribution), index) if childrenIndexes.contains(index) =>
          if (child.outputPartitioning.numPartitions == targetNumPartitions) {
            child
          } else {
            val defaultPartitioning = distribution.createPartitioning(targetNumPartitions)
            child match {
              case ShuffleExchangeExec(_, c, _) => ShuffleExchangeExec(defaultPartitioning, c)
              case _ => ShuffleExchangeExec(defaultPartitioning, child)
            }
          }
        case ((child, _), _) => child
      }
    }
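
The alignment logic reduces to picking a target partition count and reshuffling only the children that differ from it. Here is a minimal sketch in plain Scala, assuming every child takes part in the alignment (the real code first excludes children with unspecified or broadcast requirements). AlignSketch, Child, and align are hypothetical names, and a reshuffle is represented by a boolean flag instead of a new exchange node.

```scala
object AlignSketch {
  final case class Child(name: String, numPartitions: Int, reshuffled: Boolean = false)

  // If the children disagree on partition count, reshuffle the ones that differ from
  // the target: the required count if one is given, otherwise the maximum.
  def align(children: Seq[Child], requiredNumPartitions: Option[Int] = None): Seq[Child] = {
    val counts = children.map(_.numPartitions).toSet
    if (counts.size <= 1) children
    else {
      val target = requiredNumPartitions.getOrElse(counts.max)
      children.map { c =>
        if (c.numPartitions == target) c
        else c.copy(numPartitions = target, reshuffled = true)
      }
    }
  }
}
```

Choosing the maximum as the default target means the child that already has the most partitions is left untouched, so only the smaller inputs pay for an extra shuffle.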

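4. Coordinate post-shuffle partitions with ExchangeCoordinator. Enabling withCoordinator requires setting spark.sql.adaptive.enabled to true; it is currently not supported for Structured Streaming (SS).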

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
          "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }

   val withCoordinator =
      if (adaptiveExecutionEnabled && supportsCoordinator) {
        val coordinator =
          new ExchangeCoordinator(
            targetPostShuffleInputSize,
            minNumPostShufflePartitions)
        children.zip(requiredChildDistributions).map {
          case (e: ShuffleExchangeExec, _) =>
            e.copy(coordinator = Some(coordinator))
          case (child, distribution) =>
            val targetPartitioning = distribution.createPartitioning(defaultNumPreShufflePartitions)
            ShuffleExchangeExec(targetPartitioning, child, Some(coordinator))
        }
      } else {
        children
      }

