Spark-SQL Physical Execution

Physical execution is the final step of the Spark-SQL pipeline: the optimized logical plan is converted into a physical plan (SparkPlan), which can then run directly on Spark Core to produce an RDD.
The whole Spark-SQL execution flow is laid out quite clearly in QueryExecution, as the following code shows:

//QueryExecution
//run the optimizer
  lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
//pick a physical plan
  lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
//currently this simply takes the first of the candidate physical plans
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }
//prepare for execution
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
//execute the physical plan
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
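
These lazy vals can be inspected from any Dataset through its queryExecution field. A minimal sketch (the query and column names are made up for illustration; assumes an active SparkSession named spark):

  val df = spark.range(100).selectExpr("id", "id % 10 AS key")
  val qe = df.queryExecution
  println(qe.optimizedPlan)   // optimized LogicalPlan
  println(qe.sparkPlan)       // SparkPlan chosen by the planner
  println(qe.executedPlan)    // SparkPlan after prepareForExecution
  val rdd = qe.toRdd          // RDD[InternalRow]; building it does not run a job by itself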

Partitioning and Distribution

In Spark, partitioning has always been a major performance factor, especially for joins and aggregations. In a hash join of plan1 and plan2, for example, both sides must be hash-distributed on the same keys; for a broadcast join, at least one side must be distributed as a broadcast variable.
Distribution describes how the data is distributed; Partitioning describes the partitioning operation.
A Partitioning is associated with a Distribution, and together they define how data is laid out across the nodes of the cluster.
[Figure: class hierarchy of Distribution and Partitioning]
There are six kinds of Distribution:

UnspecifiedDistribution: no particular distribution is specified.
AllTuples: a single partition containing all tuples, e.g. required by the GlobalLimit operator.
BroadcastDistribution: broadcast distribution; the data is broadcast to every node. The constructor parameter mode is a BroadcastMode; e.g. in a broadcast join the requiredChildDistribution is [BroadcastDistribution(mode)].
ClusteredDistribution: the constructor parameter clustering is a Seq[Expression] that acts like a hash function; after applying clustering, rows with the same value land in the same partition. E.g. the requiredChildDistribution of the SortAggregateExec aggregation operator is ClusteredDistribution(exprs).
HashClusteredDistribution: the constructor parameter expressions is a Seq[Expression] that acts like a hash function; after applying expressions, rows with the same value land in the same partition. E.g. for a sort-merge join the requiredChildDistribution is [HashClusteredDistribution(leftKeys), HashClusteredDistribution(rightKeys)].
OrderedDistribution: the constructor parameter ordering is a Seq[SortOrder]; the data is sorted by the result of evaluating ordering. For the global Sort operator, requiredChildDistribution is [OrderedDistribution(sortOrder)].

Partitioning represents the partitioning operation on the data (see the class hierarchy figure above). Its most important members and methods are:

trait Partitioning {
  //number of partitions of the RDD produced by this SparkPlan
  val numPartitions: Int
  //whether this partitioning can provide the required data distribution; when it cannot,
  //this returns false and the data generally has to be repartitioned (reorganized)
  final def satisfies(required: Distribution): Boolean = {
    required.requiredNumPartitions.forall(_ == numPartitions) && satisfies0(required)
  }
  protected def satisfies0(required: Distribution): Boolean = required match {
    case UnspecifiedDistribution => true
    case AllTuples => numPartitions == 1
    case _ => false
  }
}
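
A minimal sketch of how satisfies behaves (the attribute name and partition count are made up; assumes the Spark 2.x catalyst classes shown above): a child that is already hash-partitioned on the clustering key satisfies a ClusteredDistribution on that key, so no reshuffle is needed.

  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashPartitioning, UnknownPartitioning}
  import org.apache.spark.sql.types.IntegerType

  val key = AttributeReference("key", IntegerType)()      // hypothetical join/group key
  val required = ClusteredDistribution(Seq(key))          // what the operator asks of its child

  HashPartitioning(Seq(key), 200).satisfies(required)     // true: rows with equal key share a partition
  UnknownPartitioning(0).satisfies(required)              // false: a shuffle (repartition) is required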

There are seven kinds of Partitioning:

UnknownPartitioning: the partitioning is unknown (no specific partitioning).
RoundRobinPartitioning: distributes rows round-robin across partitions 1 to numPartitions.
HashPartitioning: hash-based partitioning.
RangePartitioning: range-based partitioning.
PartitioningCollection: a collection of partitionings that all describe the output of a physical operator.
BroadcastPartitioning: broadcast partitioning.
DataSourcePartitioning: partitioning reported by a V2 DataSource.
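
At the DataFrame API level a few of these partitionings can be produced directly with repartition; a small sketch (reusing the spark and df from the earlier sketch, column names are made up):

  import org.apache.spark.sql.functions.col
  val byRoundRobin = df.repartition(8)                    // -> RoundRobinPartitioning(8)
  val byHash = df.repartition(8, col("key"))              // -> HashPartitioning(key, 8)
  val byRange = df.repartitionByRange(8, col("key"))      // -> RangePartitioning(key ASC, 8)
  byHash.queryExecution.executedPlan.outputPartitioning   // should report a HashPartitioning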

Physical Plans (SparkPlan)

SparkPlan nodes correspond roughly one-to-one with LogicalPlan nodes. Like LogicalPlan, SparkPlan extends QueryPlan[PlanType <: QueryPlan[PlanType]], and its concrete classes follow the XXXExec naming convention. The important methods of SparkPlan are:

outputPartitioning: how this SparkPlan's output data is partitioned.
requiredChildDistribution: the data distribution this SparkPlan requires of its children.
outputOrdering: how this SparkPlan's output data is ordered.
requiredChildOrdering: the ordering this SparkPlan requires of its children.
doExecute: performs the execution and produces an RDD.
abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
...
//how this SparkPlan's output data is partitioned
def outputPartitioning: Partitioning = UnknownPartitioning(0)
//the data distribution this SparkPlan requires of its children
def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)
//how this SparkPlan's output data is ordered
def outputOrdering: Seq[SortOrder] = Nil
//the ordering this SparkPlan requires of its children
def requiredChildOrdering: Seq[Seq[SortOrder]] = Seq.fill(children.size)(Nil)
//perform the execution
protected def doExecute(): RDD[InternalRow]

Below are a few of the most common SparkPlan implementations.

Projection (ProjectExec)

Its doExecute method is straightforward: it calls the child's execute method and applies a projection over the child's output attributes to build the returned RDD[InternalRow]; both the output ordering and the output partitioning are taken directly from the child.

case class ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan)
  extends UnaryExecNode with CodegenSupport {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
  override def inputRDDs(): Seq[RDD[InternalRow]] = {
    child.asInstanceOf[CodegenSupport].inputRDDs()
  }
  protected override def doExecute(): RDD[InternalRow] = {
    child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
      val project = UnsafeProjection.create(projectList, child.output,
        subexpressionEliminationEnabled)
      project.initialize(index)
      iter.map(project)
    }
  }
  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
  override def outputPartitioning: Partitioning = child.outputPartitioning
}

Filter (FilterExec)

FilterExec's doExecute is also simple: it calls the child's execute method, builds a predicate from the condition and the child's output attributes, evaluates it against each row, and returns the filtered RDD[InternalRow].

case class FilterExec(condition: Expression, child: SparkPlan)
  extends UnaryExecNode with CodegenSupport with PredicateHelper {
  protected override def doExecute(): RDD[InternalRow] = {
    val numOutputRows = longMetric("numOutputRows")
    child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
      val predicate = newPredicate(condition, child.output)
      predicate.initialize(0)
      iter.filter { row =>
        val r = predicate.eval(row)
        if (r) numOutputRows += 1
        r
      }
    }
  }
  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
  override def outputPartitioning: Partitioning = child.outputPartitioning
}
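
A simple query that exercises both operators; a sketch (the plan shape described in the comment is indicative, not exact explain output):

  import org.apache.spark.sql.functions.col
  val q = spark.range(1000)
    .select(col("id"), (col("id") * 2).as("doubled"))   // becomes a ProjectExec
    .filter(col("id") > 100)                            // becomes a FilterExec
  q.explain()
  // The physical plan contains Project and Filter nodes (possibly fused into a single
  // WholeStageCodegen stage); each simply passes through the child's partitioning/ordering.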

File Scan (FileSourceScanExec)

1. When FileSourceStrategy matches a LogicalRelation(fsRelation: HadoopFsRelation, _, table, _), it produces a FileSourceScanExec:

//FileSourceStrategy
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters,
      l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
      ...
      val scan =
        FileSourceScanExec(
          fsRelation,
          outputAttributes,
          outputSchema,
          partitionKeyFilters.toSeq,
          bucketSet,
          dataFilters,
          table.map(_.identifier))
     ...
  }

2. Next, look at its doExecute method and at how inputRDD is produced:

//FileSourceScanExec
  protected override def doExecute(): RDD[InternalRow] = {
    if (supportsBatch) {
      WholeStageCodegenExec(this)(codegenStageId = 0).execute()
    } else {
      val numOutputRows = longMetric("numOutputRows")
      if (needsUnsafeRowConversion) {
        inputRDD.mapPartitionsWithIndexInternal { (index, iter) =>
          val proj = UnsafeProjection.create(schema)
          proj.initialize(index)
          iter.map( r => {
            numOutputRows += 1
            proj(r)
          })
        }
      } else {
        inputRDD.map { r =>
          numOutputRows += 1
          r
        }
      }
    }
  }
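
Whether the supportsBatch branch above is taken depends, for Parquet, on the vectorized reader and whole-stage codegen settings; a sketch of the relevant configuration (both default to true, and the schema must also consist of flat atomic types):

  // Both must be enabled for the Parquet format to report batch (vectorized) support.
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
  spark.conf.set("spark.sql.codegen.wholeStage", "true")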

3. If bucketing is configured (i.e. the data is split into buckets by the configured columns), createBucketedReadRDD is used; otherwise createNonBucketedReadRDD is used (a small bucketed-table example follows the snippet below).

//FileSourceScanExec
  private lazy val inputRDD: RDD[InternalRow] = {
    val readFile: (PartitionedFile) => Iterator[InternalRow] =
      relation.fileFormat.buildReaderWithPartitionValues(
        sparkSession = relation.sparkSession,
        dataSchema = relation.dataSchema,
        partitionSchema = relation.partitionSchema,
        requiredSchema = requiredSchema,
        filters = pushedDownFilters,
        options = relation.options,
        hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
    relation.bucketSpec match {
      case Some(bucketing) if relation.sparkSession.sessionState.conf.bucketingEnabled =>
        createBucketedReadRDD(bucketing, readFile, selectedPartitions, relation)
      case _ =>
        createNonBucketedReadRDD(readFile, selectedPartitions, relation)
    }
  }
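
A minimal sketch of writing a bucketed table so that a later scan takes the createBucketedReadRDD path (the table and column names are made up; bucketed reads also require spark.sql.sources.bucketing.enabled, which defaults to true):

  import org.apache.spark.sql.functions.col
  spark.range(1000)
    .withColumn("key", col("id") % 10)
    .write
    .bucketBy(4, "key")        // 4 buckets on column "key"
    .sortBy("key")
    .saveAsTable("bucketed_table")

  // Each of the 4 buckets becomes one FilePartition, so the scan outputs 4 partitions,
  // and joins/aggregations keyed on "key" can avoid an extra shuffle.
  spark.table("bucketed_table").queryExecution.executedPlan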

4. Taking createBucketedReadRDD as an example: it ultimately builds and returns a FileScanRDD.

//FileSourceScanExec
  private def createBucketedReadRDD(
      bucketSpec: BucketSpec,
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
   ...
    val filePartitions = Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
      FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Nil))
    }
    new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)
  }

5. Finally, look at FileScanRDD's compute method. The current file is read through readFunction, which is the readFile closure defined in inputRDD above, i.e. the FileFormat's buildReaderWithPartitionValues method. For Parquet, buildReaderWithPartitionValues builds a Parquet record reader that is iterated through a RecordReaderIterator, and an Iterator[InternalRow] is returned.

  override def compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow] = {
    val iterator = new Iterator[Object] with AutoCloseable {
    ...
      private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
      private[this] var currentFile: PartitionedFile = null
      private[this] var currentIterator: Iterator[Object] = null

      def hasNext: Boolean = {
        context.killTaskIfInterrupted()
        (currentIterator != null && currentIterator.hasNext) || nextIterator()
      }
      def next(): Object = {
        val nextElement = currentIterator.next()
        nextElement
      }

      private def readCurrentFile(): Iterator[InternalRow] = {
        try {
          readFunction(currentFile)
        } catch {
        }
      }
      
      private def nextIterator(): Boolean = {
        if (files.hasNext) {
          currentFile = files.next()
          InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)

          if (ignoreMissingFiles || ignoreCorruptFiles) {
            currentIterator = new NextIterator[Object] {
              private lazy val internalIter = readCurrentFile()
              override def getNext(): AnyRef = {
                try {
                  if (internalIter.hasNext) {
                    internalIter.next()
                  } else {
                    finished = true
                    null
                  }
                } catch {
                }
              }
              override def close(): Unit = {}
            }
          } else {
            currentIterator = readCurrentFile()
          }
          try {
            hasNext
          } catch {
          }
        } else {
          currentFile = null
          InputFileBlockHolder.unset()
          false
        }
      }
      override def close(): Unit = {
        incTaskInputMetricsBytesRead()
        InputFileBlockHolder.unset()
      }
    }
    context.addTaskCompletionListener[Unit](_ => iterator.close())
    iterator.asInstanceOf[Iterator[InternalRow]] // This is an erasure hack.
  }
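
The InputFileBlockHolder.set(...) call above is also what backs the input_file_name() SQL function; a small sketch of reading the source file back per row (the path is hypothetical):

  import org.apache.spark.sql.functions.input_file_name
  spark.read.parquet("/path/to/data")                    // hypothetical input path
    .select(input_file_name().as("source_file"))
    .show(5, truncate = false)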

Execution Strategies (SparkStrategy)

SparkStrategy is the bridge between the logical plan and the physical plan. Its class hierarchy is as follows:
[Figure: SparkStrategy class hierarchy]
QueryPlanner applies all of the SparkStrategy instances to a LogicalPlan and produces multiple candidate physical plans (PhysicalPlan). Its subclass SparkPlanner defines the concrete strategies; their roles are summarized in the table further below.
Its key method is plan():
1. Apply the strategies to the LogicalPlan to produce a collection of candidate physical plans.
2. If a candidate contains a SparkPlan of type PlanLater, fetch the corresponding LogicalPlan from the placeholder and recursively call plan() to replace the PlanLater with the child's physical plan (PlanLater is really just a placeholder here).
3. Prune the candidate plans (currently prunePlans simply returns the plans it is given, without any pruning).

//QueryPlanner
  def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    val candidates = strategies.iterator.flatMap(_(plan))
    val plans = candidates.flatMap { candidate =>
      val placeholders = collectPlaceholders(candidate)
      if (placeholders.isEmpty) {
        Iterator(candidate)
      } else {
        placeholders.iterator.foldLeft(Iterator(candidate)) {
          case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
            val childPlans = this.plan(logicalPlan)
            candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
              childPlans.map { childPlan =>
                candidateWithPlaceholders.transformUp {
                  case p if p.eq(placeholder) => childPlan
                }
              }
            }
        }
      }
    }
    val pruned = prunePlans(plans)
    assert(pruned.hasNext, s"No plan for $plan")
    pruned
  }
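
The planner can also be driven by hand to look at the candidates it generates; a sketch only (this goes through internal/unstable APIs such as sessionState, so it is for exploration rather than production use):

  import org.apache.spark.sql.catalyst.plans.logical.ReturnAnswer
  val qe = spark.range(10).filter("id > 5").queryExecution
  val candidates = spark.sessionState.planner.plan(ReturnAnswer(qe.optimizedPlan))
  candidates.foreach(println)   // QueryExecution itself just takes the first one via next()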

SparkStrategy has a number of common concrete strategies; their roles are:

DataSourceV2Strategy: strategy for V2 data sources.
DataSourceStrategy: strategy for V1 data sources.
FileSourceStrategy: scan strategy for file-based data sources.
Aggregation: aggregation strategy.
Window: window-function strategy.
JoinSelection: join strategy (selects the join implementation).
InMemoryScans: scan strategy for in-memory (cached) data.
BasicOperators: strategy that generates the basic operators.

Basic Operator Strategy (BasicOperators)

It defines the mapping from logical plan to physical plan for basic operators such as Filter, Project, and Union.

  object BasicOperators extends Strategy {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    ...
      case r: RunnableCommand => ExecutedCommandExec(r) :: Nil
      case plan @ ApplicationLeafNode(output) =>
        plan.buildLeafExecNode() :: Nil
      case plan @ ApplicationUnaryNode(child, output) =>
        plan.buildUnaryExecNode() :: Nil
      case plan @ ApplicationBinaryNode(left, right, output) =>
        plan.buildBinaryExecNode() :: Nil
      case logical.Sort(sortExprs, global, child) =>
        execution.SortExec(sortExprs, global, planLater(child)) :: Nil
      case logical.Project(projectList, child) =>
        execution.ProjectExec(projectList, planLater(child)) :: Nil
      case logical.Filter(condition, child) =>
        execution.FilterExec(condition, planLater(child)) :: Nil
      case logical.LocalLimit(IntegerLiteral(limit), child) =>
        execution.LocalLimitExec(limit, planLater(child)) :: Nil
      case logical.GlobalLimit(IntegerLiteral(limit), child) =>
        execution.GlobalLimitExec(limit, planLater(child)) :: Nil
      case logical.Union(unionChildren) =>
        execution.UnionExec(unionChildren.map(planLater)) :: Nil
    ...
    }
  }

File Data Source Scan Strategy (FileSourceStrategy)

When a PhysicalOperation over a LogicalRelation node is matched, a FileSourceScanExec is constructed and then wrapped with a filter (FilterExec) and a projection (ProjectExec) as needed:

//FileSourceStrategy
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters,
      l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
      ...
      val scan =
        FileSourceScanExec(
          fsRelation,
          outputAttributes,
          outputSchema,
          partitionKeyFilters.toSeq,
          bucketSet,
          dataFilters,
          table.map(_.identifier))
      val afterScanFilter = afterScanFilters.toSeq.reduceOption(expressions.And)
      val withFilter = afterScanFilter.map(execution.FilterExec(_, scan)).getOrElse(scan)
      val withProjections = if (projects == withFilter.output) {
        withFilter
      } else {
        execution.ProjectExec(projects, withFilter)
      }
      withProjections :: Nil
    case _ => Nil
  }

Preparation Before Execution

The physical plan produced by the strategies cannot be executed directly; it must first go through prepareForExecution, which produces the final executable physical plan. preparations defines five rules, with the following roles:

PlanSubqueries: plans special subqueries into physical plans.
EnsureRequirements: ensures partitioning and ordering requirements are met.
CollapseCodegenStages: whole-stage code generation.
ReuseExchange: reuses Exchange nodes.
ReuseSubquery: reuses subqueries.
//QueryExecution
  protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
    preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
  }
  
  protected def preparations: Seq[Rule[SparkPlan]] = Seq(
    PlanSubqueries(sparkSession),                          //plan special subqueries
    EnsureRequirements(sparkSession.sessionState.conf),    //ensure partitioning and ordering requirements
    CollapseCodegenStages(sparkSession.sessionState.conf), //whole-stage code generation
    ReuseExchange(sparkSession.sessionState.conf),         //reuse Exchange nodes
    ReuseSubquery(sparkSession.sessionState.conf))         //reuse subqueries
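
The effect of these rules is easiest to see by comparing sparkPlan (before preparation) with executedPlan (after); a sketch using an aggregation that needs a shuffle (the column names are made up):

  import org.apache.spark.sql.functions.col
  val agg = spark.range(100).withColumn("key", col("id") % 10).groupBy("key").count()
  println(agg.queryExecution.sparkPlan)     // no Exchange node yet
  println(agg.queryExecution.executedPlan)  // EnsureRequirements has inserted an Exchange
                                            // (hash partitioning on key) between the partial and
                                            // final aggregation, and CollapseCodegenStages has
                                            // wrapped stages in WholeStageCodegen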

The EnsureRequirements rule deserves a closer look.

The EnsureRequirements Rule

EnsureRequirements makes sure that partitioning and ordering requirements are satisfied: if the partitioning or ordering of the input data cannot satisfy what the current node needs, EnsureRequirements inserts shuffle or sort operators into the physical plan. First, its apply method:
1. When a ShuffleExchangeExec whose target partitioning is a HashPartitioning is matched, and its child already outputs a semantically equivalent HashPartitioning, the exchange is replaced by its child; every other operator goes through ensureDistributionAndOrdering.

//EnsureRequirements
  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    // TODO: remove this after we create a physical operator for `RepartitionByExpression`.
    case operator @ ShuffleExchangeExec(upper: HashPartitioning, child, _) =>
      child.outputPartitioning match {
        case lower: HashPartitioning if upper.semanticEquals(lower) => child
        case _ => operator
      }
    case operator: SparkPlan =>
      ensureDistributionAndOrdering(reorderJoinPredicates(operator))
  }

2. Fetch the distributions and orderings the current node requires of its children. If a child's output partitioning already satisfies the required distribution, the child is kept as-is; if a BroadcastDistribution(mode) is required, a BroadcastExchangeExec(mode, child) node is added; otherwise a ShuffleExchangeExec(distribution.createPartitioning(numPartitions), child) node is added (numPartitions can be configured via spark.sql.shuffle.partitions, default 200).

//EnsureRequirements.ensureDistributionAndOrdering
    val requiredChildDistributions: Seq[Distribution] = operator.requiredChildDistribution
    val requiredChildOrderings: Seq[Seq[SortOrder]] = operator.requiredChildOrdering
    var children: Seq[SparkPlan] = operator.children
    children = children.zip(requiredChildDistributions).map {
      case (child, distribution) if child.outputPartitioning.satisfies(distribution) =>
        child
      case (child, BroadcastDistribution(mode)) =>
        BroadcastExchangeExec(mode, child)
      case (child, distribution) =>
        val numPartitions = distribution.requiredNumPartitions
          .getOrElse(defaultNumPreShufflePartitions)
        ShuffleExchangeExec(distribution.createPartitioning(numPartitions), child)
    }
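
The default partition count used when a shuffle has to be added comes from spark.sql.shuffle.partitions; a sketch of changing it (assumes no Distribution in the plan pins an explicit partition count):

  spark.conf.set("spark.sql.shuffle.partitions", "64")
  // any ShuffleExchangeExec that EnsureRequirements now inserts will create
  // 64 post-shuffle partitions instead of the default 200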

3. When a node has two or more children and their partition counts do not line up, ShuffleExchangeExec nodes are also created to align them:

//EnsureRequirements.ensureDistributionAndOrdering
val childrenIndexes = requiredChildDistributions.zipWithIndex.filter {
      case (UnspecifiedDistribution, _) => false
      case (_: BroadcastDistribution, _) => false
      case _ => true
    }.map(_._2)
    val childrenNumPartitions =
      childrenIndexes.map(children(_).outputPartitioning.numPartitions).toSet
    if (childrenNumPartitions.size > 1) {
      val requiredNumPartitions = {
        val numPartitionsSet = childrenIndexes.flatMap {
          index => requiredChildDistributions(index).requiredNumPartitions
        }.toSet
        numPartitionsSet.headOption
      }
      val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
      children = children.zip(requiredChildDistributions).zipWithIndex.map {
        case ((child, distribution), index) if childrenIndexes.contains(index) =>
          if (child.outputPartitioning.numPartitions == targetNumPartitions) {
            child
          } else {
            val defaultPartitioning = distribution.createPartitioning(targetNumPartitions)
            child match {
              case ShuffleExchangeExec(_, c, _) => ShuffleExchangeExec(defaultPartitioning, c)
              case _ => ShuffleExchangeExec(defaultPartitioning, child)
            }
          }
        case ((child, _), _) => child
      }
    }

4. An ExchangeCoordinator is used to coordinate post-shuffle partitions. The withCoordinator path requires spark.sql.adaptive.enabled to be set to true; it is currently not supported for Structured Streaming.

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
          "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }
   val withCoordinator =
      if (adaptiveExecutionEnabled && supportsCoordinator) {
        val coordinator =
          new ExchangeCoordinator(
            targetPostShuffleInputSize,
            minNumPostShufflePartitions)
        children.zip(requiredChildDistributions).map {
          case (e: ShuffleExchangeExec, _) =>
            e.copy(coordinator = Some(coordinator))
          case (child, distribution) =>
            val targetPartitioning = distribution.createPartitioning(defaultNumPreShufflePartitions)
            ShuffleExchangeExec(targetPartitioning, child, Some(coordinator))
        }
      } else {
        children
      }
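
A sketch of the Spark 2.x configuration that enables this coordinator path (the size value is just an example):

  spark.conf.set("spark.sql.adaptive.enabled", "true")
  // target size of a post-shuffle partition; the coordinator merges small shuffle
  // partitions until they approach this size
  spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")  // 64 MB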
