SparkStrategy: logical to physical
Catalyst作为一个实现无关的查询优化框架,在优化后的逻辑执行计划到真正的物理执行计划这部分只提供了接口,没有提供像Analyzer和Optimizer那样的实现。
本文介绍的是Spark SQL组件各个物理执行计划的操作实现。把优化后的逻辑执行计划映射到物理执行操作类这部分由SparkStrategies类实现,内部基于Catalyst提供的Strategy接口,实现了一些策略,用于分辨logicalPlan子类并替换为合适的SparkPlan子类。
SparkPlan继承体系如下。接下里会具体介绍其子类的实现。
SparkPlan
主要三部分:LeafNode、UnaryNode、BinaryNode
各自的实现类:
提供四个需要子类重载的方法
-
-
- def outputPartitioning: Partitioning = UnknownPartitioning(0)
-
- def requiredChildDistribution: Seq[Distribution] =
- Seq.fill(children.size)(UnspecifiedDistribution)
-
- def execute(): RDD[Row]
- def executeCollect(): Array[Row] = execute().collect()
Distribution和Partitioning类用于表示数据分布情况。有以下几类,可以望文生义。
LeafNode
ExistingRdd
先介绍下Row和GenericRow的概念。
Row是一行output对应的数据,提供getXXX(i: Int)方法
- trait Row extends Seq[Any] with Serializable
支持数据类型包括Int, Long, Double, Float, Boolean, Short, Byte, String。支持按序数(ordinal)读取某一个列的值。读取前需要做isNullAt(i: Int)的判断。
对应的有一个MutableRow类,提供setXXX(i: Int, value: Any)方法。可以修改(set)某序数上的值
GenericRow是Row的一种方便实现,存的是一个数组
- class GenericRow(protected[catalyst] val values: Array[Any]) extends Row
所以对应的取值操作和判断是否为空操作会转化为数组上的定位取值操作。
它也有一个对应的GenericMutableRow类,可以修改(set)值。
ExistingRdd用于把绑定了case class的rdd的数据,转变为RDD[Row],同时反射提取出case class的属性(output)。转化过程的单例类和伴生对象如下:
- object ExistingRdd {
- def convertToCatalyst(a: Any): Any = a match {
- case s: Seq[Any] => s.map(convertToCatalyst)
- case p: Product => new GenericRow(p.productIterator.map(convertToCatalyst).toArray)
- case other => other
- }
-
- def productToRowRdd[A <: Product](data: RDD[A]): RDD[Row] = {
-
- data.map(r => new GenericRow(r.productIterator.map(convertToCatalyst).toArray): Row)
- }
-
- def fromProductRdd[A <: Product : TypeTag](productRdd: RDD[A]) = {
- ExistingRdd(ScalaReflection.attributesFor[A], productToRowRdd(productRdd))
- }
- }
-
- case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
- def execute() = rdd
- }
UnaryNode
Aggregate
隐式转换声明,针对本地分区的RDD,扩充了一些操作
-
- import org.apache.spark.rdd.PartitionLocalRDDFunctions._
Groups input data by`groupingExpressions` and computes the `aggregateExpressions` for each group.
@param child theinput data source.
- case class Aggregate(
- partial: Boolean,
- groupingExpressions: Seq[Expression],
- aggregateExpressions: Seq[NamedExpression],
- child: SparkPlan)(@transient sc: SparkContext)
在初始化的时候,partial这个参数用来标志本次Aggregate操作只在本地做,还是要去到符合groupExpression的其他partition上都做。该判断逻辑如下:
- override def requiredChildDistribution =
- if (partial) {
- UnspecifiedDistribution :: Nil
- } else {
-
- if (groupingExpressions == Nil) {
- AllTuples :: Nil
-
- } else {
- ClusteredDistribution(groupingExpressions) :: Nil
- }
- }
最重要的execute()方法:
- def execute() = attachTree(this, "execute") {
-
- val grouped = child.execute().mapPartitions { iter =>
- val buildGrouping = new Projection(groupingExpressions)
- iter.map(row => (buildGrouping(row), row.copy()))
- }.groupByKeyLocally()
-
- val result = grouped.map { case (group, rows) =>
-
-
- val aggImplementations = createAggregateImplementations()
-
-
- val aggFunctions = aggImplementations.flatMap(_ collect { case f: AggregateFunction => f })
-
- rows.foreach { row =>
- aggFunctions.foreach(_.update(row))
- }
- buildRow(aggImplementations.map(_.apply(group)))
- }
-
-
- if (groupingExpressions.isEmpty && result.count == 0) {
-
- val aggImplementations = createAggregateImplementations()
- sc.makeRDD(buildRow(aggImplementations.map(_.apply(null))) :: Nil)
- } else {
- result
- }
- }
AggregateExpression继承体系如下,这部分代码在Catalyst expressions包的aggregates.scala里:
他的第一类实现AggregateFunction,带一个update(input: Row)操作。子类的update操作是实际对Row执行变化。
DebugNode
DebugNode是把传进来child SparkPlan调用execute()执行,然后把结果childRdd逐个输出查看
- case class DebugNode(child: SparkPlan) extends UnaryNode
Exchange
- case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends UnaryNode
为某个SparkPlan,实施新的分区策略。
execute()方法:
- def execute() = attachTree(this , "execute") {
- newPartitioning match {
- case HashPartitioning(expressions, numPartitions) =>
-
- val rdd = child.execute().mapPartitions { iter =>
- val hashExpressions = new MutableProjection(expressions)
- val mutablePair = new MutablePair[Row, Row]()
- iter.map(r => mutablePair.update(hashExpressions(r), r))
- }
- val part = new HashPartitioner(numPartitions)
-
- val shuffled = new ShuffledRDD[Row, Row, MutablePair[Row, Row]](rdd, part)
- shuffled.setSerializer(new SparkSqlSerializer(new SparkConf(false)))
- shuffled.map(_._2)
-
- case RangePartitioning(sortingExpressions, numPartitions) =>
-
- implicit val ordering = new RowOrdering(sortingExpressions)
-
- val rdd = child.execute().mapPartitions { iter =>
- val mutablePair = new MutablePair[Row, Null](null, null)
- iter.map(row => mutablePair.update(row, null))
- }
- val part = new RangePartitioner(numPartitions, rdd, ascending = true)
- val shuffled = new ShuffledRDD[Row, Null, MutablePair[Row, Null]](rdd, part)
- shuffled.setSerializer(new SparkSqlSerializer(new SparkConf(false)))
- shuffled.map(_._1)
-
- case SinglePartition =>
- child.execute().coalesce(1, shuffle = true)
-
- case _ => sys.error(s"Exchange not implemented for $newPartitioning")
-
- }
- }
Filter
- case class Filter(condition: Expression, child: SparkPlan) extends UnaryNode
-
- def execute() = child.execute().mapPartitions { iter =>
- iter.filter(condition.apply(_).asInstanceOf[Boolean])
- }
Generate
- case class Generate(
- generator: Generator,
- join: Boolean,
- outer: Boolean,
- child: SparkPlan)
- extends UnaryNode
首先,Generator是表达式的子类,继承结构如下
Generator的作用是把input的row处理后输出0个或多个rows,makeOutput()的策略由子类实现。
Explode类做法是将输入的input array里的每一个value(可能是ArrayType,可能是MapType),变成一个GenericRow(Array(v)),输出就是一个
回到Generate操作,
join布尔值用于指定最后输出的结果是否要和输入的原tuple显示做join
outer布尔值只有在join为true的时候才生效,且outer为true的时候,每个input的row都至少会被作为一次output
总体上,Generate操作类似FP里的flatMap操作
- def execute() = {
- if (join) {
- child.execute().mapPartitions { iter =>
- val nullValues = Seq.fill(generator.output.size)(Literal(null))
-
- val outerProjection =
- new Projection(child.output ++ nullValues, child.output)
-
- val joinProjection =
- new Projection(child.output ++ generator.output, child.output ++ generator.output)
- val joinedRow = new JoinedRow
-
- iter.flatMap {row =>
- val outputRows = generator(row)
- if (outer && outputRows.isEmpty) {
- outerProjection(row) :: Nil
- } else {
- outputRows.map(or => joinProjection(joinedRow(row, or)))
- }
- }
- }
- } else {
- child.execute().mapPartitions(iter => iter.flatMap(generator))
- }
- }
Project
- case class Project(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode
project的执行:
- def execute() = child.execute().mapPartitions { iter =>
- @transient val reusableProjection = new MutableProjection(projectList)
- iter.map(reusableProjection)
- }
MutableProjection类是Row => Row的继承类,它构造的时候接收一个Seq[Expression],还允许接收一个inputSchema: Seq[Attribute]。MutableProjection用于根据表达式(和Schema,如果有Schema的话)把Row映射成新的Row,改变内部的column。
Sample
- case class Sample(fraction: Double, withReplacement: Boolean, seed: Int, child: SparkPlan) extends UnaryNode
-
- def execute() = child.execute().sample(withReplacement, fraction, seed)
RDD的sample操作:
- def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
- require(fraction >= 0.0, "Invalid fraction value: " + fraction)
- if (withReplacement) {
- new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), seed)
- } else {
- new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), seed)
- }
- }
生成的PartitionwiseSampledRDD会在父RDD的每个partition都选取样本。
PossionSampler和BernoulliSampler是RandomSampler的两种实现。
Sort
- case class Sort(
- sortOrder: Seq[SortOrder],
- global: Boolean,
- child: SparkPlan)
- extends UnaryNode
对分布有要求:
- override def requiredChildDistribution =
- if (global) OrderedDistribution(sortOrder) :: Nil
- else UnspecifiedDistribution :: Nil
SortOrder类是UnaryExpression的实现,定义了tuple排序的策略(递增或递减)。该类只是为child expression们声明了排序策略。之所以继承Expression,是为了能影响到子树。
- case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression
-
- @transient
- lazy val ordering = new RowOrdering(sortOrder)
-
- def execute() = attachTree(this, "sort") {
-
- child.execute()
- .mapPartitions(iterator => iterator.map(_.copy()).toArray.sorted(ordering).iterator,
- preservesPartitioning = true)
- }
有一次隐式转换过程,.sorted是array自带的一个方法,因为ordering是RowOrdering类,该类继承Ordering[T],是scala.math.Ordering[T]类。
StopAfter
- case class StopAfter(limit: Int, child: SparkPlan)(@transient sc: SparkContext) extends UnaryNode
StopAfter实质上是一次limit操作
- override def executeCollect() = child.execute().map(_.copy()).take(limit)
- def execute() = sc.makeRDD(executeCollect(), 1)
makeRDD实质上调用的是new ParallelCollectionRDD[T]的操作,此处的seq为tabke()返回的Array[T],而numSlices为1:
-
- def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
- new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
- }
TopK
- case class TopK(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
- (@transient sc: SparkContext) extends UnaryNode
可以把TopK理解为类似Sort和StopAfter的结合,
- @transient
- lazy val ordering = new RowOrdering(sortOrder)
-
- override def executeCollect() = child.execute().map(_.copy()).takeOrdered(limit)(ordering)
- def execute() = sc.makeRDD(executeCollect(), 1)
takeOrdered(num)(sorting)实际触发的是RDD的top()()操作
- def top(num: Int)(implicit ord: Ordering[T]): Array[T] = {
- mapPartitions { items =>
- val queue = new BoundedPriorityQueue[T](num)
- queue ++= items
- Iterator.single(queue)
- }.reduce { (queue1, queue2) =>
- queue1 ++= queue2
- queue1
- }.toArray.sorted(ord.reverse)
- }
BoundedPriorityQueue是Spark util包里的一个数据结构,包装了PriorityQueue,他的优化点在于限制了优先队列的大小,比如在添加元素的时候,如果超出size了,就会进行对堆进行比较和替换。适合TopK的场景。
所以每个partition在排序前,只会产生一个num大小的BPQ(最后只需要选Top num个),合并之后才做真正的排序,最后选出前num个。
BinaryNode
BroadcastNestedLoopJoin
- case class BroadcastNestedLoopJoin(
- streamed: SparkPlan, broadcast: SparkPlan, joinType: JoinType, condition: Option[Expression])
- (@transient sc: SparkContext)
- extends BinaryNode
比较复杂的一次join操作,操作如下,
- def execute() = {
-
- val broadcastedRelation =
- sc.broadcast(broadcast.execute().map(_.copy()).collect().toIndexedSeq)
-
- val streamedPlusMatches = streamed.execute().mapPartitions { streamedIter =>
- val matchedRows = new mutable.ArrayBuffer[Row]
- val includedBroadcastTuples =
- new mutable.BitSet(broadcastedRelation.value.size)
- val joinedRow = new JoinedRow
-
- streamedIter.foreach { streamedRow =>
- var i = 0
- var matched = false
-
- while (i < broadcastedRelation.value.size) {
-
- val broadcastedRow = broadcastedRelation.value(i)
- if (boundCondition(joinedRow(streamedRow, broadcastedRow)).asInstanceOf[Boolean]) {
- matchedRows += buildRow(streamedRow ++ broadcastedRow)
- matched = true
- includedBroadcastTuples += i
- }
- i += 1
- }
-
- if (!matched && (joinType == LeftOuter || joinType == FullOuter)) {
- matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
- }
- }
- Iterator((matchedRows, includedBroadcastTuples))
- }
-
- val includedBroadcastTuples = streamedPlusMatches.map(_._2)
- val allIncludedBroadcastTuples =
- if (includedBroadcastTuples.count == 0) {
- new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
- } else {
- streamedPlusMatches.map(_._2).reduce(_ ++ _)
- }
-
- val rightOuterMatches: Seq[Row] =
- if (joinType == RightOuter || joinType == FullOuter) {
- broadcastedRelation.value.zipWithIndex.filter {
- case (row, i) => !allIncludedBroadcastTuples.contains(i)
- }.map {
-
- case (row, _) => buildRow(Vector.fill(left.output.size)(null) ++ row)
- }
- } else {
- Vector()
- }
-
-
- sc.union(
- streamedPlusMatches.flatMap(_._1), sc.makeRDD(rightOuterMatches))
- }
CartesianProduct
- case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode
调用的是RDD的笛卡尔积操作,
- def execute() =
- left.execute().map(_.copy()).cartesian(right.execute().map(_.copy())).map {
- case (l: Row, r: Row) => buildRow(l ++ r)
- }
SparkEquiInnerJoin
- case class SparkEquiInnerJoin(
- leftKeys: Seq[Expression],
- rightKeys: Seq[Expression],
- left: SparkPlan,
- right: SparkPlan) extends BinaryNode
该join操作适用于left和right两部分partition一样大且提供各自keys的情况。
基本上看代码就可以了,没有什么可以说明的,做local join的时候借助的是PartitionLocalRDDFunctions里的方法。
- def execute() = attachTree(this, "execute") {
- val leftWithKeys = left.execute().mapPartitions { iter =>
- val generateLeftKeys = new Projection(leftKeys, left.output)
- iter.map(row => (generateLeftKeys(row), row.copy()))
- }
-
- val rightWithKeys = right.execute().mapPartitions { iter =>
- val generateRightKeys = new Projection(rightKeys, right.output)
- iter.map(row => (generateRightKeys(row), row.copy()))
- }
-
-
-
- val joined = filterNulls(leftWithKeys).joinLocally(filterNulls(rightWithKeys))
-
- joined.map { case (_, (leftTuple, rightTuple)) => buildRow(leftTuple ++ rightTuple) }
- }
-
-
-
-
-
- protected def filterNulls(rdd: RDD[(Row, Row)]) =
- rdd.filter {
- case (key: Seq[_], _) => !key.exists(_ == null)
- }
PartitionLocalRDDFunctions方法如下,该操作并不引入shuffle操作。两个RDD的partition数目需要相等。
- def joinLocally[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
- cogroupLocally(other).flatMapValues {
- case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
- }
- }
Other
Union
该操作直接继承SparkPlan
- case class Union(children: Seq[SparkPlan])(@transient sc: SparkContext) extends SparkPlan
用传入的SparkPlan集合各自的RDD执行结果生成一个UnionRDD
- def execute() = sc.union(children.map(_.execute()))