Reading the Source: How Spark Distinguishes Wide and Narrow Dependencies

Author: 超级奶霸

Article link: https://blog.csdn.net/ws527815206/article/details/107442348

Preface

1. RDD

2. Dependency

3. Common transformations that produce a shuffle

3.1 groupByKey

3.2 join
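Preface:

Wide and narrow dependencies are simply the two kinds of dependency that can exist between a parent RDD and a child RDD. The RDD constructor takes deps: Seq[Dependency[_]], which stores exactly this information, and Dependency falls into two broad classes: narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency). An RDD can only be created from scratch or derived from other RDDs through a transformation, and the type of dependency between RDDs is mostly determined by the specific transformation function.

Spark splits a job into stages according to the dependency between each RDD and its parents: a wide dependency marks a stage boundary, while narrow dependencies are pipelined within the same stage.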
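As a rough illustration of where stage boundaries come from, the sketch below walks an RDD's lineage and counts the ShuffleDependency edges it finds. It assumes spark-core is on the classpath and a local master is acceptable; the object name StageBoundaryDemo and the helper countShuffles are made up for this example. For a simple linear lineage like this one, the number of stages should be the number of shuffle edges plus one.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object StageBoundaryDemo {
  // Hypothetical helper: count the ShuffleDependency edges reachable from `rdd`.
  def countShuffles(rdd: RDD[_]): Int =
    rdd.dependencies.map { dep =>
      val here = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
      here + countShuffles(dep.rdd)
    }.sum

  def run(): (Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StageBoundaryDemo")
    val sc = new SparkContext(conf)
    try {
      // parallelize + map: narrow dependencies only.
      val pairs = sc.parallelize(1 to 10, 2).map(x => (x % 3, x))
      // reduceByKey on an unpartitioned RDD introduces one shuffle; mapValues stays narrow.
      val summed = pairs.reduceByKey(_ + _).mapValues(_ * 2)
      (countShuffles(pairs), countShuffles(summed))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (narrowOnly, withShuffle) = run()
    println(s"shuffle edges: narrow-only lineage = $narrowOnly, with reduceByKey = $withShuffle")
  }
}
```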

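So what actually decides the dependency type of an RDD? Let's find out from the source code.

1. RDD

Let's start with the RDD source. RDD is an abstract class that implements most of the methods common to all RDDs. Its primary constructor takes two parameters: a SparkContext, and deps: Seq[Dependency[_]], which holds the dependencies between this RDD and all of its parent RDDs. RDD declares two abstract methods that subclasses must implement: compute and getPartitions.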

  
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]
}

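RDD also exposes a method that subclasses can override, getDependencies, so that a given RDD type can define its own dependencies. By default it returns the deps passed to the constructor: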

protected def getDependencies: Seq[Dependency[_]] = deps

In addition, RDD has an auxiliary constructor that takes a single parent RDD[_]. It wires up a OneToOneDependency on that parent, which is a narrow dependency.

/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))

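An RDD is either created from scratch or derived from other RDDs. A newly created RDD has no parent RDD, so its dependency list is empty (source RDDs such as the one returned by parallelize pass Nil as deps).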
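To make the two abstract methods and the empty dependency list concrete, here is a minimal sketch of a custom source RDD. The names IndexPartition, PartitionIndexRDD and SourceRddDemo are hypothetical and exist only for this example; it assumes spark-core is on the classpath and runs in local mode.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: carries nothing but its index.
class IndexPartition(val index: Int) extends Partition

// Hypothetical source RDD: it has no parent, so Nil is passed as deps.
class PartitionIndexRDD(ctx: SparkContext, numSlices: Int) extends RDD[Int](ctx, Nil) {

  // One partition per slice; partition i must report index i.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => new IndexPartition(i))

  // Each partition yields a single element: its own index.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index)
}

object SourceRddDemo {
  def run(): (Boolean, List[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SourceRddDemo")
    val sc = new SparkContext(conf)
    try {
      val rdd = new PartitionIndexRDD(sc, 3)
      (rdd.dependencies.isEmpty, rdd.collect().toList)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (noDeps, data) = run()
    println(s"dependencies empty = $noDeps, data = $data")
  }
}
```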

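Every other RDD is derived through a transformation, and its dependency type is determined by that transformation. A transformation takes one or more RDDs and returns a new RDD, and the concrete type of the returned RDD varies with the transformation.

Spark ships many concrete RDD subclasses, each with its own implementation of these methods.

Take the common flatMap as an example. It returns a MapPartitionsRDD: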

/**
*  Return a new RDD by first applying a function to all elements of this
*  RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}

MapPartitionsRDD is built through the parent class's this(@transient oneParent: RDD[_]) constructor, so it gets a narrow dependency by default.

class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev)
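The effect of that constructor can be checked from user code. The following sketch (object name FlatMapDependencyDemo is made up; spark-core and a local master are assumed) inspects the dependency of a flatMap result and confirms that it is a OneToOneDependency pointing straight at the input RDD, with the partition count unchanged.

```scala
import org.apache.spark.{OneToOneDependency, SparkConf, SparkContext}

object FlatMapDependencyDemo {
  // Returns (is one-to-one, parent is the input RDD, partition count preserved).
  def run(): (Boolean, Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlatMapDependencyDemo")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a b", "c d e"), 2)
      val words = lines.flatMap(_.split(" "))
      val dep = words.dependencies.head
      (
        dep.isInstanceOf[OneToOneDependency[_]],
        dep.rdd eq lines,
        words.partitions.length == lines.partitions.length
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```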

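2. Dependency

Dependencies fall into two broad classes: NarrowDependency (narrow) and ShuffleDependency (wide). NarrowDependency in turn has three subclasses: OneToOneDependency, RangeDependency and PruneDependency.

The full set of Dependency implementations is:

Dependency
  NarrowDependency
    OneToOneDependency
    RangeDependency
    PruneDependency
  ShuffleDependency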

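map and filter use the one-to-one dependency OneToOneDependency: each partition of the child RDD depends on exactly one partition of the parent RDD, the one with the same index.

union uses the range dependency RangeDependency: a contiguous range of partitions in a parent RDD maps one-to-one onto a contiguous range of partitions in the child RDD, shifted by an offset.

filterByRange on a range-partitioned RDD uses the prune dependency PruneDependency: the child RDD keeps only a subset of the parent RDD's partitions.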
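The range mapping behind union is easiest to see by asking each dependency which parent partitions feed a given child partition. This is a small sketch under the assumption that neither input has a partitioner (so a plain UnionRDD is built); the object name UnionDependencyDemo is invented for the example.

```scala
import org.apache.spark.{RangeDependency, SparkConf, SparkContext}

object UnionDependencyDemo {
  // For child partition 3, list the parent partitions each RangeDependency maps to.
  def run(): Seq[List[Int]] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("UnionDependencyDemo")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(1 to 4, 2)  // becomes child partitions 0..1
      val b = sc.parallelize(5 to 10, 3) // becomes child partitions 2..4
      val ranges = a.union(b).dependencies.collect { case r: RangeDependency[_] => r }
      // Child partition 3 lies outside a's range and is b's partition 1.
      ranges.map(_.getParents(3).toList)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

Each child partition has exactly one parent partition, which is what makes RangeDependency a narrow dependency.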

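3. Common transformations that produce a shuffle

Whether a dependency is a shuffle dependency depends not only on the transformation itself but usually also on the partitioners and partition counts of the RDD and its parents. Let's look at how two transformations that may produce a shuffle dependency are implemented: groupByKey and join.

3.1 groupByKey

combineByKey and reduceByKey are similar to groupByKey. They are made available on pair RDDs through the implicit conversion to PairRDDFunctions, and groupByKey, combineByKey and reduceByKey all end up in PairRDDFunctions.combineByKeyWithClassTag: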

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}

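The key is the check self.partitioner == Some(partitioner): self.partitioner is the partitioner of the RDD being operated on, and partitioner is the method argument. If the two are equal the result is a narrow dependency; otherwise it is a wide dependency.

The partitioner argument comes from one of two places. If the caller passes a partition count, a new HashPartitioner(numPartitions) is used. If nothing is specified, defaultPartitioner(self) is used, which reuses the current RDD's partitioner when it has a suitable one and otherwise falls back to a new HashPartitioner.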
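A practical consequence of this check is that whether reduceByKey shuffles depends on whether the input still carries a matching partitioner. The sketch below is an illustration, not part of the original post: the object name CombineByKeyPathDemo is made up, and it assumes spark-core on the classpath and a local master. It compares three cases: the same partitioner passed explicitly, a partitioner preserved by mapValues, and a partitioner dropped by map.

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object CombineByKeyPathDemo {
  // Returns (explicit same partitioner is narrow, mapValues case is narrow, map case shuffles).
  def run(): (Boolean, Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CombineByKeyPathDemo")
    val sc = new SparkContext(conf)
    try {
      val p = new HashPartitioner(4)
      val base = sc.parallelize(Seq(1 -> 1, 2 -> 2, 1 -> 3), 2).partitionBy(p)

      // Same partitioner passed in: the mapPartitions branch is taken.
      val explicit = base.reduceByKey(p, (a: Int, b: Int) => a + b)
      // mapValues keeps the partitioner, so defaultPartitioner(self) hands p back.
      val kept = base.mapValues(_ + 1).reduceByKey(_ + _)
      // map drops the partitioner, so a ShuffledRDD is built.
      val dropped = base.map { case (k, v) => (k, v) }.reduceByKey(_ + _)

      (
        explicit.dependencies.head.isInstanceOf[OneToOneDependency[_]],
        kept.dependencies.head.isInstanceOf[OneToOneDependency[_]],
        dropped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```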

def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
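
self.partitioner == Some(partitioner) only holds when the partitioner passed in is of the same type as the parent's partitioner and has the same number of partitions. For HashPartitioner, equals is defined as:
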
override def equals(other: Any): Boolean = other match {
  case h: HashPartitioner =>
    h.numPartitions == numPartitions
  case _ =>
    false
}

For a RangePartitioner, the range bounds and the sort direction must both match:

override def equals(other: Any): Boolean = other match {
  case r: RangePartitioner[_, _] =>
    r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
  case _ =>
    false
}

Otherwise a ShuffledRDD is returned, and its dependency is a ShuffleDependency:

override def getDependencies: Seq[Dependency[_]] = {
  val serializer = userSpecifiedSerializer.getOrElse {
    val serializerManager = SparkEnv.get.serializerManager
    if (mapSideCombine) {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
    } else {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
    }
  }
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}

Test code:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    // Partitioners compare equal when their type and partition count match.
    val p1 = new HashPartitioner(3)
    val p2 = new HashPartitioner(3)
    println(s"p1 == p2 ? = ${Some(p1) == Some(p2)}")
    val p3 = new HashPartitioner(1)
    println(s"p1 == p3 ? = ${Some(p1) == Some(p3)}")

    // Stand-in for the author's Sparkutil.getSpark() helper, which is not shown.
    val spark = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
    val rdd1 = spark.sparkContext.parallelize(Array((1, 2), (2, 3), (1, 3), (3, 3), (4, 3))).repartition(2)

    println("rdd1 partitioner:" + rdd1.partitioner)
    println("rdd1 partitioner:" + rdd1.dependencies)
    println("------------------------------------------ \n")

    // repartition does not set a partitioner on its result.
    val rdd2 = rdd1.repartition(1)
    println(s"rdd1:[${rdd1.partitioner}] == rdd2:[${rdd2.partitioner}] ? = ${rdd1.partitioner == rdd2.partitioner }")
    println("rdd2 dependencies:" + rdd2.dependencies)
    println("------------------------------------------ \n")

    // rdd2 has no partitioner, so groupByKey has to shuffle.
    val rdd3 = rdd2.groupByKey(1)
    println(s"rdd2:[${rdd2.partitioner}] == rdd3:[${rdd3.partitioner}] ? = ${rdd2.partitioner == rdd3.partitioner }")
    println("rdd3 partitioner:" + rdd3.dependencies)
    println("------------------------------------------ \n")

    // Same partitioner (HashPartitioner(1)) as rdd3: no shuffle.
    val rdd4 = rdd3.groupByKey(1)
    println(s"rdd3:[${rdd3.partitioner}] == rdd4:[${rdd4.partitioner}] ? = ${rdd3.partitioner == rdd4.partitioner }")
    println("rdd4 partitioner:" + rdd4.dependencies)
    println("------------------------------------------ \n")

    // Different partition count: shuffle again.
    val rdd5 = rdd4.groupByKey(2)
    println(s"rdd4:[${rdd4.partitioner}] == rdd5:[${rdd5.partitioner}] ? = ${rdd4.partitioner == rdd5.partitioner }")
    println("rdd5 partitioner:" + rdd5.dependencies)

    spark.stop()
  }
}

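Output (dependencies lists only an RDD's direct parents, which is why the shuffle performed inside repartition does not appear for rdd1 and rdd2):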

p1 == p2 ? = true
p1 == p3 ? = false

rdd1 partitioner:None
rdd1 partitioner:List(org.apache.spark.OneToOneDependency@31edeac)
------------------------------------------ 

rdd1:[None] == rdd2:[None] ? = true
rdd2 dependencies:List(org.apache.spark.OneToOneDependency@74db12c2)
------------------------------------------ 

rdd2:[None] == rdd3:[Some(org.apache.spark.HashPartitioner@1)] ? = false
rdd3 partitioner:List(org.apache.spark.ShuffleDependency@2dd8239)
------------------------------------------ 

rdd3:[Some(org.apache.spark.HashPartitioner@1)] == rdd4:[Some(org.apache.spark.HashPartitioner@1)] ? = true
rdd4 partitioner:List(org.apache.spark.OneToOneDependency@319c3a25)
------------------------------------------ 

rdd4:[Some(org.apache.spark.HashPartitioner@1)] == rdd5:[Some(org.apache.spark.HashPartitioner@2)] ? = false
rdd5 partitioner:List(org.apache.spark.ShuffleDependency@1dd7796b)

3.2 join

join is another common transformation that can produce a shuffle. Under the hood a join is a cogroup, which returns a CoGroupedRDD:

/**
* :: DeveloperApi ::
* An RDD that cogroups its parents. For each key k in parent RDDs, the resulting RDD contains a
* tuple with the list of values for that key.
*
* @param rdds parent RDDs.
* @param part partitioner used to partition the shuffle output
*
* @note This is an internal API. We recommend users use RDD.cogroup(...) instead of
* instantiating this directly.
*/
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil)


override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}

/**
* For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
* tuple with the list of values for that key in `this`, `other1` and `other2`.
*/
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
  cogroup(other1, other2, defaultPartitioner(self, other1, other2))
}
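The per-parent decision in CoGroupedRDD.getDependencies can be observed by joining one RDD that is already partitioned by the join partitioner with one that is not. This is a hedged sketch: the object name JoinDependencyDemo and the helper findCoGrouped are invented here, and the lineage walk assumes that join wraps the CoGroupedRDD in a few narrow map steps.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.{CoGroupedRDD, RDD}

object JoinDependencyDemo {
  // Hypothetical helper: walk the lineage until the CoGroupedRDD built by join is found.
  def findCoGrouped(rdd: RDD[_]): Option[CoGroupedRDD[_]] = rdd match {
    case cg: CoGroupedRDD[_] => Some(cg)
    case other => other.dependencies.flatMap(d => findCoGrouped(d.rdd)).headOption
  }

  // Returns the simple class name of the dependency chosen for each parent, in order.
  def run(): Seq[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("JoinDependencyDemo")
    val sc = new SparkContext(conf)
    try {
      val p = new HashPartitioner(3)
      val left = sc.parallelize(Seq(1 -> "a", 2 -> "b"), 2).partitionBy(p) // already partitioned by p
      val right = sc.parallelize(Seq(1 -> 10, 3 -> 30), 2)                 // no partitioner
      val joined = left.join(right, p)
      findCoGrouped(joined).toSeq.flatMap { cg =>
        cg.dependencies.map(dep => dep.getClass.getSimpleName)
      }
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

Only the side whose partitioner differs from the join partitioner is shuffled, so pre-partitioning the larger side of a repeated join can save a shuffle.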

The crux is the defaultPartitioner method:

/**
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
* as the default partitions number, otherwise we'll use the max number of upstream partitions.
*
* When available, we choose the partitioner from rdds with maximum number of partitions. If this
* partitioner is eligible (number of partitions within an order of maximum number of partitions
* in rdds), or has partition number higher than or equal to default partitions number - we use
* this partitioner.
*
* Otherwise, we'll use a new HashPartitioner with the default partitions number.
*
* Unless spark.default.parallelism is set, the number of partitions will be the same as the
* number of partitions in the largest upstream RDD, as this should be least likely to cause
* out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*/
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))


  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }


  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }


  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}

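1. First, collect the RDDs that already have a partitioner with more than 0 partitions into hasPartitioner.

2. Among those, pick the RDD with the most partitions as hasMaxPartitioner.

3. Compute the default partition count defaultNumPartitions: if spark.default.parallelism is set, use the SparkContext's defaultParallelism; otherwise use the largest partition count among all the RDDs.

4. If hasMaxPartitioner is non-empty and either it is eligible (the largest partition count among all the RDDs is less than 10 times the partition count of hasMaxPartitioner) or the default partition count is less than or equal to the partition count of hasMaxPartitioner, use hasMaxPartitioner's partitioner.

5. Otherwise, create a new HashPartitioner with defaultNumPartitions partitions.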
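The steps above can be exercised directly through Partitioner.defaultPartitioner. The sketch below makes two assumptions that are not in the original post: spark.default.parallelism is not set (the conf is built with loadDefaults = false to keep system properties out), and the Spark version in use has the same defaultPartitioner logic as the excerpt. The object name DefaultPartitionerDemo is made up.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

object DefaultPartitionerDemo {
  // Returns (existing partitioner reused in case 1, partition count chosen in case 2).
  def run(): (Boolean, Int) = {
    // Assumption: spark.default.parallelism is not set; loadDefaults = false ignores system properties.
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("DefaultPartitionerDemo")
    val sc = new SparkContext(conf)
    try {
      val small = sc.parallelize(Seq(1 -> 1), 2)   // no partitioner, 2 partitions
      val large = sc.parallelize(Seq(1 -> 1), 100) // no partitioner, 100 partitions
      val by4 = sc.parallelize(Seq(1 -> 1), 2).partitionBy(new HashPartitioner(4))
      val by2 = sc.parallelize(Seq(1 -> 1), 2).partitionBy(new HashPartitioner(2))

      // Case 1: 4 vs 2 partitions, within a factor of 10, so by4's partitioner is reused.
      val reused = Partitioner.defaultPartitioner(by4, small)
      // Case 2: 2 vs 100 partitions, not eligible and 100 > 2, so a new HashPartitioner(100) is built.
      val fresh = Partitioner.defaultPartitioner(by2, large)

      (reused eq by4.partitioner.get, fresh.numPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

In case 2 the join would shuffle both sides, because the chosen partitioner differs from the one by2 already has.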