RDD Dependencies
Spark divides RDD dependencies into two types: narrow dependencies (Narrow Dependencies) and wide dependencies (Wide Dependencies).
Narrow Dependencies
Each partition of the parent RDD is used by at most one partition of the child RDD.
The relationship between parent and child is 1-to-1 (one parent RDD maps to one child RDD) or n-to-1 (several parent RDDs map to one child RDD).
Examples: map, filter, union.
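A minimal sketch of the narrow case (assuming a live SparkContext named sc): map keeps a 1-to-1 partition mapping, while union combines several parent RDDs, each contributing one narrow dependency (a RangeDependency, analyzed in the source-code section below).

val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(101 to 200, 4)
println(a.map(_ * 2).dependencies.map(_.getClass.getSimpleName))
// List(OneToOneDependency): 1-to-1
println(a.union(b).dependencies.map(_.getClass.getSimpleName))
// List(RangeDependency, RangeDependency): one narrow dependency per parent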
Wide Dependencies
Partitions of multiple child RDDs depend on a single partition of the parent RDD.
The relationship between parent and child is n-to-n, and wide dependencies usually correspond to a shuffle.
Examples: groupByKey, reduceByKey, sortByKey.
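A matching sketch for the wide case (again assuming sc): groupByKey on a pair RDD with no pre-existing partitioner forces a shuffle, which shows up as a ShuffleDependency.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.groupByKey().dependencies.map(_.getClass.getSimpleName))
// List(ShuffleDependency): each reducer partition reads many map partitions

Note that if pairs were already partitioned with the same partitioner, groupByKey would reuse it and no shuffle would be needed.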
Differences Between Wide and Narrow Dependencies
(1) A narrow dependency allows pipelined execution on a single cluster node, which can compute all of its parent partitions; for example, a filter and a map can be applied element by element in sequence. In contrast, a wide dependency requires the data of all parent partitions to be available and already shuffled before the next computation can begin.
When transformations are executed on an RDD, the scheduler uses the RDD's "lineage" to build a directed acyclic graph (DAG) composed of scheduling stages (Stages). For a narrow dependency, the partition-level dependency is deterministic, so the partition transformations can be completed in the same thread; Spark therefore places them in the same stage, forming a pipeline (pipeline). In the figure, RDDs C, D, E, and F are all in Stage 2. For a wide dependency, the next stage can only start once the parent RDD's shuffle has completed, so a wide dependency starts a stage of its own, like RDD A in the figure above; the toDebugString sketch after point (2) below makes this split visible.
(2) Recovery after a node failure is more efficient under a narrow dependency, because only the lost parent partitions need to be recomputed, and those lost partitions can be recomputed in parallel on different nodes. By contrast, in the lineage of a wide dependency, a single failed node may cause the loss of some partitions from all ancestors of an RDD, forcing the computation to be re-executed.
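A quick way to see the stage split in practice (a minimal sketch, assuming sc) is toDebugString, which prints the lineage with indentation marking each shuffle boundary, i.e. each new stage:

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)
// The ShuffledRDD sits at the top; the indented block below it is the
// map-side stage, whose transformations pipeline together.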
Dependency Source Code Analysis
Dependency is an abstract class, the base class for representing RDD dependencies:
/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
The class has a single member, rdd: a Dependency is essentially a wrapper around a parent RDD, and its concrete type describes how the current transformation processes data. It has two subclasses, NarrowDependency and ShuffleDependency, corresponding to narrow and wide dependencies respectively.
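A minimal sketch of that wrapping (assuming sc): every entry in an RDD's dependencies exposes its parent RDD through rdd.

val parent = sc.parallelize(1 to 10)
val child = parent.map(_ + 1)
val dep = child.dependencies.head
println(dep.rdd == parent) // true: the Dependency wraps the parent RDD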
(1) NarrowDependency (narrow dependency)
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
NarrowDependency is also abstract. It defines the abstract method getParents, which takes the partitionId of a child RDD partition and returns all partitions of the parent RDD that this child partition depends on.
NarrowDependency has three concrete implementations: OneToOneDependency, RangeDependency, and PruneDependency.
(1_a)OneToOneDependency
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
OneToOneDependency means a child RDD partition depends on exactly one partition of the parent RDD; operators that produce it include map, filter, and flatMap. The getParents implementation is trivial: it takes a partitionId and returns it wrapped in a List.
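A minimal sketch of that identity mapping (assuming sc; OneToOneDependency is a public @DeveloperApi class, so it can be constructed directly):

import org.apache.spark.OneToOneDependency

val parent = sc.parallelize(1 to 100, 4)
val dep = new OneToOneDependency(parent)
println(dep.getParents(2)) // List(2): child partition 2 reads parent partition 2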
(1_b)RangeDependency
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
RangeDependency means a child RDD partition depends one-to-one on a parent RDD partition within a given range; it is used mainly by union.
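A worked sketch of how union uses it (assuming sc): with a 2-partition rdd1 and a 3-partition rdd2, the union has 5 partitions, and for the RangeDependency on rdd2 we get inStart = 0, outStart = 2, length = 3, so child partition 3 maps back to rdd2's partition 3 - 2 + 0 = 1.

import org.apache.spark.RangeDependency

val rdd1 = sc.parallelize(1 to 10, 2)   // child partitions 0-1
val rdd2 = sc.parallelize(11 to 40, 3)  // child partitions 2-4
val u = rdd1.union(rdd2)                // one RangeDependency per parent
val depOnRdd2 = u.dependencies(1).asInstanceOf[RangeDependency[Int]]
println(depOnRdd2.getParents(3)) // List(1)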
(1_c)PruneDependency
private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case (split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int): List[Int] = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}
Here the child RDD keeps only the subset of the parent RDD's partitions that pass partitionFilterFunc; this dependency is used when calling the filterByRange method.
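A minimal sketch (assuming sc): PartitionPruningRDD.create builds an RDD on top of a PruneDependency directly; filterByRange relies on the same mechanism when the RDD has a RangePartitioner.

import org.apache.spark.rdd.PartitionPruningRDD

val parent = sc.parallelize(1 to 100, 4)
val pruned = PartitionPruningRDD.create(parent, idx => idx % 2 == 0) // keep partitions 0 and 2
println(pruned.partitions.length)                          // 2
println(pruned.dependencies.map(_.getClass.getSimpleName)) // List(PruneDependency)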
(2) ShuffleDependency (wide dependency)
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
Because a shuffle involves network transfer, a serializer is required. To reduce network traffic, partial aggregation can be performed on the map side, controlled by mapSideCombine and aggregator; keyOrdering specifies how keys are ordered, partitioner determines how the shuffle output is partitioned, and some class-name information is recorded as well. The partition-to-partition relationship comes to an abrupt end at the shuffle, which is why the shuffle is the basis for dividing stages.
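A minimal sketch of inspecting these fields (assuming sc): reduceByKey builds a ShuffleDependency with map-side combining enabled and no key ordering.

import org.apache.spark.ShuffleDependency

val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.mapSideCombine) // true: reduceByKey pre-aggregates on the map side
println(dep.keyOrdering)    // None: reduceByKey needs no key ordering
println(dep.partitioner)    // a HashPartitioner by default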
Example
scala> val rdd = sc.textFile("/wordcount/word.txt")
rdd: org.apache.spark.rdd.RDD[String] = /wordcount/word.txt MapPartitionsRDD[1] at textFile at <console>:24
Here we use word count as the example.
scala> val word_map = rdd.flatMap((_.split(" "))).filter((_ != " ")).map((_, 1))
word_map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
Here flatMap, filter, and map are chained into a single pipeline.
scala> word_map.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@5b2235a5)
scala> word_map.dependencies.size
res1: Int = 1
We can see that word_map's dependency is a OneToOneDependency, i.e., a narrow dependency.
scala> word_map.dependencies.foreach{ dep =>
| println(dep.getClass)
| println(dep.rdd)
| println(dep.rdd.partitions)
| println(dep.rdd.partitions.size)
| }
class org.apache.spark.OneToOneDependency
MapPartitionsRDD[3] at filter at <console>:26
[Lorg.apache.spark.Partition;@54077082
1
This prints some information about word_map's dependencies.
scala> val word_reduce = word_map.reduceByKey((_ + _))
word_reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:28
Then reduceByKey is used to count the words, which produces a shuffle.
scala> word_reduce.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@4968eb3f)
scala> word_reduce.dependencies.size
res5: Int = 1
We can see that word_reduce's dependency is a ShuffleDependency, i.e., a wide dependency.
scala> word_reduce.dependencies.foreach{ dep =>
| println(dep.getClass)
| println(dep.rdd)
| println(dep.rdd.partitions)
| println(dep.rdd.partitions.size)
| }
class org.apache.spark.ShuffleDependency
MapPartitionsRDD[4] at map at <console>:26
[Lorg.apache.spark.Partition;@54077082
1
This prints some information about word_reduce's dependencies.
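As a hedged closing sketch, the same information can be collected for the whole lineage by walking dependencies recursively (run in the same shell session as above):

def printLineage(rdd: org.apache.spark.rdd.RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd + " -> " +
    rdd.dependencies.map(_.getClass.getSimpleName).mkString(", "))
  rdd.dependencies.foreach(d => printLineage(d.rdd, depth + 1))
}
printLineage(word_reduce)
// Expect a chain of OneToOneDependency entries below a single ShuffleDependency.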