Spark Programming Model (3): RDD Dependencies

RDD Dependencies

Dependencies between RDDs are divided into two types: narrow dependencies and wide dependencies.
(Figure: narrow vs. wide dependencies)

Narrow dependencies

Each partition of the parent RDD is used by at most one partition of the child RDD.

The relationship between parent RDD and child RDD is 1-to-1 (one parent RDD corresponds to one child RDD) or n-to-1 (several parent RDDs correspond to one child RDD).

Examples: map, filter, and union.

Wide dependencies

Multiple partitions of the child RDD depend on the same partition of the parent RDD.

The relationship between parent RDD and child RDD is n-to-n, and wide dependencies usually correspond to shuffle operations.

Examples: groupByKey, reduceByKey, and sortByKey.
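
A quick way to tell the two apart in spark-shell is to inspect the class of an RDD's dependency (a minimal sketch; the example section at the end of this post walks through this in more detail):

// A narrow transformation: each output partition depends on exactly one input partition.
val narrowRdd = sc.parallelize(1 to 100, 4).map(_ * 2)
println(narrowRdd.dependencies.head.getClass.getName)   // expect org.apache.spark.OneToOneDependency

// A wide transformation: each output partition depends on many input partitions (shuffle).
val wideRdd = sc.parallelize(1 to 100, 4).map(i => (i % 10, i)).groupByKey()
println(wideRdd.dependencies.head.getClass.getName)     // expect org.apache.spark.ShuffleDependency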

Differences between narrow and wide dependencies

(1) A narrow dependency allows pipelined execution on a single cluster node, which can compute all of the parent partitions it needs; for example, a filter and a map can be applied element by element, one after the other.

In contrast, a wide dependency requires the data of all parent partitions to be available and to have been shuffled across nodes.
(Figure: a DAG divided into stages; RDDs C, D, E, and F are pipelined into Stage 2, while the wide dependency on RDD A starts a separate stage)
When transformations are executed on an RDD, the scheduler uses the RDD's lineage to build a directed acyclic graph (DAG) made up of scheduling stages.

For narrow dependencies, because the partition-level dependency is deterministic, the transformations of a partition can be carried out within a single task (thread); Spark therefore places narrowly dependent RDDs into the same stage, forming a pipeline. In the figure, RDDs C, D, E, and F are all in Stage 2.

For a wide dependency, the next stage can only start its computation after the parent RDD's shuffle has finished, so a wide dependency forms the boundary of a separate stage, like RDD A in the figure above.

(2) With narrow dependencies, recovery after a node failure is more efficient: only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes.

In contrast, in a lineage with wide dependencies, a single failed node may cause the loss of some partitions across all ancestors of an RDD, forcing the computation to be re-executed.

Dependency source code analysis

Dependency is an abstract class, the base class for representing an RDD dependency.

/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

We can see that the class has an rdd member, i.e. a Dependency is a wrapper around the parent RDD, and the concrete Dependency type describes how the data is processed by the corresponding transformation. It has two subclasses, NarrowDependency and ShuffleDependency, corresponding to narrow and wide dependencies respectively.

(1) NarrowDependency (narrow dependency)

@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

NarrowDependency is also an abstract class. It defines the abstract method getParents, which takes a partitionId and returns all partitions of the parent RDD that the given child partition depends on.

NarrowDependency has three concrete implementations: OneToOneDependency, RangeDependency, and PruneDependency.

(1_a)OneToOneDependency

@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

OneToOneDependency means that each partition of the child RDD depends on exactly one partition of the parent RDD; operators that produce a OneToOneDependency include map, filter, and flatMap. The getParents implementation is trivial: it takes a partitionId and returns that same partitionId wrapped in a List.
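
A quick illustration (a sketch for spark-shell; OneToOneDependency is a public developer API, so it can be constructed directly):

import org.apache.spark.OneToOneDependency

val parent = sc.parallelize(1 to 100, 4)
val dep = new OneToOneDependency(parent)
dep.getParents(2)   // List(2): child partition 2 depends only on parent partition 2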

(1_b)RangeDependency

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}

RangeDependency means that a child RDD partition depends one-to-one on a parent RDD partition within a given range; it is mainly used by union.
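
For example (a sketch for spark-shell; the partition counts are arbitrary), a union of a 2-partition RDD and a 3-partition RDD carries one RangeDependency per parent, and getParents maps a child partition index back into the corresponding parent's range:

import org.apache.spark.NarrowDependency

val a = sc.parallelize(1 to 10, 2)     // becomes partitions 0-1 of the union
val b = sc.parallelize(11 to 20, 3)    // becomes partitions 2-4 of the union
val u = a.union(b)

u.dependencies.foreach(d => println(d.getClass.getSimpleName))  // expect two RangeDependency instances

// The second dependency covers `b`: inStart = 0, outStart = 2, length = 3,
// so child partition 3 maps back to partition 1 of `b`.
u.dependencies(1).asInstanceOf[NarrowDependency[_]].getParents(3)  // List(1)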

(1_c)PruneDependency

private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case(split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int): List[Int] = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}

In a PruneDependency the child RDD's partitions are a pruned subset of the parent RDD's partitions (those accepted by partitionFilterFunc); it is used by PartitionPruningRDD, for example when filterByRange is called.
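
A sketch of how a PruneDependency arises (PruneDependency itself is private[spark], but PartitionPruningRDD.create is a public developer API; the partition filter here is only illustrative):

import org.apache.spark.rdd.PartitionPruningRDD

val parentRdd = sc.parallelize(1 to 100, 10)
// Keep only the even-numbered partitions of the parent.
val pruned = PartitionPruningRDD.create(parentRdd, partitionIndex => partitionIndex % 2 == 0)

println(pruned.partitions.length)                                    // 5
pruned.dependencies.foreach(d => println(d.getClass.getSimpleName))  // expect PruneDependency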

(2) ShuffleDependency (wide dependency)

/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}

Since a shuffle involves transferring data over the network, a serializer is required. To reduce network traffic, partial aggregation can be performed on the map side, controlled by mapSideCombine and aggregator. There is also keyOrdering for ordering keys, a partitioner that determines how the shuffle output is partitioned, and some class information. The partition-to-partition relationship breaks off at the shuffle, which is why a shuffle is the basis for dividing stages.
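
These fields can be inspected on a real shuffle dependency (a sketch; the cast is only for illustration, and the exact partitioner depends on the default parallelism):

import org.apache.spark.ShuffleDependency

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2).reduceByKey(_ + _)
val shuffleDep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

println(shuffleDep.partitioner)     // how the shuffle output is partitioned (HashPartitioner by default)
println(shuffleDep.mapSideCombine)  // true: reduceByKey enables map-side partial aggregation
println(shuffleDep.keyOrdering)     // None: reduceByKey does not require sorted keys
println(shuffleDep.shuffleId)       // unique id registered with the ShuffleManager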

Example
scala> val rdd = sc.textFile("/wordcount/word.txt")
rdd: org.apache.spark.rdd.RDD[String] = /wordcount/word.txt MapPartitionsRDD[1] at textFile at <console>:24

We use word count as an example; the text file is first read into an RDD.

scala> val word_map = rdd.flatMap((_.split(" "))).filter((_ != " ")).map((_, 1))
word_map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26

Here flatMap, filter, and map form a pipeline.

scala> word_map.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@5b2235a5)

scala> word_map.dependencies.size
res1: Int = 1

We can see that word_map's dependency is a OneToOneDependency, i.e. a narrow dependency.

scala> word_map.dependencies.foreach{ dep => 
     | println(dep.getClass)
     | println(dep.rdd)
     | println(dep.rdd.partitions)
     | println(dep.rdd.partitions.size)
     | }
class org.apache.spark.OneToOneDependency
MapPartitionsRDD[3] at filter at <console>:26
[Lorg.apache.spark.Partition;@54077082
1

Print some information about word_map's dependency.

scala> val word_reduce = word_map.reduceByKey((_ + _))
word_reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:28

Then reduceByKey is used to count the words, which introduces a shuffle.

scala> word_reduce.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@4968eb3f)

scala> word_reduce.dependencies.size
res5: Int = 1

We can see that word_reduce's dependency is a ShuffleDependency, i.e. a wide dependency.

scala> word_reduce.dependencies.foreach{ dep => 
     | println(dep.getClass)
     | println(dep.rdd)
     | println(dep.rdd.partitions)
     | println(dep.rdd.partitions.size)
     | }
class org.apache.spark.ShuffleDependency
MapPartitionsRDD[4] at map at <console>:26
[Lorg.apache.spark.Partition;@54077082
1

Print some information about word_reduce's dependency.
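
Finally, the stage split described earlier can be seen in the lineage itself: toDebugString prints the whole dependency chain, and the indentation change at the shuffle marks the stage boundary (a sketch; the exact output format depends on the Spark version):

// Narrow (pipelined) ancestors share one indentation level;
// the shuffle boundary separates the map-side stage from the result stage.
println(word_reduce.toDebugString)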

Reposted from: https://www.cnblogs.com/oldsix666/articles/9458191.html
