RDD Dependencies
Spark divides RDD dependencies into two types: narrow dependencies (Narrow Dependencies) and wide dependencies (Wide Dependencies).
Narrow Dependencies
Each partition of the parent RDD is used by at most one partition of the child RDD.
The relationship between parent and child is 1-to-1 (one parent RDD maps to one child RDD) or n-to-1 (several parent RDDs map to one child RDD).
Examples: map, filter, union.
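A minimal sketch of the narrow case (assuming a live SparkContext named sc): map keeps a 1-to-1 partition mapping, while union combines several parent RDDs, each contributing one narrow dependency (a RangeDependency, analyzed in the source-code section below).

val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(101 to 200, 4)
println(a.map(_ * 2).dependencies.map(_.getClass.getSimpleName))
// List(OneToOneDependency): 1-to-1
println(a.union(b).dependencies.map(_.getClass.getSimpleName))
// List(RangeDependency, RangeDependency): one narrow dependency per parent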
Wide Dependencies
Partitions of multiple child RDDs depend on a single partition of the parent RDD.
The relationship between parent and child is n-to-n, and wide dependencies usually correspond to a shuffle.
Examples: groupByKey, reduceByKey, sortByKey.
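A matching sketch for the wide case (again assuming sc): groupByKey on a pair RDD with no pre-existing partitioner forces a shuffle, which shows up as a ShuffleDependency.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.groupByKey().dependencies.map(_.getClass.getSimpleName))
// List(ShuffleDependency): each reducer partition reads many map partitions

Note that if pairs were already partitioned with the same partitioner, groupByKey would reuse it and no shuffle would be needed.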
Differences Between Wide and Narrow Dependencies
(1) A narrow dependency allows pipelined execution on a single cluster node, which can compute all of its parent partitions; for example, a filter and a map can be applied element by element in sequence. In contrast, a wide dependency requires the data of all parent partitions to be available and already shuffled before the next computation can begin.
When transformations are executed on an RDD, the scheduler uses the RDD's "lineage" to build a directed acyclic graph (DAG) composed of scheduling stages (Stages). For a narrow dependency, the partition-level dependency is deterministic, so the partition transformations can be completed in the same thread; Spark therefore places them in the same stage, forming a pipeline (pipeline). In the figure, RDDs C, D, E, and F are all in Stage 2. For a wide dependency, the next stage can only start once the parent RDD's shuffle has completed, so a wide dependency starts a stage of its own, like RDD A in the figure above; the toDebugString sketch after point (2) below makes this split visible.
(2) Recovery after a node failure is more efficient under a narrow dependency, because only the lost parent partitions need to be recomputed, and those lost partitions can be recomputed in parallel on different nodes. By contrast, in the lineage of a wide dependency, a single failed node may cause the loss of some partitions from all ancestors of an RDD, forcing the computation to be re-executed.
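A quick way to see the stage split in practice (a minimal sketch, assuming sc) is toDebugString, which prints the lineage with indentation marking each shuffle boundary, i.e. each new stage:

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)
// The ShuffledRDD sits at the top; the indented block below it is the
// map-side stage, whose transformations pipeline together.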
Dependency Source Code Analysis
Dependency is an abstract class, the base class for representing RDD dependencies:
/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
The class has a single member, rdd: a Dependency is essentially a wrapper around a parent RDD, and its concrete type describes how the current transformation processes data. It has two subclasses, NarrowDependency and ShuffleDependency, corresponding to narrow and wide dependencies respectively.
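A minimal sketch of that wrapping (assuming sc): every entry in an RDD's dependencies exposes its parent RDD through rdd.

val parent = sc.parallelize(1 to 10)
val child = parent.map(_ + 1)
val dep = child.dependencies.head
println(dep.rdd == parent) // true: the Dependency wraps the parent RDD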
(1) NarrowDependency (narrow dependency)
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
NarrowDependency is also abstract. It defines the abstract method getParents, which takes the partitionId of a child RDD partition and returns all partitions of the parent RDD that this child partition depends on.
NarrowDependency has three concrete implementations: OneToOneDependency, RangeDependency, and PruneDependency.
(1_a)OneToOneDependency
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
OneToOneDependency means a child RDD partition depends on exactly one partition of the parent RDD; operators that produce it include map, filter, and flatMap. The getParents implementation is trivial: it takes a partitionId and returns it wrapped in a List.
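A minimal sketch of that identity mapping (assuming sc; OneToOneDependency is a public @DeveloperApi class, so it can be constructed directly):

import org.apache.spark.OneToOneDependency

val parent = sc.parallelize(1 to 100, 4)
val dep = new OneToOneDependency(parent)
println(dep.getParents(2)) // List(2): child partition 2 reads parent partition 2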
(1_b)RangeDependency
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
RangeDependency means a child RDD partition depends one-to-one on a parent RDD partition within a given range; it is used mainly by union.
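A worked sketch of how union uses it (assuming sc): with a 2-partition rdd1 and a 3-partition rdd2, the union has 5 partitions, and for the RangeDependency on rdd2 we get inStart = 0, outStart = 2, length = 3, so child partition 3 maps back to rdd2's partition 3 - 2 + 0 = 1.

import org.apache.spark.RangeDependency

val rdd1 = sc.parallelize(1 to 10, 2)   // child partitions 0-1
val rdd2 = sc.parallelize(11 to 40, 3)  // child partitions 2-4
val u = rdd1.union(rdd2)                // one RangeDependency per parent
val depOnRdd2 = u.dependencies(1).asInstanceOf[RangeDependency[Int]]
println(depOnRdd2.getParents(3)) // List(1)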
(1_c)PruneDependency
private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case (split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int): List[Int] = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}
Here the child RDD keeps only the subset of the parent RDD's partitions that pass partitionFilterFunc; this dependency is used when calling the filterByRange method.
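A minimal sketch (assuming sc): PartitionPruningRDD.create builds an RDD on top of a PruneDependency directly; filterByRange relies on the same mechanism when the RDD has a RangePartitioner.

import org.apache.spark.rdd.PartitionPruningRDD

val parent = sc.parallelize(1 to 100, 4)
val pruned = PartitionPruningRDD.create(parent, idx => idx % 2 == 0) // keep partitions 0 and 2
println(pruned.partitions.length)                          // 2
println(pruned.dependencies.map(_.getClass.getSimpleName)) // List(PruneDependency)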
(2) ShuffleDependency (wide dependency)
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
Because a shuffle involves network transfer, a serializer is required. To reduce network traffic, partial aggregation can be performed on the map side, controlled by mapSideCombine and aggregator; keyOrdering specifies how keys are ordered, partitioner determines how the shuffle output is partitioned, and some class-name information is recorded as well. The partition-to-partition relationship comes to an abrupt end at the shuffle, which is why the shuffle is the basis for dividing stages.
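A minimal sketch of inspecting these fields (assuming sc): reduceByKey builds a ShuffleDependency with map-side combining enabled and no key ordering.

import org.apache.spark.ShuffleDependency

val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.mapSideCombine) // true: reduceByKey pre-aggregates on the map side
println(dep.keyOrdering)    // None: reduceByKey needs no key ordering
println(dep.partitioner)    // a HashPartitioner by default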
Example
scala> val rdd = sc.textFile("/wordcount/word.txt")
rdd: org.apache.spark.rdd.RDD[String] = /wordcount/word.txt MapPartitionsRDD[1] at textFile at <console>:24
Here we use word count as the example.
scala> val word_map = rdd.flatMap((_.split(" "))).filter((_ != " ")).map((_, 1))
word_map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
Here flatMap, filter, and map are chained into a single pipeline.
scala> word_map.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@5b2235a5)
scala> word_map.dependencies.size
res1: Int = 1
We can see that word_map's dependency is a OneToOneDependency, i.e., a narrow dependency.
scala> word_map.dependencies.foreach{ dep =>
| println(dep.getClass)
| println(dep.rdd)
| println(dep.rdd.partitions)
| println(dep.rdd.partitions.size)
| }
class org.apache.spark.OneToOneDependency
MapPartitionsRDD[3] at filter at <console>:26
[Lorg.apache.spark.Partition;@54077082
1
This prints some information about word_map's dependencies.
scala> val word_reduce = word_map.reduceByKey((_ + _))
word_reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:28
Then reduceByKey is used to count the words, which produces a shuffle.
scala> word_reduce.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@4968eb3f)
scala> word_reduce.dependencies.size
res5: Int = 1
We can see that word_reduce's dependency is a ShuffleDependency, i.e., a wide dependency.
scala> word_reduce.dependencies.foreach{ dep =>
| println(dep.getClass)
| println(dep.rdd)
| println(dep.rdd.partitions)
| println(dep.rdd.partitions.size)
| }
class org.apache.spark.ShuffleDependency
MapPartitionsRDD[4] at map at <console>:26
[Lorg.apache.spark.Partition;@54077082
1
This prints some information about word_reduce's dependencies.
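As a hedged closing sketch, the same information can be collected for the whole lineage by walking dependencies recursively (run in the same shell session as above):

def printLineage(rdd: org.apache.spark.rdd.RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd + " -> " +
    rdd.dependencies.map(_.getClass.getSimpleName).mkString(", "))
  rdd.dependencies.foreach(d => printLineage(d.rdd, depth + 1))
}
printLineage(word_reduce)
// Expect a chain of OneToOneDependency entries below a single ShuffleDependency.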