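Preface:
Wide and narrow dependencies are simply the two kinds of dependency between a child RDD and its parent RDDs. The RDD constructor takes deps: Seq[Dependency[_]], which holds exactly this information, and Dependency has two broad categories: narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency). An RDD can only be created from scratch or derived from other RDDs through a transformation, and the dependency type between RDDs is mostly determined by the specific transformation function.
Spark splits a job into stages according to the dependency between each RDD and its parents: a wide dependency puts parent and child in different stages, while narrow dependencies are pipelined within the same stage. [Figure: example of stage division]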
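The stage rule above can be sketched in plain Scala without Spark. This is a minimal model, assuming a purely linear lineage; the names Dep, Step and StageSplit are hypothetical and exist only for this illustration. The real scheduler walks a DAG backward from the final RDD, which this sketch does not attempt.

```scala
// Hypothetical toy model: not Spark classes.
sealed trait Dep
case object Narrow extends Dep
case object Wide extends Dep

// One transformation in a linear lineage and how it depends on its parent.
final case class Step(name: String, dep: Dep)

object StageSplit {
  /** Groups a linear lineage into stages: every wide dependency starts a new stage. */
  def split(source: String, steps: Seq[Step]): Vector[Vector[String]] =
    steps.foldLeft(Vector(Vector(source))) { (stages, step) =>
      step.dep match {
        // Narrow: pipeline the step into the current (last) stage.
        case Narrow => stages.init :+ (stages.last :+ step.name)
        // Wide: the shuffle boundary closes the current stage.
        case Wide => stages :+ Vector(step.name)
      }
    }
}
```

For example, a lineage of parallelize, map (narrow), groupByKey (wide), filter (narrow) is split into two stages, with groupByKey and filter pipelined together in the second one.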
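1. RDD
Let us start with the RDD source code. RDD is an abstract class that implements most of the methods common to all RDDs. Its primary constructor takes two parameters: a SparkContext, and deps: Seq[Dependency[_]], which describes the RDD's dependencies on all of its parent RDDs. Subclasses must implement two abstract methods: compute and getPartitions.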
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]
}
RDD also provides a method, getDependencies, that subclasses can override to define the concrete dependencies of that RDD type:
protected def getDependencies: Seq[Dependency[_]] = deps
In addition, RDD has an auxiliary constructor that takes a single RDD[_]; it defaults to a OneToOneDependency, that is, a narrow dependency.
/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))
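An RDD is either created from scratch or derived from other RDDs. A newly created RDD has no parent RDD, so its dependency list is empty.
Every other RDD comes from a transformation, and its dependency type is determined by the transformation function. A transformation takes one or more RDDs and returns a new RDD, and the concrete type of the returned RDD varies with the transformation.
Take the common flatMap as an example; it returns a MapPartitionsRDD: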
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}
MapPartitionsRDD is constructed through the parent class constructor this(@transient oneParent: RDD[_]), so it defaults to a narrow dependency.
class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev)
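2. Dependency
Dependencies fall into two broad categories: NarrowDependency (narrow) and ShuffleDependency (wide). NarrowDependency in turn has three subclasses: OneToOneDependency, RangeDependency and PruneDependency.
[Figure: class hierarchy of the Dependency implementations]
map and filter produce a OneToOneDependency: partition i of the child RDD depends on exactly partition i of the parent RDD. [Figure: OneToOneDependency example]
union produces one RangeDependency per parent: a contiguous range of child partitions maps one-to-one onto a range of partitions in that parent. The child has several parents, but each child partition still reads from a single parent partition. [Figure: RangeDependency example]
filterByRange on a range-partitioned RDD involves a PruneDependency (through PartitionPruningRDD): the child keeps only a subset of the parent's partitions, and each child partition reads from exactly one retained parent partition.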
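To make the three partition mappings concrete, here is a small sketch in plain Scala. NarrowDep, OneToOneDep, RangeDep, PruneDep and UnionExample are hypothetical stand-ins written for this illustration, not Spark classes; they keep only the getParents mapping and drop the reference to the parent RDD.

```scala
// Hypothetical simplified stand-ins for Spark's NarrowDependency subclasses.
sealed trait NarrowDep {
  /** Parent partition indices that the given child partition reads from. */
  def getParents(partitionId: Int): Seq[Int]
}

/** Child partition i reads parent partition i (map, filter, flatMap). */
case object OneToOneDep extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

/** Child partitions [outStart, outStart + length) map onto parent partitions starting at inStart. */
final case class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil // this child partition belongs to a different parent
    }
}

/** The child keeps only the parent partitions whose index passes the filter. */
final case class PruneDep(parentPartitions: Int, keep: Int => Boolean) extends NarrowDep {
  private val kept: Vector[Int] = (0 until parentPartitions).filter(keep).toVector
  def numChildPartitions: Int = kept.length
  def getParents(partitionId: Int): Seq[Int] = List(kept(partitionId))
}

object UnionExample {
  // Union of a 2-partition parent and a 3-partition parent: one RangeDep per parent.
  val deps: Seq[RangeDep] = Seq(RangeDep(0, 0, 2), RangeDep(0, 2, 3))
}
```

In UnionExample, child partition 3 has no parent in the first RDD and reads partition 1 of the second RDD. In every case a child partition reads from at most one partition of a given parent, which is what makes the dependency narrow.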
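3. Common transformations that produce a shuffle
Whether a dependency is a shuffle dependency depends not only on the kind of transformation but usually also on the partitioner and partition count of the RDD and its parent. Let us look at the implementation of two transformations that may produce a shuffle dependency: groupByKey and join.
3.1 groupByKey
groupByKey delegates to combineByKeyWithClassTag: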
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
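The key is the check self.partitioner == Some(partitioner): self.partitioner is the partitioner of the RDD being operated on, and partitioner is the method argument. If they are equal the result is a narrow dependency; otherwise it is a wide dependency.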
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
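self.partitioner equals Some(partitioner) only when the parent RDD already has a partitioner of the same type as the requested one, with the same number of partitions. For HashPartitioner, equals is: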
override def equals(other: Any): Boolean = other match {
  case h: HashPartitioner =>
    h.numPartitions == numPartitions
  case _ =>
    false
}
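Why comparing only numPartitions is enough can be seen from how a hash partitioner places keys. The class below is a minimal sketch in plain Scala; SimpleHashPartitioner is a hypothetical name for this illustration and is not Spark's HashPartitioner, although it follows the same idea of a non-negative modulo of the key's hashCode.

```scala
// Hypothetical simplified hash partitioner, independent of Spark.
final class SimpleHashPartitioner(val numPartitions: Int) {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

  /** Maps a key to a partition index in [0, numPartitions). */
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      val rawMod = key.hashCode % numPartitions
      // % can be negative on the JVM, so shift negative results into range.
      rawMod + (if (rawMod < 0) numPartitions else 0)
  }

  // Two hash partitioners are interchangeable when they have the same partition count.
  override def equals(other: Any): Boolean = other match {
    case h: SimpleHashPartitioner => h.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
```

Because placement depends only on the key and on numPartitions, two equal partitioners send every key to the same partition index. Data that is already laid out by an equal partitioner therefore does not need to move, which is why the equality check can safely select the narrow path.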
For a RangePartitioner, equality requires the same range bounds and the same sort order:
override def equals(other: Any): Boolean = other match {
  case r: RangePartitioner[_, _] =>
    r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
  case _ =>
    false
}
Otherwise a ShuffledRDD is returned, whose dependency is a ShuffleDependency:
override def getDependencies: Seq[Dependency[_]] = {
  val serializer = userSpecifiedSerializer.getOrElse {
    val serializerManager = SparkEnv.get.serializerManager
    if (mapSideCombine) {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
    } else {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
    }
  }
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
Test code:
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    // HashPartitioner equality depends only on the number of partitions.
    val p1 = new HashPartitioner(3)
    val p2 = new HashPartitioner(3)
    println(s"p1 == p2 ? = ${Some(p1) == Some(p2)}")
    val p3 = new HashPartitioner(1)
    println(s"p1 == p3 ? = ${Some(p1) == Some(p3)}")

    // The original used the author's own helper Sparkutil.getSpark(); a local session is assumed here.
    val spark = SparkSession.builder().master("local[*]").appName("DependencyTest").getOrCreate()
    val rdd1 = spark.sparkContext.parallelize(Array((1, 2), (2, 3), (1, 3), (3, 3), (4, 3))).repartition(2)
    println("rdd1 partitioner:" + rdd1.partitioner)
    println("rdd1 dependencies:" + rdd1.dependencies)
    println("------------------------------------------ \n")

    val rdd2 = rdd1.repartition(1)
    println(s"rdd1:[${rdd1.partitioner}] == rdd2:[${rdd2.partitioner}] ? = ${rdd1.partitioner == rdd2.partitioner}")
    println("rdd2 dependencies:" + rdd2.dependencies)
    println("------------------------------------------ \n")

    // rdd2 has no partitioner, so groupByKey must shuffle.
    val rdd3 = rdd2.groupByKey(1)
    println(s"rdd2:[${rdd2.partitioner}] == rdd3:[${rdd3.partitioner}] ? = ${rdd2.partitioner == rdd3.partitioner}")
    println("rdd3 dependencies:" + rdd3.dependencies)
    println("------------------------------------------ \n")

    // rdd3 is already hash-partitioned into 1 partition, so this groupByKey is narrow.
    val rdd4 = rdd3.groupByKey(1)
    println(s"rdd3:[${rdd3.partitioner}] == rdd4:[${rdd4.partitioner}] ? = ${rdd3.partitioner == rdd4.partitioner}")
    println("rdd4 dependencies:" + rdd4.dependencies)
    println("------------------------------------------ \n")

    // A different partition count means a different partitioner, so this one shuffles again.
    val rdd5 = rdd4.groupByKey(2)
    println(s"rdd4:[${rdd4.partitioner}] == rdd5:[${rdd5.partitioner}] ? = ${rdd4.partitioner == rdd5.partitioner}")
    println("rdd5 dependencies:" + rdd5.dependencies)

    spark.stop()
  }
}
Output. Note that rdd1 and rdd2 report a OneToOneDependency even though repartition shuffles: repartition wraps its ShuffledRDD in a CoalescedRDD followed by a map, so the shuffle sits further up the lineage and dependencies only shows the direct parent.
p1 == p2 ? = true
p1 == p3 ? = false
rdd1 partitioner:None
rdd1 dependencies:List(org.apache.spark.OneToOneDependency@31edeac)
------------------------------------------
rdd1:[None] == rdd2:[None] ? = true
rdd2 dependencies:List(org.apache.spark.OneToOneDependency@74db12c2)
------------------------------------------
rdd2:[None] == rdd3:[Some(org.apache.spark.HashPartitioner@1)] ? = false
rdd3 dependencies:List(org.apache.spark.ShuffleDependency@2dd8239)
------------------------------------------
rdd3:[Some(org.apache.spark.HashPartitioner@1)] == rdd4:[Some(org.apache.spark.HashPartitioner@1)] ? = true
rdd4 dependencies:List(org.apache.spark.OneToOneDependency@319c3a25)
------------------------------------------
rdd4:[Some(org.apache.spark.HashPartitioner@1)] == rdd5:[Some(org.apache.spark.HashPartitioner@2)] ? = false
rdd5 dependencies:List(org.apache.spark.ShuffleDependency@1dd7796b)
3.2 join
join is another common transformation that can produce a shuffle. A join is essentially a cogroup operation, which returns a CoGroupedRDD:
/**
 * :: DeveloperApi ::
 * An RDD that cogroups its parents. For each key k in parent RDDs, the resulting RDD contains a
 * tuple with the list of values for that key.
 *
 * @param rdds parent RDDs.
 * @param part partitioner used to partition the shuffle output
 *
 * @note This is an internal API. We recommend users use RDD.cogroup(...) instead of
 * instantiating this directly.
 */
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

  // ... other members omitted in this excerpt ...

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
}

/**
 * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
 * tuple with the list of values for that key in `this`, `other1` and `other2`.
 */
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
  cogroup(other1, other2, defaultPartitioner(self, other1, other2))
}
The key piece is the defaultPartitioner method, which picks the partitioner passed to cogroup:
/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
 * as the default partitions number, otherwise we'll use the max number of upstream partitions.
 *
 * When available, we choose the partitioner from rdds with maximum number of partitions. If this
 * partitioner is eligible (number of partitions within an order of maximum number of partitions
 * in rdds), or has partition number higher than or equal to default partitions number - we use
 * this partitioner.
 *
 * Otherwise, we'll use a new HashPartitioner with the default partitions number.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the same as the
 * number of partitions in the largest upstream RDD, as this should be least likely to cause
 * out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}
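The way defaultPartitioner and CoGroupedRDD.getDependencies interact can be traced with a small model in plain Scala. Everything here is hypothetical and only for illustration: ToyRdd stands in for an RDD, a hash partitioner is represented by its partition count alone, and spark.default.parallelism is passed as an Option. isEligible is not shown in the excerpt above; the version below is an assumption based on the doc comment's "within an order of maximum number of partitions".

```scala
import scala.math.log10

// Hypothetical stand-in for an RDD: its partition count and, if it has a
// hash partitioner, that partitioner's partition count.
final case class ToyRdd(numPartitions: Int, partitioner: Option[Int])

object DefaultPartitionerModel {
  // Assumption: eligible means within one order of magnitude of the largest parent.
  private def isEligible(candidate: ToyRdd, rdds: Seq[ToyRdd]): Boolean =
    log10(rdds.map(_.numPartitions).max) - log10(candidate.numPartitions) < 1

  /** Partition count of the hash partitioner that would be chosen. */
  def choose(rdds: Seq[ToyRdd], defaultParallelism: Option[Int]): Int = {
    val withPartitioner = rdds.filter(_.partitioner.exists(_ > 0))
    val maxWithPartitioner =
      if (withPartitioner.nonEmpty) Some(withPartitioner.maxBy(_.numPartitions)) else None
    val defaultNum = defaultParallelism.getOrElse(rdds.map(_.numPartitions).max)
    maxWithPartitioner match {
      // Reuse the existing partitioner when it is eligible or large enough.
      case Some(r) if isEligible(r, rdds) || defaultNum <= r.numPartitions => r.partitioner.get
      // Otherwise fall back to a new hash partitioner of the default size.
      case _ => defaultNum
    }
  }

  /** Per-parent dependency kind, mirroring the rdd.partitioner == Some(part) check. */
  def dependencies(rdds: Seq[ToyRdd], chosen: Int): Seq[String] =
    rdds.map(r => if (r.partitioner.contains(chosen)) "OneToOne" else "Shuffle")
}
```

With parents ToyRdd(4, Some(4)) and ToyRdd(4, None) and no default parallelism, the existing 4-partition partitioner is reused, so the first parent gets a narrow dependency and only the second is shuffled. With ToyRdd(2, Some(2)) and ToyRdd(100, None), the existing partitioner is too small to be eligible, a new 100-partition partitioner is chosen, and both parents are shuffled.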