Narrow dependencies stay within a single Stage, while wide dependencies sit on Stage boundaries: the DAGScheduler splits a job into Stages at wide (shuffle) dependencies.
Test code:
package cogroup

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Join_Dependency {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[10]").setAppName(this.getClass.getCanonicalName)
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")
    println(this.getClass.getCanonicalName) // package name + class name

    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // join without pre-partitioning the inputs
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    rdd3.count()
    println(rdd3.dependencies) // wide dependency, yet prints List(org.apache.spark.OneToOneDependency@1af1cf17) -- explained below

    // join after partitioning both sides with the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    rdd4.count()
    println(rdd4.dependencies) // narrow dependency: List(org.apache.spark.OneToOneDependency@6fcd5026)

    Thread.sleep(100000) // keep the app alive so the Spark UI can be inspected
    sc.stop()
  }
}
Output:
cogroup.Join_Dependency$
List(org.apache.spark.OneToOneDependency@5cb445f2)
List(org.apache.spark.OneToOneDependency@24932b2d)
Corresponding dependencies:
rdd3's join is a wide dependency
rdd4's join is a narrow dependency
Spark UI: the stage DAG shows a shuffle at rdd3's join, while rdd4's only shuffles are the two partitionBy steps that precede its join.
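Besides the Spark UI, the shuffle boundary can also be checked from the driver with toDebugString, which indents the lineage at each shuffle boundary. A minimal sketch using the RDDs from the test code (e.g. added before sc.stop(); exact output varies per run):

// Shuffle boundaries show up as indented "+-(n)" blocks in the printed lineage
println(rdd3.toDebugString) // the CoGroupedRDD reads both inputs across a shuffle
println(rdd4.toDebugString) // the only shuffles are the two partitionBy steps before the join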
Code walkthrough:
- First, the parameterless join simply delegates to the two-argument overload, passing in the other RDD and a default partitioner:
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}
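What that default resolves to here can be checked directly, since defaultPartitioner lives on the org.apache.spark.Partitioner companion object. A small sketch, assuming spark.default.parallelism is not set (as in the test code):

import org.apache.spark.Partitioner.defaultPartitioner
// rdd1 and rdd2 have no partitioner and 10 partitions each under local[10],
// so the chosen default is a HashPartitioner with 10 partitions
println(defaultPartitioner(rdd1, rdd2).numPartitions) // 10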
- In the code below: when spark.default.parallelism is not set, defaultNumPartitions is rdds.map(_.partitions.length).max, i.e. the largest partition count among the RDDs being joined (here rdd1 and rdd2 were built by makeRDD without an explicit count, so under local[10] they have 10 partitions each); when it is set, rdd.context.defaultParallelism is used instead. If none of the inputs already carries a usable partitioner, the method falls through to new HashPartitioner(defaultNumPartitions).
In this example, rdd3's join therefore gets a HashPartitioner with 10 partitions, while rdd4's join reuses the HashPartitioner(3) we installed via partitionBy, because the existing partitioner with the most partitions passes the eligibility check and is returned as-is.
/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
 * as the default partitions number, otherwise we'll use the max number of upstream partitions.
 *
 * When available, we choose the partitioner from rdds with maximum number of partitions. If this
 * partitioner is eligible (number of partitions within an order of maximum number of partitions
 * in rdds), or has partition number higher than or equal to default partitions number - we use
 * this partitioner.
 *
 * Otherwise, we'll use a new HashPartitioner with the default partitions number.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the same as the
 * number of partitions in the largest upstream RDD, as this should be least likely to cause
 * out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}
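Tying this back to the test code, the partition counts of the joined RDDs reflect exactly that logic. A small sanity check, assuming local[10] and no spark.default.parallelism:

println(rdd1.getNumPartitions) // 10 -- makeRDD without a count uses defaultParallelism
println(rdd3.getNumPartitions) // 10 -- new HashPartitioner(10), the max upstream partition count
println(rdd4.getNumPartitions) // 3  -- the HashPartitioner(3) we supplied is reused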
- This brings us to the join overload that does the actual work. Since flatMapValues only adds a narrow (one-to-one) dependency, the interesting part is how cogroup is implemented:
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
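In other words, a join can be spelled out by hand as a cogroup followed by flatMapValues. The sketch below mirrors what a join with an explicit HashPartitioner(3) expands to (manualJoin is just an illustrative name):

// Hand-rolled equivalent of rdd1.join(rdd2, new HashPartitioner(3))
val manualJoin: RDD[(Int, (String, String))] =
  rdd1.cogroup(rdd2, new HashPartitioner(3)).flatMapValues { case (vs, ws) =>
    for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
  }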
- Inside cogroup, the core is CoGroupedRDD, built from the two RDDs being joined and a partitioner. For the first join (rdd3), neither input has a partitioner, so both must first be shuffled according to the partitioner passed in -- the new ShuffleDependency branch -- which is why rdd3's join is a wide dependency. For the second join (rdd4), both inputs are already partitioned by the same HashPartitioner(3), so the new OneToOneDependency(rdd) branch is taken and no further shuffle is needed -- a narrow dependency.
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
- Now look at how CoGroupedRDD's getDependencies is implemented -- this is where the wide/narrow decision is actually made.
- Why do both joins print OneToOneDependency, then? Because rdd3.dependencies / rdd4.dependencies are asked of the RDD returned by flatMapValues (a MapPartitionsRDD), whose only dependency on its immediate parent is one-to-one; any ShuffleDependency sits one level deeper in the lineage, between the CoGroupedRDD and the two input RDDs. getDependencies below is where that per-input choice is made: if an input RDD's partitioner equals the CoGroupedRDD's partitioner (part), a OneToOneDependency is used, otherwise a ShuffleDependency. Note also that the joined RDD itself ends up carrying the join's partitioner (mapValues/flatMapValues preserve partitioning), as the printlns further down confirm.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
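One detail makes the rdd.partitioner == Some(part) check pass for rdd4's inputs even though rdd1.partitionBy and rdd2.partitionBy created separate HashPartitioner(3) instances: HashPartitioner defines equality by numPartitions, not by object identity. A quick sketch:

println(new HashPartitioner(3) == new HashPartitioner(3))  // true: equals compares numPartitions
println(new HashPartitioner(3) == new HashPartitioner(10)) // false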
println(rdd1.partitioner)
println(rdd2.partitioner)
println(rdd3.partitioner)
Output:
None
None
Some(org.apache.spark.HashPartitioner@a)
(@a is the hex hashCode, and HashPartitioner's hashCode is its numPartitions, so rdd3 carries the HashPartitioner(10) that defaultPartitioner chose.)
Summary:
When is a join a wide dependency, and when is it a narrow one?
From the analysis above: if the two RDDs being joined already have matching partitioners (the same partitioner with the same number of partitions), records with the same key are already in the same partition, so the join is a narrow dependency. Conversely, if either input has no partitioner, or the partitioning does not match the partitioner the join uses, the data must be shuffled at join time, making it a wide dependency.
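A practical corollary: when an RDD will be joined repeatedly, pre-partition (and cache) both sides with the same partitioner so the shuffle is paid once and the joins themselves stay narrow. A sketch, with left/right/joined as illustrative names:

val part = new HashPartitioner(3)
val left: RDD[(Int, String)] = rdd1.partitionBy(part).cache()  // shuffles once, here
val right: RDD[(Int, String)] = rdd2.partitionBy(part).cache()
val joined = left.join(right) // narrow: both parents already match the join's partitioner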