The code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sparkConf: SparkConf = new SparkConf().setAppName("LocalTest").setMaster("local[*]")
  val sc: SparkContext = new SparkContext(sparkConf)

  // Source RDD created from a collection: it has no parent, so no dependencies.
  val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5), 3)
  println(rdd.toDebugString)
  println(rdd.dependencies)
  println("-------------")

  // map() produces a new RDD that depends one-to-one on rdd.
  val mapRDD: RDD[(Int, Int)] = rdd.map((_, 1))
  println(mapRDD.toDebugString)
  println(mapRDD.dependencies)
  println("-------------")

  // reduceByKey() shuffles the data, so the new RDD has a shuffle dependency on mapRDD.
  val resRDD: RDD[(Int, Int)] = mapRDD.reduceByKey(_ + _)
  println(resRDD.toDebugString)
  println(resRDD.dependencies)
  println("-------------")

  sc.stop()
}
The output is as follows:
(3) ParallelCollectionRDD[0] at makeRDD at TestBiBao.scala:12 []
List()
-------------
(3) MapPartitionsRDD[1] at map at TestBiBao.scala:17 []
| ParallelCollectionRDD[0] at makeRDD at TestBiBao.scala:12 []
List(org.apache.spark.OneToOneDependency@5fb7183b)
-------------
(3) ShuffledRDD[2] at reduceByKey at TestBiBao.scala:22 []
+-(3) MapPartitionsRDD[1] at map at TestBiBao.scala:17 []
| ParallelCollectionRDD[0] at makeRDD at TestBiBao.scala:12 []
List(org.apache.spark.ShuffleDependency@7afbf561)
-------------
Process finished with exit code 0
From this output we can see:
1. The leading (3) means the RDD has 3 partitions.
2. rdd is the first RDD created, so it has no dependencies (List()), and it is a ParallelCollectionRDD.
3. mapRDD is produced by map(); it depends on rdd and is a MapPartitionsRDD.
4. resRDD is produced by reduceByKey(); it depends on mapRDD and is a ShuffledRDD.
We can also see that:
resRDD's dependency is a ShuffleDependency;
mapRDD (a MapPartitionsRDD) has a OneToOneDependency.
Looking at the source code:
ShuffleDependency extends Dependency directly:
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]]
OneToOneDependency extends NarrowDependency, which in turn extends Dependency:
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
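The getParents contract can be illustrated without Spark at all. The classes below are simplified stand-ins (hypothetical names, not the real Spark classes) for OneToOneDependency and for RangeDependency, the narrow dependency that union() uses; their getParents bodies mirror the logic of the real implementations:

```scala
// Hypothetical stand-ins for Spark's narrow dependencies (not the real classes).
trait MiniNarrowDep {
  def getParents(partitionId: Int): Seq[Int]
}

// One-to-one (map/filter style): child partition i reads exactly parent partition i.
class MiniOneToOne extends MiniNarrowDep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

// Range (union style): a block of child partitions maps, shifted by an offset,
// onto a block of parent partitions.
class MiniRange(inStart: Int, outStart: Int, length: Int) extends MiniNarrowDep {
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else
      Nil
}

println(new MiniOneToOne().getParents(2))     // List(2)
println(new MiniRange(0, 3, 3).getParents(4)) // List(1): child 4 reads parent 1 of the second input
```

Either way, each child partition resolves to a small, fixed set of parent partitions, which is exactly what makes a dependency narrow.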
From this it is clear that Dependency splits into two branches: ShuffleDependency (wide dependency) and NarrowDependency (narrow dependency), and NarrowDependency itself has three subclasses (OneToOneDependency, RangeDependency, and the internal PruneDependency).
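Because the two branches are distinct types, a dependency can be classified with a plain type test. The sketch below uses stand-in classes (hypothetical names mirroring Spark's hierarchy, not the real API) to show the shape of such a check:

```scala
// Hypothetical stand-ins mirroring Spark's Dependency hierarchy (not the real classes).
abstract class MiniDependency
abstract class MiniNarrowDependency extends MiniDependency
class MiniOneToOneDependency extends MiniNarrowDependency
class MiniShuffleDependency extends MiniDependency

// Classify a dependency as wide or narrow by a type test; with the real API
// one would match on ShuffleDependency and NarrowDependency the same way.
def depKind(dep: MiniDependency): String = dep match {
  case _: MiniShuffleDependency => "wide"
  case _: MiniNarrowDependency  => "narrow"
}

println(depKind(new MiniOneToOneDependency)) // narrow
println(depKind(new MiniShuffleDependency))  // wide
```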
So what exactly are narrow and wide dependencies?
A narrow dependency means each partition of the child RDD depends on a fixed set of partitions of its parent RDD(s).
A wide dependency means each partition of the child RDD may depend on all partitions of its parent RDD(s).
Typical narrow-dependency operations: map, filter, union (a special case, backed by RangeDependency).
Typical wide-dependency operations: groupByKey, sortByKey.
In other words, if each partition of the parent RDD is consumed by at most one partition of the child RDD, the dependency is narrow; otherwise it is wide.
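The partition-level distinction can be sketched with a small model (hypothetical helper functions, not Spark code): in a narrow dependency a parent partition feeds at most one child partition, while a shuffle fans each parent partition out to every child partition.

```scala
// Hypothetical model: which child partitions may consume a given parent partition?

// Narrow (map/filter style): parent partition i feeds only child partition i.
def narrowChildren(parentPartition: Int): Seq[Int] =
  List(parentPartition)

// Wide (shuffle style, e.g. reduceByKey with a hash partitioner): every parent
// partition may hold keys destined for any of the child partitions.
def wideChildren(parentPartition: Int, numChildPartitions: Int): Seq[Int] =
  (0 until numChildPartitions).toList

println(narrowChildren(1))  // List(1)
println(wideChildren(1, 3)) // List(0, 1, 2)
```

This fan-out is also why a wide dependency forces a stage boundary in toDebugString (the +- marker in the output above), while narrow dependencies are pipelined within one stage.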