Narrow dependencies stay within a single Stage, while wide dependencies sit on Stage boundaries: the DAGScheduler splits a job into Stages at wide (shuffle) dependencies.
Test code:
package cogroup

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Join_Dependency {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[10]").setAppName(this.getClass.getCanonicalName)
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")
    println(this.getClass.getCanonicalName) // package name + class name

    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // join without pre-partitioning the inputs
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    rdd3.count()
    println(rdd3.dependencies) // wide dependency, yet prints List(org.apache.spark.OneToOneDependency@1af1cf17) -- explained below

    // join after partitioning both sides with the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    rdd4.count()
    println(rdd4.dependencies) // narrow dependency: List(org.apache.spark.OneToOneDependency@6fcd5026)

    Thread.sleep(100000) // keep the app alive so the Spark UI can be inspected
    sc.stop()
  }
}
Output:
cogroup.Join_Dependency$
List(org.apache.spark.OneToOneDependency@5cb445f2)
List(org.apache.spark.OneToOneDependency@24932b2d)
Corresponding dependencies:
rdd3's join is a wide dependency
rdd4's join is a narrow dependency
Spark UI: the stage DAG shows a shuffle at rdd3's join, while rdd4's only shuffles are the two partitionBy steps that precede its join.
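Besides the Spark UI, the shuffle boundary can also be checked from the driver with toDebugString, which indents the lineage at each shuffle boundary. A minimal sketch using the RDDs from the test code (e.g. added before sc.stop(); exact output varies per run):

// Shuffle boundaries show up as indented "+-(n)" blocks in the printed lineage
println(rdd3.toDebugString) // the CoGroupedRDD reads both inputs across a shuffle
println(rdd4.toDebugString) // the only shuffles are the two partitionBy steps before the join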
Code walkthrough:
- First, the parameterless join simply delegates to the two-argument overload, passing in the other RDD and a default partitioner:
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}
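What that default resolves to here can be checked directly, since defaultPartitioner lives on the org.apache.spark.Partitioner companion object. A small sketch, assuming spark.default.parallelism is not set (as in the test code):

import org.apache.spark.Partitioner.defaultPartitioner
// rdd1 and rdd2 have no partitioner and 10 partitions each under local[10],
// so the chosen default is a HashPartitioner with 10 partitions
println(defaultPartitioner(rdd1, rdd2).numPartitions) // 10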
- In the code below: when spark.default.parallelism is not set, defaultNumPartitions is rdds.map(_.partitions.length).max, i.e. the largest partition count among the RDDs being joined (here rdd1 and rdd2 were built by makeRDD without an explicit count, so under local[10] they have 10 partitions each); when it is set, rdd.context.defaultParallelism is used instead. If none of the inputs already carries a usable partitioner, the method falls through to new HashPartitioner(defaultNumPartitions).
In this example, rdd3's join therefore gets a HashPartitioner with 10 partitions, while rdd4's join reuses the HashPartitioner(3) we installed via partitionBy, because the existing partitioner with the most partitions passes the eligibility check and is returned as-is.
/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
 * as the default partitions number, otherwise we'll use the max number of upstream partitions.
 *
 * When available, we choose the partitioner from rdds with maximum number of partitions. If this
 * partitioner is eligible (number of partitions within an order of maximum number of partitions
 * in rdds), or has partition number higher than or equal to default partitions number - we use
 * this partitioner.
 *
 * Otherwise, we'll use a new HashPartitioner with the default partitions number.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the same as the
 * number of partitions in the largest upstream RDD, as this should be least likely to cause
 * out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}
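Tying this back to the test code, the partition counts of the joined RDDs reflect exactly that logic. A small sanity check, assuming local[10] and no spark.default.parallelism:

println(rdd1.getNumPartitions) // 10 -- makeRDD without a count uses defaultParallelism
println(rdd3.getNumPartitions) // 10 -- new HashPartitioner(10), the max upstream partition count
println(rdd4.getNumPartitions) // 3  -- the HashPartitioner(3) we supplied is reused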
- This brings us to the join overload that does the actual work. Since flatMapValues only adds a narrow (one-to-one) dependency, the interesting part is how cogroup is implemented:
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
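In other words, a join can be spelled out by hand as a cogroup followed by flatMapValues. The sketch below mirrors what a join with an explicit HashPartitioner(3) expands to (manualJoin is just an illustrative name):

// Hand-rolled equivalent of rdd1.join(rdd2, new HashPartitioner(3))
val manualJoin: RDD[(Int, (String, String))] =
  rdd1.cogroup(rdd2, new HashPartitioner(3)).flatMapValues { case (vs, ws) =>
    for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
  }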
- Inside cogroup, the core is CoGroupedRDD, built from the two RDDs being joined and a partitioner. For the first join (rdd3), neither input has a partitioner, so both must first be shuffled according to the partitioner passed in -- the new ShuffleDependency branch -- which is why rdd3's join is a wide dependency. For the second join (rdd4), both inputs are already partitioned by the same HashPartitioner(3), so the new OneToOneDependency(rdd) branch is taken and no further shuffle is needed -- a narrow dependency.
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
- Now look at how CoGroupedRDD's getDependencies is implemented -- this is where the wide/narrow decision is actually made.
- Why do both joins print OneToOneDependency, then? Because rdd3.dependencies / rdd4.dependencies are asked of the RDD returned by flatMapValues (a MapPartitionsRDD), whose only dependency on its immediate parent is one-to-one; any ShuffleDependency sits one level deeper in the lineage, between the CoGroupedRDD and the two input RDDs. getDependencies below is where that per-input choice is made: if an input RDD's partitioner equals the CoGroupedRDD's partitioner (part), a OneToOneDependency is used, otherwise a ShuffleDependency. Note also that the joined RDD itself ends up carrying the join's partitioner (mapValues/flatMapValues preserve partitioning), as the printlns further down confirm.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
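One detail makes the rdd.partitioner == Some(part) check pass for rdd4's inputs even though rdd1.partitionBy and rdd2.partitionBy created separate HashPartitioner(3) instances: HashPartitioner defines equality by numPartitions, not by object identity. A quick sketch:

println(new HashPartitioner(3) == new HashPartitioner(3))  // true: equals compares numPartitions
println(new HashPartitioner(3) == new HashPartitioner(10)) // false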
println(rdd1.partitioner)
println(rdd2.partitioner)
println(rdd3.partitioner)
Output:
None
None
Some(org.apache.spark.HashPartitioner@a)
(@a is the hex hashCode, and HashPartitioner's hashCode is its numPartitions, so rdd3 carries the HashPartitioner(10) that defaultPartitioner chose.)
Summary:
When is a join a wide dependency, and when is it a narrow one?
From the analysis above: if the two RDDs being joined already have matching partitioners (the same partitioner with the same number of partitions), records with the same key are already in the same partition, so the join is a narrow dependency. Conversely, if either input has no partitioner, or the partitioning does not match the partitioner the join uses, the data must be shuffled at join time, making it a wide dependency.
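A practical corollary: when an RDD will be joined repeatedly, pre-partition (and cache) both sides with the same partitioner so the shuffle is paid once and the joins themselves stay narrow. A sketch, with left/right/joined as illustrative names:

val part = new HashPartitioner(3)
val left: RDD[(Int, String)] = rdd1.partitionBy(part).cache()  // shuffles once, here
val right: RDD[(Int, String)] = rdd2.partitionBy(part).cache()
val joined = left.join(right) // narrow: both parents already match the join's partitioner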