Spark: Is join a Wide or Narrow Dependency? A Look at the cogroup Implementation

This article explains how Spark divides stages based on wide and narrow dependencies. Sample code shows when a join produces a wide dependency and when it produces a narrow one, and how that affects stage boundaries. Without a pre-existing partitioner, or with mismatched partitioning, a join shuffles the data and creates a wide dependency; pre-partitioning both sides with the same partitioner avoids the shuffle and yields a narrow dependency. The role of CoGroupedRDD in dependency management is also discussed.

Narrow dependencies are grouped into the same stage, while wide dependencies end up in different stages: the DAGScheduler splits stages at wide-dependency (shuffle) boundaries.

Test code:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf: SparkConf = new SparkConf().setMaster("local[10]").setAppName(this.getClass.getCanonicalName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("WARN")

println(this.getClass.getCanonicalName) // package name + class name

val random = scala.util.Random
val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))

val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
rdd3.count()
println(rdd3.dependencies) // wide dependency: List(org.apache.spark.OneToOneDependency@1af1cf17)

val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
rdd4.count()
println(rdd4.dependencies) // narrow dependency: List(org.apache.spark.OneToOneDependency@6fcd5026)

Thread.sleep(100000) // keep the application alive so the Spark UI stays reachable
sc.stop()
Output:
cogroup.Join_Dependency$
List(org.apache.spark.OneToOneDependency@5cb445f2)
List(org.apache.spark.OneToOneDependency@24932b2d)


Corresponding dependencies:
rdd3 corresponds to a wide dependency
rdd4 corresponds to a narrow dependency
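
The dependency objects printed above do not show this directly (both print OneToOneDependency; the reason is explained below). Here is a minimal sketch, not from the original post, that uses toDebugString on the two result RDDs to see where the shuffle boundary actually sits:

// toDebugString marks each shuffle boundary with an indented "+-" branch.
// For rdd3 the branch appears at the join itself (the CoGroupedRDD reads shuffled data);
// for rdd4 the only branches come from the two partitionBy calls, and the join adds none.
println(rdd3.toDebugString)
println(rdd4.toDebugString)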

Spark UI

[Spark UI screenshots not included]

Code walkthrough:

  1. First, join is called with the RDD to be joined and a default partitioner:
/**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
  2. In the code below, if spark.default.parallelism is not set, defaultNumPartitions is rdds.map(_.partitions.length).max, the largest partition count among the input RDDs (with makeRDD under local[10] both inputs get 10 partitions, the number of cores); if it is set, SparkContext.defaultParallelism is used instead. If neither input already has a suitable partitioner, new HashPartitioner(defaultNumPartitions) is returned; otherwise the existing, eligible partitioner is reused.
    Here, the join that builds rdd3 therefore gets a HashPartitioner with 10 partitions, while the join that builds rdd4 reuses the HashPartitioner(3) we set via partitionBy (a quick check follows the source below).
/**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
   * as the default partitions number, otherwise we'll use the max number of upstream partitions.
   *
   * When available, we choose the partitioner from rdds with maximum number of partitions. If this
   * partitioner is eligible (number of partitions within an order of maximum number of partitions
   * in rdds), or has partition number higher than or equal to default partitions number - we use
   * this partitioner.
   *
   * Otherwise, we'll use a new HashPartitioner with the default partitions number.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the same as the
   * number of partitions in the largest upstream RDD, as this should be least likely to cause
   * out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than or equal to the default number of partitions, use the existing partitioner.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }
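
As a quick check (not part of the original post), the partitioner chosen by defaultPartitioner ends up on the join result and can simply be printed:

// Expected with the test code above (local[10], spark.default.parallelism unset):
// rdd3 carries a HashPartitioner with 10 partitions, rdd4 the HashPartitioner(3) we supplied.
println(s"${rdd3.partitioner}, ${rdd3.getNumPartitions}")
println(s"${rdd4.partitioner}, ${rdd4.getNumPartitions}")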
  3. Execution now reaches the join overload that does the real work. Since flatMapValues is a narrow transformation, all that remains is to look at how cogroup is implemented internally (a short equivalence sketch follows the source below).
/**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
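
To make the relationship concrete, here is a hedged sketch (not from the original post; the manualJoin name is just illustrative) that rebuilds the same join by hand with cogroup plus flatMapValues, mirroring the source above:

// Roughly what join does internally: cogroup the two RDDs, then emit the cross product
// of the two value iterables for every key.
val manualJoin: RDD[(Int, (String, String))] =
  rdd1.cogroup(rdd2).flatMapValues { case (vs, ws) =>
    for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
  }
println(manualJoin.count() == rdd3.count()) // expected: true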
  4. Inside cogroup the core is CoGroupedRDD, built from the two RDDs being joined and a partitioner. For the first join (rdd3) neither input has a partitioner yet, so both RDDs must first be shuffled according to the partitioner passed in; this takes the new ShuffleDependency branch, which is why the rdd3 join is a wide dependency. For the second join (rdd4) both inputs are already partitioned with that same partitioner, so the new OneToOneDependency(rdd) branch is taken, no further shuffle is needed, and the join is a narrow dependency (see the sketch after the source below).
 /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
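
CoGroupedRDD is a public @DeveloperApi class, so this decision can also be observed by constructing one directly; a hedged sketch, not from the original post, with illustrative variable names:

import org.apache.spark.rdd.CoGroupedRDD

// Parents with no matching partitioner: getDependencies should return two ShuffleDependency entries.
val cgWide = new CoGroupedRDD[Int](Seq(rdd1, rdd2), new HashPartitioner(3))
println(cgWide.dependencies.map(_.getClass.getSimpleName))

// Parents pre-partitioned with the very same partitioner: two OneToOneDependency entries.
val p = new HashPartitioner(3)
val cgNarrow = new CoGroupedRDD[Int](Seq(rdd1.partitionBy(p), rdd2.partitionBy(p)), p)
println(cgNarrow.dependencies.map(_.getClass.getSimpleName))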
  5. Look at how CoGroupedRDD's getDependencies method is implemented (the source is shown under the next point).

  6. Why do both joins print OneToOneDependency? Look again at the implementation of CoGroupedRDD.getDependencies: a parent whose partitioner equals the cogroup's partitioner gets a OneToOneDependency. Moreover, what we printed is the dependency list of the RDD returned by join, i.e. the MapPartitionsRDD produced by flatMapValues, which always depends one-to-one on the RDD right beneath it; any ShuffleDependency sits deeper in the lineage, inside the CoGroupedRDD (see the lineage-walking sketch after the output below). Printing the partitioners also shows that rdd3 itself is assigned the default hash partitioner after the join:

override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
    println(rdd1.partitioner)
    println(rdd2.partitioner)
    println(rdd3.partitioner)
Output:
  None
  None
  Some(org.apache.spark.HashPartitioner@a)
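
To see where the ShuffleDependency actually hides, here is a hedged helper (not from the original post; printDeps is just an illustrative name) that walks the whole lineage and prints each dependency type:

// Recursively print, for every RDD in the lineage, the type of dependency it has on each parent.
def printDeps(rdd: RDD[_], indent: String = ""): Unit =
  rdd.dependencies.foreach { dep =>
    println(s"$indent${rdd.getClass.getSimpleName} depends on ${dep.rdd.getClass.getSimpleName} via ${dep.getClass.getSimpleName}")
    printDeps(dep.rdd, indent + "  ")
  }

printDeps(rdd3) // the ShuffleDependency appears two levels down, on the CoGroupedRDD's parents
printDeps(rdd4) // the CoGroupedRDD level shows only OneToOneDependency; shuffles occur only at partitionBy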

Summary:

When is a join a wide dependency and when is it a narrow one?
From the analysis above: if the two RDDs being joined already have partitioners that match the partitioner the join uses (same partitioner type and the same number of partitions), then identical keys already sit in the same partition and the join is a narrow dependency. Conversely, if the inputs have no partitioner, or their partitioning does not match, the data must be shuffled during the join and it is a wide dependency.
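
One more hedged sketch (not in the original post; the mismatched name is illustrative) to stress the "same partitioner" part: pre-partitioning both sides with different partition counts is not enough, because defaultPartitioner picks the partitioner of the RDD with more partitions and the other side then has to be shuffled again:

// rdd2's side keeps its HashPartitioner(4); rdd1's side (HashPartitioner(3)) no longer matches
// the chosen partitioner and gets a ShuffleDependency, so this join is still wide.
val mismatched = rdd1.partitionBy(new HashPartitioner(3))
  .join(rdd2.partitionBy(new HashPartitioner(4)))
// Two levels below the join result sits the CoGroupedRDD; print its dependency types:
println(mismatched.dependencies.head.rdd.dependencies.head.rdd.dependencies
  .map(_.getClass.getSimpleName)) // expected: one ShuffleDependency and one OneToOneDependency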
