Learning a Bit of Spark Source Code Every Day: coalesce

myCity_NJ

Category: Learning a Bit of Spark Source Code Every Day

Article link: https://blog.csdn.net/myCity_NJ/article/details/79555040

1. coalesce (20180314)

/**  The source code is pasted here for reference
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
  }
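
To see why the shuffle branch spreads elements evenly, here is a minimal sketch in plain Scala with no Spark dependency. It copies the key assignment from `distributePartition` and imitates the partitioner with a non-negative modulo. The names `DistributeSketch`, `targetPartition`, and `spread` are made up for this illustration and are not part of Spark.

```scala
import scala.util.Random

object DistributeSketch {
  // Same key assignment as the shuffle branch: start from a pseudo-random
  // position seeded by the input partition index, then add 1 per element.
  def distributePartition[T](numPartitions: Int)(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Hypothetical stand-in for a hash partitioner: the hashCode of an Int key
  // is the Int itself, so the target is a non-negative modulo of the key.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Counts how many elements of one input partition land in each output partition.
  def spread(index: Int, size: Int, numPartitions: Int): Map[Int, Int] =
    distributePartition(numPartitions)(index, Iterator.range(0, size))
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toList
      .groupBy(identity)
      .map { case (partition, hits) => partition -> hits.size }

  def main(args: Array[String]): Unit = {
    // 10 elements from input partition 0, redistributed over 4 output partitions.
    val counts = spread(index = 0, size = 10, numPartitions = 4)
    println(counts.toSeq.sortBy(_._1))
  }
}
```

Because the keys of one input partition are consecutive integers, the per-partition counts produced by `spread` differ by at most one. The seeded random start only changes which output partition receives the first element.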

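Terms: shuffle = to reshuffle, as with a deck of cards; coalesce = to merge; drastic = severe; narrow dependency = a dependency in which each parent partition is used by at most one child partition.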

Usage: converting an RDD from N partitions to M partitions

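1. N > M with a modest gap: leave shuffle unset (it defaults to false). The existing partitions are merged through a narrow dependency and no data is shuffled. This differs from repartition, which is implemented as coalesce(numPartitions, shuffle = true) and therefore always shuffles.

If the gap is large, for example M = 1, and shuffle is not set, the coalesce is pipelined into the upstream stage, so the upstream computation may run in only M tasks (on a single node when M = 1) and performs poorly. With shuffle = true, the upstream partitions are still computed in parallel under their current partitioning, and the results are merged only after the shuffle.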

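2. N < M, where the N partitions are unevenly sized and a few are abnormally large: set shuffle = true so that the data is redistributed evenly across M partitions by a hash partitioner. Without shuffle = true, coalesce cannot increase the partition count and the RDD keeps its N partitions.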
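
The usage cases above can be tried with a short sketch like the following. It assumes Spark is on the classpath and runs against a local master. The object name `CoalesceUsageSketch`, the helper `partitionCounts`, the map labels, and the sizes (1000 elements, 100 input partitions) are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceUsageSketch {
  // Builds an RDD with N = 100 partitions and reports the partition count
  // after each kind of coalesce discussed in the usage notes.
  def partitionCounts(sc: SparkContext): Map[String, Int] = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 100)
    Map(
      // Case 1: N > M, no shuffle; partitions are merged via a narrow dependency.
      "coalesce(10)" -> rdd.coalesce(10).partitions.length,
      // Case 1, drastic reduction: shuffle = true keeps the upstream map parallel.
      "coalesce(1, shuffle = true)" -> rdd.map(_ * 2).coalesce(1, shuffle = true).partitions.length,
      // Case 2: N < M requires shuffle = true.
      "coalesce(400, shuffle = true)" -> rdd.coalesce(400, shuffle = true).partitions.length,
      // Without a shuffle, asking for more partitions leaves the count at N.
      "coalesce(400)" -> rdd.coalesce(400).partitions.length,
      // repartition always shuffles.
      "repartition(10)" -> rdd.repartition(10).partitions.length
    )
  }

  def main(args: Array[String]): Unit = {
    // Master URL and app name are placeholders for a local run.
    val conf = new SparkConf().setMaster("local[4]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      partitionCounts(sc).foreach { case (call, n) => println(s"$call -> $n partitions") }
    } finally {
      sc.stop()
    }
  }
}
```

The resulting partition count alone does not show whether a shuffle happened. Calling `toDebugString` on each resulting RDD prints its lineage, where a `ShuffledRDD` appears only in the shuffle = true and repartition cases.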
