Spark coalesce vs. repartition: differences and use cases

aoren1305

Tags: big data

Original link: http://www.cnblogs.com/Alcesttt/p/11386049.html

Differences:

Under the hood, repartition simply calls coalesce with shuffle = true, so it always performs a shuffle.

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

The shuffle parameter of coalesce defaults to false, so by default coalesce does not shuffle.

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
 
    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
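To see why the shuffle branch spreads elements evenly, the key-assignment logic can be sketched in plain Scala without Spark. This is a minimal simulation, not Spark code: the object `DistributeSketch` and its helpers `assignKeys`, `targetPartition`, and `countsPerTarget` are hypothetical names, and `targetPartition` is only a stand-in for what HashPartitioner does with an Int key (a non-negative modulo).

```scala
import scala.util.Random

object DistributeSketch {
  // Mirrors the key assignment in distributePartition for one input partition:
  // start at a pseudo-random position seeded by the partition index,
  // then increment the position for every element.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Stand-in for HashPartitioner (assumption): an Int hashes to itself,
  // and the partition is the non-negative remainder modulo numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // How many elements of a single input partition land in each output partition.
  def countsPerTarget(index: Int, numItems: Int, numPartitions: Int): Map[Int, Int] =
    assignKeys(index, Iterator.range(0, numItems), numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toSeq
      .groupBy(identity)
      .map { case (p, hits) => p -> hits.size }

  def main(args: Array[String]): Unit = {
    val counts = countsPerTarget(index = 3, numItems = 10, numPartitions = 4)
    counts.toSeq.sorted.foreach { case (p, n) => println(s"output partition $p: $n elements") }
  }
}
```

Because consecutive elements get consecutive keys, each input partition deals its elements round-robin over the output partitions, so the per-partition counts differ by at most one. The random starting position only changes which output partition receives the first element.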

Use cases:

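If you are reducing the number of partitions, consider coalesce, since it avoids a shuffle. Keep in mind that with fewer partitions each task has to process more data, so if memory is insufficient this can cause an out-of-memory error.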
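The following sketch illustrates how the two calls behave with respect to partition counts. It assumes spark-core is on the classpath and runs in local mode; the object name `CoalesceVsRepartition`, the helper `partitionCounts`, the app name, and the concrete numbers (100, 10, 200) are illustrative choices, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  // Returns the partition counts after four different calls on a 100-partition RDD.
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 100)

    // Shrinking: coalesce merges existing partitions, no shuffle.
    val fewer = rdd.coalesce(10).getNumPartitions

    // Growing with coalesce and shuffle = false has no effect:
    // the RDD stays at its current number of partitions.
    val notMore = rdd.coalesce(200).getNumPartitions

    // Growing requires a shuffle, which is what repartition does.
    val more = rdd.repartition(200).getNumPartitions

    // Drastic shrink: shuffle = true keeps the upstream stage running
    // on 100 tasks instead of squeezing all the work into a single task.
    val single = rdd.coalesce(1, shuffle = true).getNumPartitions

    (fewer, notMore, more, single)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 4 threads; adjust the master for a real cluster.
    val conf = new SparkConf().setMaster("local[4]").setAppName("coalesce-vs-repartition")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // (10,100,200,1)
    } finally {
      sc.stop()
    }
  }
}
```

As a rule of thumb under these assumptions: use coalesce for a moderate reduction, and pass shuffle = true (or call repartition) when increasing the partition count or when collapsing to very few partitions would otherwise put too much data and computation on too few tasks.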

Reposted from: https://www.cnblogs.com/Alcesttt/p/11386049.html
