repartition vs. coalesce in Spark: Differences and Use Cases

Latest recommended article published on 2024-07-03 07:15:00

墨卿风竹

Article link: https://blog.csdn.net/qq_43688472/article/details/88738145

Both repartition and coalesce change the number of partitions of an RDD.
So how do they differ, and when is each one the right choice? Let's look at the source code.

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

As the source shows, repartition is implemented by calling coalesce with shuffle = true.

The comment on repartition in the source says:

If you are decreasing the number of partitions in this RDD, consider using coalesce,
which can avoid performing a shuffle.

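Suppose the parent RDD has 1000 partitions, a filter transformation removes 50% of the data, and you want to redistribute what remains across 500 partitions. Because this reduces the number of partitions, coalesce(500) does the job
without triggering a shuffle.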
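
The scenario above can be sketched as follows. This is a minimal illustration, not the author's code: the object name, the local master, the app name, and the sample data are all assumptions made for the example. It only inspects partition counts and lineage, so no job is run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceAfterFilter {
  // Returns (partitions after filter, partitions after coalesce, lineage contains a shuffle?)
  def inspect(sc: SparkContext): (Int, Int, Boolean) = {
    val parent = sc.parallelize(1 to 100000, 1000) // parent RDD with 1000 partitions
    val filtered = parent.filter(_ % 2 == 0)       // keeps about half of the records
    val merged = filtered.coalesce(500)            // shuffle defaults to false
    (filtered.getNumPartitions,
      merged.getNumPartitions,
      merged.toDebugString.contains("ShuffledRDD"))
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-after-filter")
    val sc = new SparkContext(conf)
    try {
      val (afterFilter, afterCoalesce, hasShuffle) = inspect(sc)
      println(s"after filter: $afterFilter, after coalesce: $afterCoalesce, shuffle: $hasShuffle")
    } finally {
      sc.stop()
    }
  }
}
```

filter keeps the parent's partition count, so the merge down to 500 partitions is done entirely by coalesce, and the lineage printed by toDebugString should contain no ShuffledRDD.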

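To increase the number of partitions, or to repartition an RDD that is not a key-value RDD, you need repartition: coalesce without a shuffle can only merge existing partitions, so it cannot produce more partitions than the parent has.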
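
A small sketch of this behaviour, assuming a local SparkContext (the object name, master, app name, and sample data are illustrative assumptions, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GrowPartitions {
  // Returns the partition counts of (coalesce without shuffle, coalesce with shuffle, repartition).
  def counts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, 4)            // 4 partitions
    val narrow = rdd.coalesce(16)                      // no shuffle: stays at the parent's partition count
    val viaShuffle = rdd.coalesce(16, shuffle = true)  // a shuffle allows growing to 16
    val repartitioned = rdd.repartition(16)            // delegates to coalesce(16, shuffle = true)
    (narrow.getNumPartitions, viaShuffle.getNumPartitions, repartitioned.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("grow-partitions")
    val sc = new SparkContext(conf)
    try {
      println(counts(sc))
    } finally {
      sc.stop()
    }
  }
}
```

Requesting more partitions from a non-shuffling coalesce is not an error; the request is simply capped at the current partition count, which is easy to miss in practice.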

Source implementation of the coalesce function:


<pre class="brush:java; toolbar: true; auto-links: true;">def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]
 
      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
  }</pre>
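
To see why the shuffle branch spreads records evenly, the key assignment can be simulated without Spark. The sketch below is an approximation written for illustration: assignKeys mirrors the quoted distributePartition function, while targetPartition is a hypothetical stand-in for HashPartitioner on Int keys (a non-negative modulo). The object and helper names are assumptions.

```scala
import scala.util.Random

object DistributePartitionSketch {
  // Mirrors the quoted distributePartition: start at a position seeded by the
  // partition index, then hand out consecutive integer keys.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Stand-in for HashPartitioner on Int keys: the key's hash code is the key
  // itself, reduced to a non-negative value modulo numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Number of records each output partition receives from one input partition.
  def partitionSizes(index: Int, recordCount: Int, numPartitions: Int): Map[Int, Int] =
    assignKeys(index, (1 to recordCount).iterator, numPartitions)
      .toList
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (partition, records) => partition -> records.size }

  def main(args: Array[String]): Unit = {
    // 10 records from input partition 0, spread over 4 output partitions.
    println(partitionSizes(0, 10, 4).toList.sortBy(_._1))
  }
}
```

Because the keys are consecutive integers, each input partition deals its records round-robin over the output partitions, so the output sizes differ by at most one record per input partition. The seeded random start only changes which output partition receives the first record.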