The Difference Between repartition and coalesce in Spark, and When to Use Each

Both repartition and coalesce re-partition an RDD. So how do they differ, and when is each one the right choice? Let's start with the source code.

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

As you can see, repartition is implemented internally as a call to coalesce with shuffle = true.
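
In other words, the two calls below are interchangeable (a sketch; rdd stands for any RDD[T]):

// Equivalent: both trigger a full shuffle and produce 8 output partitions.
val a = rdd.repartition(8)
val b = rdd.coalesce(8, shuffle = true)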

The Scaladoc comment notes:

  • If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

For example, if a parent RDD has 1000 partitions and a filter transformation then discards 50% of the data, you may want to redistribute the remaining data across 500 partitions. Since you are decreasing the partition count, use coalesce(500), which avoids triggering a shuffle.
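
A minimal spark-shell sketch of that scenario (the numbers are illustrative, and sc is the usual SparkContext):

// Hypothetical parent RDD with 1000 partitions.
val parent = sc.parallelize(1 to 1000000, 1000)

// filter drops ~50% of the rows but keeps all 1000 (now half-empty) partitions.
val filtered = parent.filter(_ % 2 == 0)

// Shrinking 1000 -> 500 is a narrow dependency: each output partition simply
// reads two parent partitions, so no shuffle is performed.
val compacted = filtered.coalesce(500)
println(compacted.getNumPartitions) // 500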

When increasing the number of partitions, or when you want to redistribute the data of an RDD that is not a key-value RDD, you need repartition: a shuffle-free coalesce can only merge partitions, never split them.
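
The spark-shell sketch below (continuing the assumptions above) shows this limit:

val few = sc.parallelize(1 to 100, 4)

// shuffle = false can only merge partitions, so the request to grow is ignored.
println(few.coalesce(16).getNumPartitions)    // still 4

// repartition forces shuffle = true and can grow (or shrink) the count freely.
println(few.repartition(16).getNumPartitions) // 16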

The source of the coalesce function:


def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
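
To see how distributePartition spreads the data, here is a standalone sketch of just its key-assignment logic, runnable outside Spark (purely illustrative):

import scala.util.Random

val numPartitions = 4
val index = 0 // stands in for the input partition's index

// Start from a pseudo-random output partition, then hand out keys round-robin.
var position = new Random(index).nextInt(numPartitions)
val keyed = Seq("a", "b", "c", "d", "e").map { t =>
  position += 1
  (position, t)
}

// HashPartitioner then routes each pair by key % numPartitions, so consecutive
// elements land in consecutive output partitions, spreading the data evenly.
println(keyed.map { case (k, v) => (k % numPartitions, v) })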
