Reducing Partitions with coalesce and Increasing Partitions with repartition in Spark RDDs

liguanghai12

Category: Big Data. Tags: spark, big data

Article link: https://blog.csdn.net/weixin_44563670/article/details/112799231

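The RDD is a core data structure in Spark. In day-to-day use, if each partition holds very little data but there are too many partitions, Spark launches more tasks, which increases resource usage. Conversely, if the data volume is large but there are too few partitions, the executors run few tasks, each task is large, and execution time goes up. We therefore want a partition count that yields a reasonable number of moderately sized tasks.
Typically, coalesce is used to reduce the number of partitions and repartition is used to increase it.
The source of coalesce is as follows: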

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
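
To see why the shuffle branch spreads data evenly, the key-assignment logic in the listing can be replayed without a cluster. The following is a minimal sketch in plain Scala, not Spark code: outputSizes and targetPartition are hypothetical helpers introduced here for illustration, and targetPartition only approximates what HashPartitioner does for Int keys (a non-negative modulo). The Random seed follows the listing above; other Spark versions may compute it differently.

```scala
import scala.util.Random

object ShuffleDistributionSketch {
  /** Mirrors distributePartition from the listing: tag each element with an
    * incrementing key, starting from a pseudo-random position seeded by the
    * parent partition index. */
  def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  /** Assumed stand-in for HashPartitioner: an Int key hashes to itself, so the
    * target partition is the non-negative remainder of key / numPartitions. */
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val mod = key % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  /** Hypothetical helper: how many elements land in each output partition. */
  def outputSizes[T](parents: Seq[Seq[T]], numPartitions: Int): Vector[Int] = {
    val sizes = Array.fill(numPartitions)(0)
    for {
      (partition, index) <- parents.zipWithIndex
      (key, _) <- distributePartition(index, partition.iterator, numPartitions)
    } sizes(targetPartition(key, numPartitions)) += 1
    sizes.toVector
  }

  def main(args: Array[String]): Unit = {
    // Two skewed parent partitions: 9 elements and 3 elements.
    val parents: Seq[Seq[Int]] = Seq((1 to 9).toList, (10 to 12).toList)
    println(outputSizes(parents, 3)) // prints Vector(4, 4, 4)
  }
}
```

Each parent partition hands out consecutive keys, so its elements are dealt round-robin across the output partitions. That is why the skewed 9/3 input ends up as 4/4/4 regardless of the random starting offsets.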

The source of repartition is as follows:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

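coalesce takes two parameters: the first is the number of partitions after repartitioning, and the second specifies whether a shuffle is performed.
1. The first parameter, numPartitions, is the target number of partitions.
2. The second parameter, shuffle, specifies whether a shuffle is performed (false means no shuffle, true means shuffle; the default is false).
3. Without a shuffle, the data is not redistributed: whole partitions are simply merged into the remaining ones, so the resulting partitions can differ in size and data skew may occur.
4. With a shuffle, elements are distributed round-robin across the output partitions, so partition sizes come out roughly even and skew is largely avoided.
5. If coalesce is used to increase the number of partitions with shuffle set to false, the call has no effect: nothing is repartitioned and the partition count stays the same. To increase the number of partitions with coalesce, shuffle must be set to true.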
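
The points above can be checked with a small experiment. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name, the app name, the local[2] master and the helper partitionCounts are illustrative choices, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  /** Hypothetical helper: partition counts after four coalesce calls
    * on an RDD that starts with 10 partitions. */
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    val rdd = sc.parallelize(1 to 100, 10)
    Seq(
      rdd.coalesce(2).getNumPartitions,                  // shrink, no shuffle
      rdd.coalesce(2, shuffle = true).getNumPartitions,  // shrink, with shuffle
      rdd.coalesce(20).getNumPartitions,                 // grow, no shuffle: stays at 10
      rdd.coalesce(20, shuffle = true).getNumPartitions  // grow, with shuffle
    )
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // prints List(2, 2, 10, 20)
    } finally {
      sc.stop()
    }
  }
}
```

The third value is the one to watch: asking for 20 partitions without a shuffle leaves the RDD at its original 10.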

In the repartition source, the implementation simply delegates to coalesce with shuffle fixed to true, so that the data is balanced after repartitioning.
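
As a rough illustration of that equivalence, the sketch below repartitions a 2-partition RDD to 8 partitions both ways and prints the per-partition element counts. It assumes Spark is on the classpath; the object name, helper names, app name and local[2] master are hypothetical. The exact sizes depend on the random starting offsets, so no specific output is claimed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  /** Per-partition element counts after repartition(n) on a 2-partition RDD. */
  def sizesAfterRepartition(sc: SparkContext, n: Int): Array[Int] =
    sc.parallelize(1 to 100, 2).repartition(n).glom().map(_.length).collect()

  /** The same operation written directly as coalesce with a shuffle. */
  def sizesAfterCoalesce(sc: SparkContext, n: Int): Array[Int] =
    sc.parallelize(1 to 100, 2).coalesce(n, shuffle = true).glom().map(_.length).collect()

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-sketch")
    val sc = new SparkContext(conf)
    try {
      println(sizesAfterRepartition(sc, 8).mkString(", "))
      println(sizesAfterCoalesce(sc, 8).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Both calls should report 8 partitions whose sizes add up to 100 and are close to one another.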
