Using repartition and coalesce in Spark

Latest recommended article published on 2022-05-05 20:21:41

吃鱼的羊


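repartition(numPartitions: Int): RDD[T] and coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

Both methods redistribute the partitions of an RDD. repartition is simply a convenience wrapper around coalesce with shuffle set to true. In the cases below, assume the RDD has N partitions and is to be redistributed into M partitions.

1) N < M. The N partitions usually hold unevenly distributed data, and a HashPartitioner is used to redistribute the data into M partitions. shuffle must be set to true in this case.

2) N > M, and N and M are of a comparable order (for example, N is 1000 and M is 100). Several of the N partitions can then be merged into each new partition until M partitions remain, so shuffle can be set to false. No shuffle takes place, and the parent RDD and child RDD are connected by a narrow dependency. Note that with shuffle set to false, coalesce has no effect when M > N.

3) N > M, and the two differ drastically. If shuffle is set to false here, the parent and child RDDs are connected by a narrow dependency and sit in the same stage, so the Spark program may run with too little parallelism and performance suffers. When M is 1, for example, setting shuffle to true lets the operations before coalesce keep better parallelism.

In short: with shuffle set to false, passing a value larger than the current number of partitions leaves the partition count unchanged. In other words, the number of partitions of an RDD cannot be increased without a shuffle.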
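The cases above can be checked with a small experiment. The following is a minimal sketch, assuming spark-core is on the classpath and that running in local mode is acceptable; the object name, the application name, and the sample data are illustrative and not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  /** Returns the partition counts produced by four ways of resizing a 10-partition RDD. */
  def run(): Seq[Int] = {
    // Assumption: local mode with 4 threads, purely for illustration.
    val conf = new SparkConf().setAppName("coalesce-vs-repartition").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 10) // N = 10
      Seq(
        rdd.coalesce(4),                  // N > M, no shuffle: merged down to 4
        rdd.coalesce(20),                 // N < M, no shuffle: stays at 10
        rdd.coalesce(20, shuffle = true), // N < M, with shuffle: grows to 20
        rdd.repartition(20)               // equivalent to coalesce(20, shuffle = true)
      ).map(_.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", "))
}
```

The second entry is the one to watch: without a shuffle, asking for 20 partitions on a 10-partition RDD leaves the count at 10.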

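When a Spark program has too many small tasks, RDD.coalesce can shrink and merge partitions, reducing the number of partitions and the task scheduling overhead. Because it avoids a shuffle, it is considerably more efficient than RDD.repartition.

rdd.coalesce creates a CoalescedRDD. The source code is as follows: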

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
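The shuffle branch of the source above can be imitated without Spark to see why it spreads elements evenly. The following is a minimal sketch in plain Scala: assignKeys mirrors distributePartition, and targetPartition is an assumed stand-in for what HashPartitioner does with an Int key (a non-negative modulo). The object and helper names are hypothetical.

```scala
import scala.util.Random

object DistributePartitionSketch {
  /** Mirrors distributePartition: consecutive Int keys, starting from a seeded random position. */
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  /** Assumed stand-in for HashPartitioner: an Int's hashCode is the Int itself, taken modulo numPartitions. */
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  /** Number of elements landing in each target partition, sorted by partition id. */
  def counts(index: Int, numItems: Int, numPartitions: Int): Seq[(Int, Int)] =
    assignKeys(index, (1 to numItems).iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toSeq
      .groupBy(identity)
      .map { case (partition, hits) => partition -> hits.size }
      .toSeq
      .sorted

  def main(args: Array[String]): Unit =
    println(counts(index = 0, numItems = 10, numPartitions = 4))
}
```

Because the keys are consecutive integers, the elements of one input partition are dealt out round-robin, so the per-partition counts differ by at most one regardless of the starting position.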

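This article covers only the default case, that is, how rdd.coalesce works when no shuffle occurs.

When the DAGScheduler assigns tasks, it needs to call CoalescedRDD.getPartitions to obtain the partition information of the CoalescedRDD. The code of this method is as follows: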

override def getPartitions: Array[Partition] = {
  val pc = new PartitionCoalescer(maxPartitions, prev, balanceSlack)

  pc.run().zipWithIndex.map {
    case (pg, i) =>
      // ids: for one CoalescedRDD partition, the ids of all parent RDD partitions it maps to
      val ids = pg.arr.map(_.index).toArray
      new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
  }
}

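In the method above, which parent RDD partitions a CoalescedRDD partition maps to is decided by the PartitionCoalescer data structure. The implementation of PartitionCoalescer is not covered in detail here; only its goals are listed:

1. Each CoalescedRDD partition maps to roughly the same number of parent RDD partitions.

2. Each CoalescedRDD partition keeps, as far as possible, the same locality as its parent RDD partitions. For example, if partition 1 of the CoalescedRDD maps to partitions 1 through 10 of its parent RDD, and partitions 1 through 7 reside on node 1.1.1.1, then partition 1 of the CoalescedRDD is scheduled on node 1.1.1.1. This reduces data transfer between nodes and improves throughput.

3. CoalescedRDD partitions are spread across different nodes as far as possible.

4. Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard). (This goal is quoted verbatim from the comments in the Spark source.)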
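Goals 1 and 2 can be illustrated with a deliberately simplified sketch in plain Scala. It is not the PartitionCoalescer implementation: group splits parent partition indices into contiguous, nearly equal ranges and ignores locality entirely, and majorityLocation picks the node that hosts most of a group's parents. Both helpers and the object name are hypothetical.

```scala
object EvenGroupingSketch {
  /** Splits parent partition indices 0 until numParents into at most numGroups contiguous, near-equal ranges. */
  def group(numParents: Int, numGroups: Int): Seq[Seq[Int]] = {
    require(numParents >= 0 && numGroups > 0, "sizes must be non-negative and numGroups positive")
    // Without a shuffle the partition count cannot grow, so cap the group count at numParents.
    val groups = math.min(numParents, numGroups)
    (0 until groups).map { i =>
      val start = ((i.toLong * numParents) / groups).toInt
      val end = (((i.toLong + 1) * numParents) / groups).toInt
      start until end
    }
  }

  /** Picks the location that hosts the most parent partitions of one group, if there is any. */
  def majorityLocation(parentLocations: Seq[String]): Option[String] =
    if (parentLocations.isEmpty) None
    else Some(parentLocations.groupBy(identity).maxBy(_._2.size)._1)

  def main(args: Array[String]): Unit = {
    println(group(numParents = 10, numGroups = 3))
    println(majorityLocation(Seq.fill(7)("1.1.1.1") ++ Seq.fill(3)("2.2.2.2")))
  }
}
```

With 10 parents and 3 groups the ranges have sizes 3, 3 and 4, and the second call reproduces the 1.1.1.1 example from goal 2.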

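A CoalescedRDD partition is represented by the CoalescedRDDPartition data structure. It is a container that holds the parent RDD partitions mapped to it. In the code above, the ids argument passed when creating a CoalescedRDDPartition is an array of the ids of all parent RDD partitions that this CoalescedRDDPartition maps to.

CoalescedRDD.compute produces the data of one CoalescedRDD partition. The source code is as follows: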

override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
  partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
    firstParent[T].iterator(parentPartition, context)
  }
}

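As can be seen, this method concatenates the data of all parent RDD partitions that map to one CoalescedRDD partition into a single new Iterator.

[Figure: the parent RDD partitions of one CoalescedRDD partition concatenated into a single Iterator]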
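The concatenation performed by compute can be reproduced with ordinary iterators. The following is a minimal sketch in plain Scala, where readParent is a hypothetical stand-in for firstParent[T].iterator(parentPartition, context) and the sample data is made up.

```scala
object CoalescedComputeSketch {
  /** Lazily chains the iterators of the given parent partitions, in order. */
  def compute[T](parentIds: Seq[Int], readParent: Int => Iterator[T]): Iterator[T] =
    parentIds.iterator.flatMap(readParent)

  def main(args: Array[String]): Unit = {
    // Made-up contents of three parent partitions, keyed by partition id.
    val parents = Map(0 -> Seq("a", "b"), 1 -> Seq("c"), 2 -> Seq("d", "e"))
    // One coalesced partition that maps to parent partitions 0 and 2.
    val merged = compute(Seq(0, 2), id => parents(id).iterator)
    println(merged.toList)
  }
}
```

Since flatMap on an Iterator is lazy, a parent partition is only read once the previous one is exhausted, so nothing is materialized up front.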

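If a CoalescedRDD partition and its parent RDD partitions are not on the same Executor, the parent RDD data must first be fetched over Netty and then concatenated.

Changing the number of partitions with rdd.coalesce may require fetching parent RDD data over Netty, but nothing is written to disk, so disk I/O is avoided and it is still considerably faster than a shuffle.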
