Spark's coalesce: Trade-offs and Internals

Latest recommended article published on 2023-04-18 09:05:52

雾岛与鲸

Article link: https://blog.csdn.net/qq_36039236/article/details/107696576

1. Source of the coalesce function in RDD.scala

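/*
 * numPartitions: target number of partitions
 * shuffle: whether to shuffle (no shuffle by default)
 * partitionCoalescer: the coalescer that decides which parent-RDD partitions are grouped
 * together into one PartitionGroup, and therefore how the CoalescedRDD is partitioned
 */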
def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    
    // whether to shuffle
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
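The shuffle branch above can be hard to picture, so here is a minimal plain-Scala sketch of what distributePartition does to the elements of one input partition. It needs no Spark on the classpath. The names ShuffleKeySketch, assignKeys, targetPartition and countsPerTarget are hypothetical and exist only for this illustration; targetPartition assumes that HashPartitioner reduces an Int key with a non-negative modulo, as the comment in the source suggests.

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

// Hypothetical model of the shuffle = true branch, without Spark.
object ShuffleKeySketch {
  // Same idea as distributePartition: start at a position seeded by the
  // partition index, then hand out consecutive integer keys.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Assumption: a HashPartitioner-style non-negative modulo on the Int key.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // How many elements of one input partition land in each output partition.
  def countsPerTarget[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    assignKeys(index, items.iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toList
      .groupBy(identity)
      .map { case (partition, hits) => partition -> hits.size }
}
```

Because the keys are consecutive integers, each input partition spreads its elements round-robin over the output partitions; the random start only changes which output partition receives the first element.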

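Every RDD determines its partitions through getPartitions and its dependencies (narrow or wide) through getDependencies(), and every RDD computes its data through the compute method.

The rest of this article covers only the path that does not shuffle: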

new CoalescedRDD(this, numPartitions, partitionCoalescer)
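As a quick way to see both branches from the caller's side, the following sketch compares partition counts. It assumes Spark is on the classpath and that a local master is acceptable; CoalesceUsage and partitionCounts are hypothetical names, and the sizes (1000 elements, 100 slices) are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical usage sketch; requires a Spark dependency to run.
object CoalesceUsage {
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 100)

    // Narrow dependency, no shuffle: 100 parent partitions grouped into 10.
    val narrowed = rdd.coalesce(10)

    // Without a shuffle the partition count cannot grow: the coalescer caps
    // the group count at math.min(parent partitions, requested partitions).
    val notGrown = rdd.coalesce(200)

    // With shuffle = true the data is redistributed, so the count can grow.
    val grown = rdd.coalesce(200, shuffle = true)

    (narrowed.getNumPartitions, notGrown.getNumPartitions, grown.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-usage")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) finally sc.stop()
  }
}
```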

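2. getPartitions: grouping partitions

The partitionCoalescer argument of coalesce is empty by default, so you are free to supply your own strategy for grouping the parent RDD's partitions. When none is supplied, the getPartitions function of CoalescedRDD falls back to the default coalescer, DefaultPartitionCoalescer.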

override def getPartitions: Array[Partition] = {
    // If no coalescer was supplied, fall back to the default: DefaultPartitionCoalescer
    val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())

    pc.coalesce(maxPartitions, prev).zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.partitions.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

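Look at the coalesce method of DefaultPartitionCoalescer: it groups the parent RDD's partitions so that the number of groups does not exceed the requested partition count. It returns an Array[PartitionGroup]; each PartitionGroup is a group of parent-RDD partitions and corresponds to exactly one partition of the CoalescedRDD.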

def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
    val partitionLocs = new PartitionLocations(prev)
    // setup the groups (bins)
    setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
    // assign partitions (balls) to each group (bins)
    throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
    getPartitions
  }
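To make the "bins and balls" idea concrete, here is a simplified plain-Scala model of the grouping for the case where the parent partitions have no preferred locations. This is an assumption-laden sketch, not the Spark implementation: the real throwBalls also takes locality and balanceSlack into account, which are ignored here. EvenGroupingSketch and group are hypothetical names.

```scala
// Hypothetical model: split parent partition indices into contiguous,
// near-equal ranges, ignoring preferred locations entirely.
object EvenGroupingSketch {
  def group(parentCount: Int, maxPartitions: Int): Vector[Vector[Int]] = {
    // Same cap as setupGroups: never more groups than parent partitions.
    val target = math.min(parentCount, maxPartitions)
    Vector.tabulate(target) { i =>
      // Long arithmetic avoids Int overflow for large partition counts.
      val start = ((i.toLong * parentCount) / target).toInt
      val end = (((i.toLong + 1) * parentCount) / target).toInt
      (start until end).toVector
    }
  }
}
```

For example, EvenGroupingSketch.group(10, 3) yields the groups (0, 1, 2), (3, 4, 5) and (6, 7, 8, 9), and asking for more groups than there are parent partitions simply returns one group per parent partition.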

A PartitionGroup is simply a collection of parent-RDD partitions that were grouped according to some rule. Here is the class:

/**
 * ::DeveloperApi::
 * A group of `Partition`s
 * @param prefLoc preferred location for the partition group
 */
@DeveloperApi
class PartitionGroup(val prefLoc: Option[String] = None) {
  val partitions = mutable.ArrayBuffer[Partition]()
  def numPartitions: Int = partitions.size
}
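Since the coalescer is pluggable, a custom grouping strategy might look like the sketch below. It assumes Spark is on the classpath and that a custom coalescer must be Serializable; ModuloCoalescer is a hypothetical name, and the modulo rule is chosen only because it is easy to follow, not because it is a good strategy (it ignores data locality).

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Hypothetical coalescer: parent partition i goes to group i % groupCount.
class ModuloCoalescer extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    // Never create more groups than there are parent partitions.
    val groupCount = math.min(maxPartitions, parent.partitions.length)
    val groups = Array.fill(groupCount)(new PartitionGroup())
    parent.partitions.foreach { p =>
      groups(p.index % groupCount).partitions += p
    }
    groups
  }
}
```

It would be passed as the third argument, for example rdd.coalesce(3, shuffle = false, Some(new ModuloCoalescer)).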

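3. getDependencies: lineage

As described above, the getPartitions() method of CoalescedRDD establishes the mapping from the parent RDD's partitions to the partitions of the current RDD. That mapping is consumed through the getDependencies method, as follows: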

override def getDependencies: Seq[Dependency[_]] = {
    Seq(new NarrowDependency(prev) {
      def getParents(id: Int): Seq[Int] =
        partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
    })
  }

The partitions array is implemented in the RDD class. On first access it calls getPartitions and caches the result:

 /**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
        partitions_.zipWithIndex.foreach { case (partition, index) =>
          require(partition.index == index,
            s"partitions($index).partition == ${partition.index}, but it should equal $index")
        }
      }
      partitions_
    }
  }

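Back to the narrow dependency, NarrowDependency: its getParents method uses the id of the current partition to look up the corresponding CoalescedRDDPartition and returns its parentsIndices, that is, the indices of the parent-RDD partitions in that group. The grouping itself is the one computed in the getPartitions method of CoalescedRDD.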
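The lookup can be modelled in a few lines of plain Scala. This is only an illustration of the data flow, with no Spark types involved; NarrowDependencySketch, CoalescedPartition, buildPartitions and getParents are hypothetical stand-ins for the real classes and methods.

```scala
// Hypothetical model of how getParents resolves parent partition indices.
object NarrowDependencySketch {
  // Stand-in for CoalescedRDDPartition: its own index plus its parents' indices.
  final case class CoalescedPartition(index: Int, parentsIndices: List[Int])

  // Stand-in for getPartitions: one output partition per group of parents.
  def buildPartitions(groups: Seq[Seq[Int]]): Vector[CoalescedPartition] =
    groups.zipWithIndex.map { case (ids, i) => CoalescedPartition(i, ids.toList) }.toVector

  // Stand-in for NarrowDependency.getParents: a plain index lookup.
  def getParents(partitions: Vector[CoalescedPartition], id: Int): List[Int] =
    partitions(id).parentsIndices
}
```

The point is that the dependency stores nothing of its own: it reads the grouping that getPartitions already produced, so each output partition depends on a fixed, known set of parent partitions and no shuffle is required.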

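4. compute: computing a partition

compute is one of the five core properties of an RDD: the function that computes a single partition. For CoalescedRDD it is implemented as follows: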

override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

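As the method shows, the computation operates on a CoalescedRDDPartition, which is effectively one PartitionGroup: a group of parent-RDD partitions produced in the getPartitions method.

At this point the picture is clear. Although the compute method of CoalescedRDD nominally computes one partition of the CoalescedRDD, it actually iterates over a whole group of parent-RDD partitions inside a single task. That lowers the parallelism with which the parent RDD is computed, so coalesce should be used with care.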
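The flatMap in compute can be mimicked without Spark. In this sketch parentIterator is a hypothetical stand-in for firstParent[T].iterator(parentPartition, context), and ComputeSketch is an illustrative name.

```scala
// Hypothetical model of CoalescedRDD.compute: one task walks through every
// parent partition of its group, one after another.
object ComputeSketch {
  def compute[T](parentIndices: Seq[Int], parentIterator: Int => Iterator[T]): Iterator[T] =
    parentIndices.iterator.flatMap(parentIterator)
}
```

With parent data Vector(Vector(1, 2), Vector(3), Vector(4, 5)) and the group Seq(0, 2), the result is 1, 2, 4, 5: the records of parent partition 0 followed by those of parent partition 2, all produced by a single iterator, which in Spark means a single task.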

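Summary

The getPartitions function of CoalescedRDD groups the parent RDD's partitions. For example, if the parent RDD has 1000 partitions and coalesce reduces that to 10, the partitions are split into 10 PartitionGroups, and each PartitionGroup becomes one partition of the CoalescedRDD. The parallelism of a stage is determined by the partition count of the last RDD in that stage, and, as the compute method shows, each partition of the CoalescedRDD iterates over many parent partitions. The parent RDD is therefore computed with reduced parallelism (10 tasks instead of 1000 in this example). Lowering parallelism for complex processing is a poor trade, especially when the data volume is large.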
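The arithmetic behind this conclusion can be written down as a small plain-Scala model. It is a deliberate simplification that only counts tasks: without a shuffle the upstream work runs in the same stage as the coalesced output, while shuffle = true inserts a stage boundary so that, as the comment in the source puts it, the upstream tasks are still distributed. ParallelismSketch, Plan and plan are hypothetical names.

```scala
// Hypothetical task-count model for coalesce with and without a shuffle.
object ParallelismSketch {
  final case class Plan(upstreamTasks: Int, outputTasks: Int)

  def plan(parentPartitions: Int, target: Int, shuffle: Boolean): Plan =
    if (shuffle) {
      // Stage boundary: the parent keeps its own parallelism.
      Plan(upstreamTasks = parentPartitions, outputTasks = target)
    } else {
      // Same stage: parent partitions are computed inside the coalesced tasks.
      val tasks = math.min(parentPartitions, target)
      Plan(upstreamTasks = tasks, outputTasks = tasks)
    }
}
```

For the example above, plan(1000, 10, shuffle = false) gives 10 upstream tasks, each walking through 100 parent partitions, whereas plan(1000, 10, shuffle = true) keeps 1000 upstream tasks at the cost of moving the data through a shuffle.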
