repartition:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*
* TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
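As the body shows, repartition(n) is nothing more than coalesce(n, shuffle = true). A minimal sketch of the equivalence (assuming an existing SparkContext named sc; names and numbers are illustrative):

// repartition always shuffles; coalesce only shuffles when asked to.
val rdd = sc.parallelize(1 to 1000, 10)

val byRepartition = rdd.repartition(4)
val byCoalesce    = rdd.coalesce(4, shuffle = true)

println(byRepartition.getNumPartitions) // 4
println(byCoalesce.getNumPartitions)    // 4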
coalesce:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
 * My understanding: when collapsing many partitions down to a single partition, setting
 * shuffle = true avoids having one executor pull all of its parent RDD's data over Netty.
 * With shuffle = true, the executors that hold the parent partitions instead send their
 * data toward the single target partition, and the upstream work still runs on all of them.
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*
 * If some partitions are abnormally large, you can spread the data across more partitions
 * by redistributing it with a hash partitioner.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T] = withScope {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
if (shuffle) {
/** Distributes elements evenly across output partitions, starting from a random partition. */
val distributePartition = (index: Int, items: Iterator[T]) => {
var position: Int = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
items.map { t =>
// Note that the hash code of the key will just be the key itself. The HashPartitioner
// will mod it with the number of total partitions.
position = position + 1
(position, t)
}
}: Iterator[(Int, T)]
// include a shuffle step so that our upstream tasks are still distributed
new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions,
partitionCoalescer).values
} else {
new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
}
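With shuffle = true, each input partition assigns its elements round-robin keys starting from a pseudo-random offset derived from the partition index, and the HashPartitioner then takes those keys modulo numPartitions, so the data ends up spread roughly evenly. A small plain-Scala sketch of that key assignment outside Spark (simulateKeys is a made-up helper for illustration, not Spark API):

import scala.util.Random
import scala.util.hashing

def simulateKeys(partitionIndex: Int, items: Seq[String], numPartitions: Int): Seq[(Int, String)] = {
  // same seeding as distributePartition above
  var position = new Random(hashing.byteswap32(partitionIndex)).nextInt(numPartitions)
  items.map { t =>
    position += 1
    // HashPartitioner later takes position % numPartitions, so consecutive elements
    // of one input partition land in consecutive output partitions
    (position, t)
  }
}

simulateKeys(0, Seq("a", "b", "c", "d"), 3).foreach(println)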
CoalescedRDD:
/**
* Represents a coalesced RDD that has fewer partitions than its parent RDD
* This class uses the PartitionCoalescer class to find a good partitioning of the parent RDD
* so that each new partition has roughly the same number of parent partitions and that
* the preferred location of each new partition overlaps with as many preferred locations of its
* parent partitions
 * For example: group as many parent partitions as possible into the same new partition. If a
 * 10-partition parent RDD is shrunk to a single partition and parent partitions 1-6 live on the
 * same executor, the new partition is placed where partitions 1-6 are; the data of parent
 * partitions 7-10 is then fetched over Netty to that location. Because nothing is written to
 * disk, the disk I/O of a shuffle is avoided, which makes this noticeably faster than a shuffle.
*
* @param prev RDD to be coalesced
* @param maxPartitions number of desired partitions in the coalesced RDD (must be positive)
* @param partitionCoalescer [[PartitionCoalescer]] implementation to use for coalescing
*/
private[spark] class CoalescedRDD[T: ClassTag](
@transient var prev: RDD[T],
maxPartitions: Int,
partitionCoalescer: Option[PartitionCoalescer] = None)
extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies
require(maxPartitions > 0 || maxPartitions == prev.partitions.length,
s"Number of partitions ($maxPartitions) must be positive.")
if (partitionCoalescer.isDefined) {
require(partitionCoalescer.get.isInstanceOf[Serializable],
"The partition coalescer passed in must be serializable.")
}
override def getPartitions: Array[Partition] = {
val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())
pc.coalesce(maxPartitions, prev).zipWithIndex.map {
case (pg, i) =>
val ids = pg.partitions.map(_.index).toArray
new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
}
}
override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // flatMap the elements of all parent partitions together into this single new partition
partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
firstParent[T].iterator(parentPartition, context)
}
}
override def getDependencies: Seq[Dependency[_]] = {
Seq(new NarrowDependency(prev) {
def getParents(id: Int): Seq[Int] =
partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
})
}
override def clearDependencies() {
super.clearDependencies()
prev = null
}
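The narrow dependency is easy to see from the lineage. A minimal sketch (assuming an existing SparkContext sc):

// coalesce without shuffle stays a narrow dependency; shuffle = true inserts a ShuffledRDD
val base = sc.parallelize(1 to 100, 8)

println(base.coalesce(2).toDebugString)                 // CoalescedRDD sits directly on the parent
println(base.coalesce(2, shuffle = true).toDebugString) // lineage now contains a ShuffledRDD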
In practice you can use coalesce flexibly according to your needs:
If parallelism is insufficient, raise the number of partitions.
If there are too many small tasks, merge partitions to shrink their number and cut task-scheduling overhead; since this avoids a shuffle, it is noticeably more efficient than RDD.repartition.
If one partition is heavily skewed, repartitioning also helps: raising the partition count with repartition spreads the data out and relieves the skew. (See the short sketch below.)
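A minimal sketch of both cases (assuming an existing SparkContext sc; the path and numbers are illustrative):

// too many small tasks: merge partitions without a shuffle
val manySmall = sc.textFile("hdfs:///logs/*", minPartitions = 2000)
val merged    = manySmall.coalesce(200)

// skewed data: repartition (shuffle) to spread rows over more partitions
val skewed   = sc.parallelize(Seq.fill(1000000)(("hot_key", 1)), 4)
val balanced = skewed.repartition(64)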
Example:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceTest extends App {
  val sparkConf = new SparkConf()
    .setAppName("CoalesceTest")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val value: RDD[Int] = spark.sparkContext.parallelize(List(9, 2, 3, 5, 8, 1), 3)

  // A proper comparator (the original returned x or y instead of a comparison result).
  // Note that coalesce never actually applies this ordering -- see below.
  val s = new Ordering[Int] {
    override def compare(x: Int, y: Int): Int = x.compareTo(y)
  }

  val coalesceValue: RDD[Int] = value.coalesce(2)(s)
  coalesceValue.foreach(println(_))
}
coalesce takes an optional implicit parameter,
(implicit ord: Ordering[T] = null). In the example above a custom Ordering is supplied, yet the output is not sorted: after the partitions are merged, the elements keep their original positions.
https://blog.csdn.net/u012684933/article/details/51028707