Spark repartition vs coalesce

MilkyTea'Ou

Last modified: 2022-03-23 16:33:17

Category: spark. Tags: spark

First published: 2022-03-23 16:14:56

Article link: https://blog.csdn.net/micro_msdn/article/details/123688135

repartition vs coalesce

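Summary:
1. repartition() can increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() (with its default shuffle = false) can only decrease it, and does so more cheaply.
2. One important point: repartition() is a very expensive operation because it always shuffles the data across partitions, so keep repartition calls to a minimum. coalesce() avoids that shuffle by merging existing partitions.
3. Spark RDD coalesce() is meant for reducing the number of partitions. On an RDD, repartition(n) is simply coalesce(n, shuffle = true); plain coalesce(n) is the optimized path because far less data moves between partitions.
4. repartition involves a shuffle, while coalesce does not. Suppose the parent RDD has 1000 partitions and you call coalesce(100): the 1000 parent partitions are divided into 100 groups of 10, each called a partitionGroup. Each partitionGroup becomes one partition of the coalesced RDD and is processed by iterating over its parent partitions inside the compute method, which is how the shuffle is avoided.
5. For a drastic reduction such as numPartitions = 1, repartition still runs the upstream computation in parallel on multiple nodes and only then shuffles the results to one node, whereas coalesce pulls the whole computation onto a single node. In this scenario repartition is the better fit.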
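
The grouping described in point 4 can be sketched in plain Scala. This is a simplified model, not Spark's actual coalescer: it ignores data locality and simply assigns contiguous ranges of parent partition indices to each group. The names CoalesceGroupingSketch and groupPartitions are made up for illustration.

```scala
object CoalesceGroupingSketch {

  /**
   * Splits the parent partition indices 0 until parentCount into contiguous
   * groups. Each group stands for one partition of the coalesced RDD.
   * Simplifying assumption: no locality preferences are taken into account.
   */
  def groupPartitions(parentCount: Int, target: Int): Seq[Range] = {
    require(parentCount >= 0, "parentCount must not be negative.")
    require(target > 0, s"Number of partitions ($target) must be positive.")

    // Without a shuffle the partition count can only go down, never up.
    val groups = math.min(parentCount, target)

    (0 until groups).map { i =>
      // Long arithmetic avoids Int overflow for large partition counts.
      val start = ((i.toLong * parentCount) / groups).toInt
      val end = (((i.toLong + 1) * parentCount) / groups).toInt
      start until end
    }
  }

  def main(args: Array[String]): Unit = {
    val groups = groupPartitions(parentCount = 1000, target = 100)
    val first = groups.head
    println(s"number of groups: ${groups.size}")
    println(s"first group covers parents ${first.start} to ${first.end - 1}")
    println(s"distinct group sizes: ${groups.map(_.size).distinct.mkString(", ")}")
  }
}
```

Because every output partition only reads from a fixed set of parents, the dependency stays narrow and no data has to be redistributed by key or round robin, which is the reason no shuffle stage is needed.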


// From Dataset.scala
/**
 * Returns a new Dataset that has exactly `numPartitions` partitions, when the fewer partitions
 * are requested. If a larger number of partitions is requested, it will stay at the current
 * number of partitions. Similar to coalesce defined on an `RDD`, this operation results in
 * a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not
 * be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can call repartition. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * @group typedrel
 * @since 1.6.0
 */
def coalesce(numPartitions: Int): Dataset[T] = withTypedPlan {
  // shuffle = false: coalesce builds the same logical operator as repartition, minus the shuffle
  Repartition(numPartitions, shuffle = false, logicalPlan)
}

// From basicLogicalOperators.scala (Catalyst logical plan)
case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
  extends RepartitionOperation {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

  // An output partitioning is only defined when a shuffle actually takes place
  override def partitioning: Partitioning = {
    require(shuffle, "Partitioning can only be used in shuffle.")
    numPartitions match {
      case 1 => SinglePartition
      case _ => RoundRobinPartitioning(numPartitions)
    }
  }

  override protected def withNewChildInternal(newChild: LogicalPlan): Repartition =
    copy(child = newChild)
}
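
The behaviour described in the Scaladoc can be checked with a small program. This is a minimal sketch that assumes Spark SQL is on the classpath and runs in local mode; the object name RepartitionVsCoalesce, the helper partitionCounts, and the 8-partition input are made up for illustration. Per the Scaladoc, coalesce(16) on 8 partitions should stay at 8, while repartition(16) adds a shuffle and should produce 16.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {

  /** Partition counts for: original, coalesce(2), coalesce(16), repartition(16). */
  def partitionCounts(spark: SparkSession): (Int, Int, Int, Int) = {
    // Hypothetical input: 1000 rows spread over 8 partitions.
    val ds = spark.range(0L, 1000L, 1L, 8)
    (
      ds.rdd.getNumPartitions,
      ds.coalesce(2).rdd.getNumPartitions,      // fewer partitions, no shuffle
      ds.coalesce(16).rdd.getNumPartitions,     // asking for more has no effect
      ds.repartition(16).rdd.getNumPartitions   // shuffle, so the count can grow
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-vs-coalesce")
      .master("local[4]") // assumption: local mode with 4 threads
      .getOrCreate()
    try {
      val (original, shrunk, notGrown, grown) = partitionCounts(spark)
      println(s"original=$original coalesce(2)=$shrunk " +
        s"coalesce(16)=$notGrown repartition(16)=$grown")
    } finally {
      spark.stop()
    }
  }
}
```

Calling explain() on the coalesced and repartitioned Datasets is another way to see the difference, since only the repartitioned plan should contain an exchange step.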

Verifying against the source code:
1. First question: what is the execution logic of the coalesce function?