Two ways to repartition (coalesce vs. repartition):
Dataset/DataFrame (Spark 2.0+): coalesce (shuffle = false); repartition (shuffle = true, and can additionally partition by column);
RDD: coalesce (shuffle = false by default; a parameter can turn shuffle on); repartition (shuffle = true).
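A minimal sketch of the calls above (assuming a local run such as spark-shell; the partition counts are purely illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("repartition-demo").getOrCreate()
import spark.implicits._

// RDD API
val rdd = spark.sparkContext.parallelize(1 to 100, 8)
println(rdd.coalesce(4).getNumPartitions)                 // 4, narrow dependency, no shuffle
println(rdd.coalesce(4, shuffle = true).getNumPartitions) // 4, with shuffle
println(rdd.repartition(16).getNumPartitions)             // 16, always shuffles

// Dataset API
val ds = (1 to 100).toDS()
println(ds.coalesce(2).rdd.getNumPartitions)     // 2, no shuffle
println(ds.repartition(16).rdd.getNumPartitions) // 16, with shuffle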
Source of the repartitioning methods in Spark 2.1.0:
// RDD API (RDD.scala): repartition simply calls coalesce with shuffle = true
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  ...
}
// Dataset API (Dataset.scala)
def repartition(numPartitions: Int): Dataset[T] = withTypedPlan {
  Repartition(numPartitions, shuffle = true, logicalPlan)
}

def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] = withTypedPlan {
  RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, Some(numPartitions))
}

def coalesce(numPartitions: Int): Dataset[T] = withTypedPlan {
  Repartition(numPartitions, shuffle = false, logicalPlan)
}
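Since the column overload builds a RepartitionByExpression, rows with the same column value are hashed into the same partition. A sketch (the column names "key" and "value" are made up for illustration; spark.implicits._ as imported above):

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
// Hash-partitions by "key" into 4 partitions; equal keys share a partition
val byKey = df.repartition(4, $"key")
println(byKey.rdd.getNumPartitions) // 4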
What the shuffle flag changes (see the figures and the runnable check below):
When increasing the partition count, a shuffle write is unavoidable, so with shuffle = false the repartitioning has no effect (the partition count simply stays the same);
when decreasing the partition count, shuffle = true gives higher parallelism than shuffle = false.
With shuffle = true, repartitioning splits the job into separate stages (a wide dependency), and the lineage goes through MapPartitionsRDD, ShuffledRDD (ShuffledRowRDD on the Dataset side), and CoalescedRDD;
with shuffle = false, no stage is split (a narrow dependency) and only a CoalescedRDD is created.
Figure: rdd coalesce (shuffle = false)
Figure: rdd coalesce (shuffle = true) (the original stage 2 is split)
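Both behaviours can be verified directly (a sketch; run in spark-shell, where sc is predefined):

val rdd = sc.parallelize(1 to 100, 4)

// Increasing partitions with shuffle = false silently does nothing
println(rdd.coalesce(8).getNumPartitions)                 // still 4
println(rdd.coalesce(8, shuffle = true).getNumPartitions) // 8

// Lineage: shuffle = false shows only a CoalescedRDD (single stage);
// shuffle = true additionally shows a ShuffledRDD (stage boundary)
println(rdd.coalesce(2).toDebugString)
println(rdd.coalesce(2, shuffle = true).toDebugString)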
Summary: in general, when decreasing the partition count, use shuffle = false if the gap between the old and new partition counts is small and the data volume is small; otherwise use shuffle = true.
Note: with shuffle = false, coalesce is a narrow dependency and is fused into the same stage as the upstream transformations, so an action is needed before the coalesce; otherwise the task count of all the preceding transformations is capped at the coalesced partition count!
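One common workaround (a sketch, not the only option; expensiveTransform and the output path are hypothetical) is to cache and trigger an action before the narrow-dependency coalesce:

def expensiveTransform(i: Int): Int = i * 2 // stand-in for real upstream work

val wide = sc.parallelize(1 to 1000000, 100)
  .map(expensiveTransform)
  .cache()

wide.count() // action: the map runs with the full 100 tasks and is cached

// The narrow-dependency coalesce now only merges the cached partitions;
// without the action above, the map itself would run with only 10 tasks
wide.coalesce(10).saveAsTextFile("/tmp/out") // illustrative path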