Spark coalesce operator

江湖峰哥


Article link: https://blog.csdn.net/querydata_boke/article/details/105281372


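The coalesce operator repartitions the parent RDD and lets you specify whether the repartitioning performs a shuffle.

When the requested number of partitions is greater than the parent RDD's partition count and shuffle is set to false, the repartitioning has no effect.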

Code example:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceDemo { // object name chosen here only so the example compiles
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("appName")
      .getOrCreate()
    val sc = sparkSession.sparkContext
    val rdd1: RDD[String] = sc.parallelize(List(
      "spark1", "spark2", "spark3",
      "spark4", "spark5", "spark6",
      "spark7", "spark8", "spark9"),
      3) // the parent RDD has 3 partitions
    // Tag each element with the index of the parent partition it comes from
    val rdd2: RDD[String] = rdd1.mapPartitionsWithIndex { (index, iter) =>
      iter.map(str => "parent rdd1 partition index = 【" + index + "】, value = " + str)
    }

    // Fewer partitions than the parent RDD, shuffle = false
    val rdd3: RDD[String] = rdd2.coalesce(2, false)
    // Print the contents of each child partition
    val rdd4: RDD[String] = rdd3.mapPartitionsWithIndex { (index, iter) =>
      val values = iter.toList // materialize so the elements can be printed and still returned
      println()
      println("Current partition index: 【" + index + "】")
      values.foreach(str => println("child rdd3 partition index = 【" + index + "】, value = " + str))
      values.iterator
    }
    rdd4.collect() // action that triggers the job
  }
}
Output:
Current partition index: 【0】
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark1
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark2
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark3
Current partition index: 【1】
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark4
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark5
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark6
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【2】, value = spark7
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【2】, value = spark8
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【2】, value = spark9
As shown in figure ① below: this is a narrow dependency, and no shuffle occurs.
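
The grouping in the output above (parent partition 0 on its own, parent partitions 1 and 2 merged) is what an even split of consecutive partition indices would give. The following is a minimal sketch of that idea in plain Scala, not Spark's implementation: the `groupParents` helper is hypothetical, and it assumes the parent partitions carry no locality preferences, which Spark's real coalescer would otherwise take into account.

```scala
object CoalesceGroupingSketch {
  // Hypothetical helper: which parent partition indices feed each child
  // partition when merging without a shuffle. Assumes no locality preferences.
  def groupParents(parentCount: Int, targetCount: Int): Seq[Seq[Int]] =
    if (targetCount >= parentCount) {
      // A partition cannot be split without a shuffle, so each parent stays alone.
      (0 until parentCount).map(i => Seq(i))
    } else {
      // Split the parent indices into targetCount consecutive ranges.
      (0 until targetCount).map { i =>
        val start = ((i.toLong * parentCount) / targetCount).toInt
        val end = (((i.toLong + 1) * parentCount) / targetCount).toInt
        (start until end).toList
      }
    }

  def main(args: Array[String]): Unit = {
    println(groupParents(3, 2)) // parent 0 alone; parents 1 and 2 together
    println(groupParents(3, 4)) // three groups of one parent each
  }
}
```

Because every child partition reads from a fixed set of whole parent partitions, no element has to be redistributed across partitions, which is why the dependency stays narrow.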

// Fewer partitions than the parent RDD, shuffle = true
val rdd3: RDD[String] = rdd2.coalesce(2, true)
Output:
Current partition index: 【0】
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark1
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark3
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【1】, value = spark4
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【1】, value = spark6
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【2】, value = spark7
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【2】, value = spark9

Current partition index: 【1】
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【0】, value = spark2
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark5
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【2】, value = spark8
As shown in figure ② above: this is a wide dependency, and a shuffle occurs.
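
One way to confirm whether a shuffle is involved, without reading the printed partitions, is to look at the RDD lineage. The sketch below is an illustration under stated assumptions: the object name `CoalesceLineageSketch`, the helper `lineage`, and the integer test data are hypothetical, while `coalesce` and `toDebugString` are standard RDD methods.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceLineageSketch {
  // Hypothetical helper: lineage of a 3-partition RDD coalesced to n partitions.
  def lineage(n: Int, shuffle: Boolean): String = {
    val spark = SparkSession.builder
      .master("local")
      .appName("coalesce-lineage")
      .getOrCreate()
    val parent = spark.sparkContext.parallelize(1 to 9, 3)
    parent.coalesce(n, shuffle).toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Without a shuffle the lineage should contain no ShuffledRDD.
    println(lineage(2, shuffle = false))
    // With a shuffle a ShuffledRDD should appear in the lineage.
    println(lineage(2, shuffle = true))
  }
}
```

The exact text of the lineage varies between Spark versions, so it is safer to check for the presence of `ShuffledRDD` than to compare whole strings.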

The case where the requested number of partitions equals the parent RDD's partition count is omitted.

// More partitions than the parent RDD, shuffle = true
val rdd3: RDD[String] = rdd2.coalesce(4, true)
Output:
Current partition index: 【2】 (this partition is empty)

Current partition index: 【0】
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark2
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【1】, value = spark5
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【2】, value = spark8

Current partition index: 【1】
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【0】, value = spark3
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark6
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【2】, value = spark9

Current partition index: 【3】
child rdd3 partition index = 【3】, value = parent rdd1 partition index = 【0】, value = spark1
child rdd3 partition index = 【3】, value = parent rdd1 partition index = 【1】, value = spark4
child rdd3 partition index = 【3】, value = parent rdd1 partition index = 【2】, value = spark7
As shown in figure ③ above, an empty partition appears and a shuffle occurs.
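
Empty partitions like the one above can be spotted without printing every element by counting the elements in each partition. The following is a small sketch, assuming a local session and integer test data; the object name `CoalescePartitionSizesSketch` and the helper `partitionSizes` are hypothetical, while `glom` is the standard RDD method that turns each partition into an array.

```scala
import org.apache.spark.sql.SparkSession

object CoalescePartitionSizesSketch {
  // Hypothetical helper: element count of every partition after coalescing
  // a 9-element, 3-partition RDD to n partitions.
  def partitionSizes(n: Int, shuffle: Boolean): Array[Int] = {
    val spark = SparkSession.builder
      .master("local")
      .appName("coalesce-sizes")
      .getOrCreate()
    val parent = spark.sparkContext.parallelize(1 to 9, 3)
    parent.coalesce(n, shuffle).glom().map(_.length).collect()
  }

  def main(args: Array[String]): Unit = {
    // A size of 0 marks an empty partition; which index is empty may differ per run setup.
    println(partitionSizes(4, shuffle = true).mkString(", "))
  }
}
```

Note that `repartition(4)` is shorthand for `coalesce(4, shuffle = true)`, so the same check applies to it.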

// More partitions than the parent RDD, shuffle = false
val rdd3: RDD[String] = rdd2.coalesce(4, false)
Output:
Current partition index: 【0】
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark1
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark2
child rdd3 partition index = 【0】, value = parent rdd1 partition index = 【0】, value = spark3

Current partition index: 【1】
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark4
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark5
child rdd3 partition index = 【1】, value = parent rdd1 partition index = 【1】, value = spark6

Current partition index: 【2】
child rdd3 partition index = 【2】, value = parent rdd1 partition index = 【2】, value = spark7
child rdd3 partition index = 【2】, value = parent rdd1 partition index = 【2】, value = spark8
child rdd3 partition index = 【2】, value = parent rdd1 partition index = 【2】, value = spark9
Conclusion: when the requested number of partitions is greater than the parent RDD's partition count and shuffle is false, the repartitioning has no effect, and the child RDD keeps the same partitioning as the parent RDD.
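
The four cases covered in this article can be summarized by comparing the resulting partition counts. This is a minimal sketch, assuming a local session and integer test data; the object name `CoalesceCountSketch` and the helper `partitionCount` are hypothetical, while `getNumPartitions` is a standard RDD method.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceCountSketch {
  // Hypothetical helper: partition count after coalescing a 3-partition RDD.
  def partitionCount(n: Int, shuffle: Boolean): Int = {
    val spark = SparkSession.builder
      .master("local")
      .appName("coalesce-count")
      .getOrCreate()
    val parent = spark.sparkContext.parallelize(1 to 9, 3)
    parent.coalesce(n, shuffle).getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    println(partitionCount(2, shuffle = false)) // shrinks to 2
    println(partitionCount(2, shuffle = true))  // shrinks to 2
    println(partitionCount(4, shuffle = true))  // grows to 4
    println(partitionCount(4, shuffle = false)) // stays at 3: no effect
  }
}
```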
