Spark Transformation: coalesce

coalesce: to combine, to merge

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions an RDD. The first parameter is the target number of partitions; the second controls whether a shuffle is performed, defaulting to false. When shuffle is true, the records are redistributed across the new partitions with a HashPartitioner; without a shuffle, existing partitions are simply merged locally.
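A typical reason to call coalesce is to shrink the partition count after a highly selective filter, so downstream tasks do not pay scheduling overhead for many near-empty partitions. A minimal sketch, assuming an existing SparkContext sc as in spark-shell (the path and the filter predicate are illustrative, not from the session below):

// A large input file may start with many partitions.
val logs = sc.textFile("/qgzang/1.txt")

// After a selective filter, most partitions hold very few rows.
val errors = logs.filter(_.contains("ERROR"))

// Merge down without a shuffle: coalesce(4) builds a narrow dependency,
// combining existing partitions locally instead of redistributing records.
val compact = errors.coalesce(4)
compact.saveAsTextFile("/qgzang/errors")  // writes at most 4 output files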

scala> var data = sc.textFile("/qgzang/1.txt")  // create an RDD
16/07/22 11:44:52 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 229.6 KB, free 599.0 KB)
16/07/22 11:44:52 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 19.6 KB, free 618.6 KB)
16/07/22 11:44:52 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.122.1:57099 (size: 19.6 KB, free: 511.4 MB)
16/07/22 11:44:52 INFO spark.SparkContext: Created broadcast 5 from textFile at <console>:27
data: org.apache.spark.rdd.RDD[String] = /qgzang/1.txt MapPartitionsRDD[5] at textFile at <console>:27

scala> data.collect   
16/07/22 11:45:01 INFO scheduler.DAGScheduler: Job 3 finished: collect at <console>:30, took 0.244304 s
res3: Array[String] = Array(hello world, hello spark, hello hive)

scala> data.partitions.size // the RDD has 2 partitions by default
res4: Int = 2

scala> var rdd1 = data.coalesce(1) // coalesce the RDD down to 1 partition
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[6] at coalesce at <console>:29

scala> rdd1.partitions.size
res5: Int = 1

scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[7] at coalesce at <console>:29

scala> rdd1.partitions.size
res6: Int = 2
// If the requested partition count is larger than the current one, the shuffle
// parameter must be set to true; otherwise the new partition count is not applied.

scala> var rdd1 = data.coalesce(4, true) // with shuffle set to true, the new partition count takes effect
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at coalesce at <console>:29

scala> rdd1.partitions.size
res7: Int = 4
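For comparison, repartition(n) in the RDD API is implemented as coalesce(n, shuffle = true), so the shuffle-enabled call above is equivalent to a plain repartition. A quick sketch against the same data RDD:

// repartition always shuffles; it is shorthand for coalesce with shuffle = true
val byRepartition = data.repartition(4)
val byCoalesce    = data.coalesce(4, shuffle = true)
println(byRepartition.partitions.size)  // 4
println(byCoalesce.partitions.size)     // 4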