Spark Transformation: coalesce

coalesce: to combine, to merge

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions an RDD. The first parameter is the target number of partitions; the second controls whether a shuffle is performed, defaulting to false. When shuffle is true, the records are redistributed across the new partitions with a HashPartitioner; without a shuffle, existing partitions are simply merged locally.
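A typical reason to call coalesce is to shrink the partition count after a highly selective filter, so downstream tasks do not pay scheduling overhead for many near-empty partitions. A minimal sketch, assuming an existing SparkContext sc as in spark-shell (the path and the filter predicate are illustrative, not from the session below):

// A large input file may start with many partitions.
val logs = sc.textFile("/qgzang/1.txt")

// After a selective filter, most partitions hold very few rows.
val errors = logs.filter(_.contains("ERROR"))

// Merge down without a shuffle: coalesce(4) builds a narrow dependency,
// combining existing partitions locally instead of redistributing records.
val compact = errors.coalesce(4)
compact.saveAsTextFile("/qgzang/errors")  // writes at most 4 output files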

scala> var data = sc.textFile("/qgzang/1.txt")  // create an RDD
16/07/22 11:44:52 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 229.6 KB, free 599.0 KB)
16/07/22 11:44:52 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 19.6 KB, free 618.6 KB)
16/07/22 11:44:52 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.122.1:57099 (size: 19.6 KB, free: 511.4 MB)
16/07/22 11:44:52 INFO spark.SparkContext: Created broadcast 5 from textFile at <console>:27
data: org.apache.spark.rdd.RDD[String] = /qgzang/1.txt MapPartitionsRDD[5] at textFile at <console>:27

scala> data.collect   
16/07/22 11:45:01 INFO scheduler.DAGScheduler: Job 3 finished: collect at <console>:30, took 0.244304 s
res3: Array[String] = Array(hello world, hello spark, hello hive)

scala> data.partitions.size // the RDD has 2 partitions by default
res4: Int = 2

scala> var rdd1 = data.coalesce(1) // coalesce the RDD down to 1 partition
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[6] at coalesce at <console>:29

scala> rdd1.partitions.size
res5: Int = 1

scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[7] at coalesce at <console>:29

scala> rdd1.partitions.size
res6: Int = 2
// If the requested partition count is larger than the current one, the shuffle
// parameter must be set to true; otherwise the new partition count is not applied.

scala> var rdd1 = data.coalesce(4, true) // with shuffle set to true, the new partition count takes effect
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at coalesce at <console>:29

scala> rdd1.partitions.size
res7: Int = 4
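For comparison, repartition(n) in the RDD API is implemented as coalesce(n, shuffle = true), so the shuffle-enabled call above is equivalent to a plain repartition. A quick sketch against the same data RDD:

// repartition always shuffles; it is shorthand for coalesce with shuffle = true
val byRepartition = data.repartition(4)
val byCoalesce    = data.coalesce(4, shuffle = true)
println(byRepartition.partitions.size)  // 4
println(byCoalesce.partitions.size)     // 4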