Function prototype
def coalesce(numPartitions: Int, shuffle: Boolean = false)
            (implicit ord: Ordering[T] = null): RDD[T]
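Returns a new RDD with numPartitions partitions. If shuffle is set to true, the data is redistributed through a shuffle. With shuffle left at false, coalesce can only reduce the partition count; asking for more partitions than the RDD already has requires shuffle = true.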
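To make the interaction of the two parameters concrete, here is a minimal sketch of a standalone program. The object name `CoalesceCounts`, the helper `counts`, the local master, and the choice of 10 source partitions are all assumptions made for illustration; only `coalesce`, `parallelize`, and `partitions.length` come from Spark itself. It needs spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceCounts {
  // Returns the partition counts after three different coalesce calls
  // on an RDD that starts with 10 partitions (an arbitrary choice).
  def counts(sc: SparkContext): (Int, Int, Int) = {
    val data = sc.parallelize(1 to 100, 10)
    val fewer     = data.coalesce(4).partitions.length                  // merge down, no shuffle
    val moreNoShf = data.coalesce(20).partitions.length                 // cannot grow without a shuffle
    val moreShf   = data.coalesce(20, shuffle = true).partitions.length // grows via a shuffle
    (fewer, moreNoShf, moreShf)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-counts")
    val sc = new SparkContext(conf)
    try {
      val (fewer, moreNoShf, moreShf) = counts(sc)
      println(s"coalesce(4):                 $fewer")     // expected 4
      println(s"coalesce(20):                $moreNoShf") // expected 10, count stays unchanged
      println(s"coalesce(20, shuffle=true):  $moreShf")   // expected 20
    } finally sc.stop()
  }
}
```

Reducing the count without a shuffle is the cheap case, because existing partitions are simply grouped together; growing the count means records have to move, which is why it only works with shuffle = true.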
Example
scala> var data = sc.parallelize(List(1, 2, 3, 4))
data: org.apache.spark.rdd.RDD[Int] =
    ParallelCollectionRDD[45] at parallelize at <console>:12

scala> data.partitions.length

scala> val result = data.coalesce(2, false)
result: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[57] at coalesce at <console>:14

scala> result.partitions.length

scala> result.toDebugString
(2) CoalescedRDD[57] at coalesce at <console>:14 []
 |  ParallelCollectionRDD[45] at parallelize at <console>:12 []

scala> val result1 = data.coalesce(2, true)
result1: org.apache.spark.rdd.RDD[Int] = MappedRDD[61] at coalesce at <console>:14

scala> result1.toDebugString
(2) MappedRDD[61] at coalesce at <console>:14 []
 |  CoalescedRDD[60] at coalesce at <console>:14 []
 |  ShuffledRDD[59] at coalesce at <console>:14 []
 +-(30) MapPartitionsRDD[58] at coalesce at <console>:14 []
    |   ParallelCollectionRDD[45] at parallelize at <console>:12 []
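The output above shows that no shuffle takes place when shuffle is false: the lineage of result is just a CoalescedRDD sitting directly on the ParallelCollectionRDD. When shuffle is true, a shuffle is performed, and the lineage of result1 contains a ShuffledRDD between the 30-partition source and the 2-partition result. RDD.partitions.length returns the number of partitions of an RDD.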
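The same check can be done in code instead of by reading the debug string. The sketch below walks an RDD's lineage looking for a `ShuffleDependency`. The object name `CoalesceLineage`, the helper `hasShuffle`, the local master, and the explicit 30 slices (chosen to mirror the session above, where the count came from the environment's default parallelism) are assumptions for illustration. It needs spark-core on the classpath.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceLineage {
  // True if any step in the RDD's lineage is a shuffle dependency.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists { dep =>
      dep.isInstanceOf[ShuffleDependency[_, _, _]] || hasShuffle(dep.rdd)
    }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-lineage")
    val sc = new SparkContext(conf)
    try {
      // 30 slices requested explicitly to match the source RDD in the session.
      val data = sc.parallelize(List(1, 2, 3, 4), 30)

      val result  = data.coalesce(2, shuffle = false)
      val result1 = data.coalesce(2, shuffle = true)

      println(s"source:  ${data.partitions.length} partitions")
      println(s"result:  ${result.partitions.length} partitions, shuffle in lineage = ${hasShuffle(result)}")
      println(s"result1: ${result1.partitions.length} partitions, shuffle in lineage = ${hasShuffle(result1)}")
    } finally sc.stop()
  }
}
```

Both results end up with 2 partitions; only result1 should report a shuffle in its lineage. The exact RDD class names printed by toDebugString vary between Spark versions, so checking dependencies is less brittle than matching on names such as MappedRDD.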