The following operations affect the partitioner of the output RDD in Spark:
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, partitionBy, sort, as well as mapValues, flatMapValues, and filter (the last three only when the parent RDD has a partitioner). All other transformations do not set a partitioner on the output RDD; its partitioner is generally None. The example below illustrates this:
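Before the shell session, the rule that mapValues keeps the parent's partitioner while a plain map does not can be illustrated without Spark at all. The snippet below is a toy model, not Spark code; `TinyPairRDD` and `Part` are illustrative stand-ins:

```scala
// Toy model (not Spark API) of why mapValues can keep the parent's
// partitioner while map cannot: map may rewrite keys, mapValues cannot.
case class Part(numPartitions: Int)

case class TinyPairRDD[K, V](data: Seq[(K, V)], partitioner: Option[Part]) {
  // map sees the whole pair, so keys may change: the parent's
  // partitioning no longer describes the output.
  def map[K2, V2](f: ((K, V)) => (K2, V2)): TinyPairRDD[K2, V2] =
    TinyPairRDD(data.map(f), None)

  // mapValues only touches values; keys stay in place, so the
  // parent's partitioner still holds for the output.
  def mapValues[V2](f: V => V2): TinyPairRDD[K, V2] =
    TinyPairRDD(data.map { case (k, v) => (k, f(v)) }, partitioner)
}

val parent = TinyPairRDD(Seq((1, "a"), (2, "b")), Some(Part(4)))
println(parent.mapValues(_.toUpperCase).partitioner)                 // Some(Part(4))
println(parent.map { case (k, v) => (k, v.toUpperCase) }.partitioner) // None
```

The point is structural: a `map` may rewrite keys, so the old partitioning cannot be trusted, whereas `mapValues` provably leaves every key where it was.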
```scala
scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] =
  ParallelCollectionRDD[4] at parallelize at <console>:12

scala> val a = sc.parallelize(List(2, 51, 2, 7, 3))
a: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[5] at parallelize at <console>:12

scala> val a = sc.parallelize(List(2, 51, 2))
a: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[6] at parallelize at <console>:12

scala> val b = sc.parallelize(List(3, 1, 4))
b: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[7] at parallelize at <console>:12

scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(Int, Int)] =
  ZippedPartitionsRDD2[8] at zip at <console>:16

scala> val result = pairs.join(c)
result: org.apache.spark.rdd.RDD[(Int, (Int, Int))] =
  FlatMappedValuesRDD[11] at join at <console>:20

scala> result.partitioner
res6: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@2)
```
As you can see, the output RDD result ends up with a HashPartitioner: neither of the two RDDs going into the join had a partitioner set, so the default HashPartitioner was used. This follows from the implementation of join:
```scala
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
  join(other, defaultPartitioner(self, other))
}

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  // Order the input RDDs by partition count, largest first
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  // Reuse the first existing partitioner, preferring the largest RDD
  for (r <- bySize if r.partitioner.isDefined) {
    return r.partitioner.get
  }
  // Otherwise fall back to a new HashPartitioner
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
```
The defaultPartitioner function determines the partitioner of the result RDD. From the implementation above:

1. If neither RDD in the join has a partitioner, the result RDD uses a new HashPartitioner, sized by spark.default.parallelism if that is set, otherwise by the partition count of the largest input;
2. If exactly one of the two RDDs has a partitioner, the result RDD uses that parent's partitioner;
3. If both RDDs have a partitioner, the result RDD uses the partitioner of the RDD with more partitions (bySize is sorted by partition count in descending order), not necessarily the RDD on which join was called.
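The three rules above can be checked against a minimal, Spark-free sketch of the same selection logic. `FakeRDD` and `FakePart` below are illustrative stand-ins, not Spark classes, and the spark.default.parallelism branch is ignored for brevity:

```scala
// Spark-free sketch of defaultPartitioner's selection order.
case class FakePart(label: String)
case class FakeRDD(numPartitions: Int, partitioner: Option[FakePart])

// Returns either an existing partitioner to reuse (Right), or the
// partition count a fresh HashPartitioner would get (Left).
def pickPartitioner(rdd: FakeRDD, others: FakeRDD*): Either[Int, FakePart] = {
  // Same ordering as Spark: largest partition count first
  val bySize = (Seq(rdd) ++ others).sortBy(_.numPartitions).reverse
  bySize.find(_.partitioner.isDefined) match {
    case Some(r) => Right(r.partitioner.get)      // reuse an existing partitioner
    case None    => Left(bySize.head.numPartitions) // new HashPartitioner of this size
  }
}

// Rule 1: neither side has a partitioner -> new one, sized by the larger input.
println(pickPartitioner(FakeRDD(3, None), FakeRDD(5, None)))                  // Left(5)
// Rule 2: exactly one side has a partitioner -> reuse it.
println(pickPartitioner(FakeRDD(3, None), FakeRDD(5, Some(FakePart("h")))))   // Right(FakePart(h))
// Rule 3: both sides have one -> the RDD with more partitions wins.
println(pickPartitioner(FakeRDD(8, Some(FakePart("big"))),
                        FakeRDD(2, Some(FakePart("small")))))                 // Right(FakePart(big))
```

Note in particular rule 3: because the inputs are sorted by partition count before the first defined partitioner is taken, the "winner" depends on size, not on which RDD invoked join.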