Spark Source Code - Spark Operators - 2 - Shuffle Operators

zdaiqing

Last modified: 2022-08-26 18:44:55

First published: 2022-08-17 17:17:03

Article link: https://blog.csdn.net/m0_37817767/article/details/126390378

1.Overview

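Note: the author states that this analysis is based on Spark version 2.11. Since 2.11 is a Scala version rather than a Spark release, this most likely means a Spark 2.x build for Scala 2.11; the code quoted below appears consistent with the Spark 2.4 line.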

2.Deduplication operators

2.1.distinct

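distinct can be called with or without an explicit number of partitions.
When no partition count is given, the result keeps the partition count of the calling RDD.
distinct() delegates to distinct(numPartitions: Int).
distinct pairs every element with null, merges the pairs with reduceByKey, and then keeps only the keys, which removes the duplicates.
The shuffle in distinct happens in the reduceByKey step.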

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
	
  //Deduplicate by merging pairs with reduceByKey and then keeping only the keys
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }

  //Use the calling RDD's partition count for the result
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }
}
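
The three steps of distinct can be tried out on an ordinary Scala collection. The following is a minimal sketch, not Spark code: DistinctSketch, reduceByKeyLocal and distinctLocal are hypothetical names, and a local Map stands in for the shuffle.

```scala
object DistinctSketch {
  // Local stand-in for reduceByKey: merge all values that share a key with `f`.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // Same three steps as RDD.distinct: pair with null, reduce per key, keep the keys.
  def distinctLocal[T](xs: Seq[T]): Set[T] =
    reduceByKeyLocal(xs.map(x => (x, null)))((x, _) => x).keySet
}
```

For example, DistinctSketch.distinctLocal(Seq(1, 2, 2, 3, 3, 3)) yields Set(1, 2, 3). The null values carry no information; they exist only so that the elements can be used as keys.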

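3.Aggregation operators

Differences between the common aggregation operators:

Capability	reduceByKey	groupByKey	aggregateByKey	combineByKey
Map-side combine	Yes	No	Yes	Yes
Caller supplies an initial value per key	No	No	Yes	No
Caller supplies a function that creates the initial combiner	No	No	No	Yes
Caller supplies the within-partition merge function	Yes	No	Yes	Yes
Caller supplies the cross-partition merge function	Yes	No	Yes	Yes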

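3.1.Reusable functions

3.1.1.Default partitioner

Choosing the default partitioner:

Among the input RDDs that already have a partitioner, take the one with the most partitions; reuse its partitioner if it is eligible, or if its partition count is greater than the default partition count.
Otherwise build a HashPartitioner with the default partition count.

Determining the default partition count:

If spark.default.parallelism is set in the configuration, use that value.
Otherwise use the largest partition count among the input RDDs.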

//defaultPartitioner is defined in the Partitioner companion object
object Partitioner {
    
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    //Keep the RDDs whose partitioner has more than 0 partitions
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    //Among those, the RDD with the most partitions
    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    //Default partition count: spark.default.parallelism if set, otherwise the largest partition count among the RDDs
    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // Reuse the existing partitioner if it is eligible or has more partitions than the default count; otherwise build a HashPartitioner with the default count
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }
}
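
The decision in defaultPartitioner can be modeled without Spark. The sketch below is illustrative only: RddInfo, Choice and choose are hypothetical names. The article does not show isEligiblePartitioner, so the eligibility rule used here (the candidate's partition count is within one order of magnitude of the largest partition count) is an assumption and should be checked against the Spark source for the version in use.

```scala
object DefaultPartitionerSketch {
  // Hypothetical stand-in for an RDD: its partition count and whether it carries a partitioner.
  final case class RddInfo(numPartitions: Int, hasPartitioner: Boolean)

  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice
  final case class NewHashPartitioner(numPartitions: Int) extends Choice

  // ASSUMPTION: eligible means less than one order of magnitude below the largest partition count.
  def isEligible(candidate: RddInfo, rdds: Seq[RddInfo]): Boolean = {
    val maxPartitions = rdds.map(_.numPartitions).max
    math.log10(maxPartitions.toDouble) - math.log10(candidate.numPartitions.toDouble) < 1
  }

  // configuredParallelism models spark.default.parallelism: Some(n) if set, None otherwise.
  def choose(rdds: Seq[RddInfo], configuredParallelism: Option[Int]): Choice = {
    val withPartitioner = rdds.filter(r => r.hasPartitioner && r.numPartitions > 0)
    val candidate =
      if (withPartitioner.nonEmpty) Some(withPartitioner.maxBy(_.numPartitions)) else None
    val defaultNum = configuredParallelism.getOrElse(rdds.map(_.numPartitions).max)
    candidate match {
      case Some(c) if isEligible(c, rdds) || defaultNum < c.numPartitions =>
        ReuseExisting(c.numPartitions)
      case _ =>
        NewHashPartitioner(defaultNum)
    }
  }
}
```

With inputs of 100 unpartitioned partitions and 5 partitioned ones, and no configured parallelism, the sketch builds a new hash partitioner with 100 partitions, because reusing the 5-partition partitioner would cut parallelism by more than a factor of ten.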

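3.1.2.combineByKeyWithClassTag

Map-side combine is enabled by default: mapSideCombine defaults to true, so it applies whenever the caller does not pass the argument.
If a partitioner is passed, that partitioner is used.
If a partition count is passed, a HashPartitioner with that many partitions is used.
If neither is passed, the default partitioner is used.
If the calling RDD is already partitioned by the target partitioner, values are combined inside each partition with mapPartitions and no shuffle takes place; otherwise a ShuffledRDD is built.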

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
  
  //Uses the default partitioner; map-side combine enabled by default
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
  }
  
  //Uses a HashPartitioner; map-side combine enabled by default
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      new HashPartitioner(numPartitions))
  }
   
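  //Uses the given partitioner; map-side combine enabled by default
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,       //creates the initial combiner from the first value of a key
      mergeValue: (C, V) => C,      //merges a new value into an existing combiner
      mergeCombiners: (C, C) => C,  //merges two combiners produced by mergeValue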
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    
    //Build the aggregator
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    
    //Check whether the calling RDD is already partitioned by the requested partitioner
    if (self.partitioner == Some(partitioner)) {
      //No shuffle needed: combine within each partition (MapPartitionsRDD)
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      //Build a ShuffledRDD and set the serializer, aggregator, and map-side combine flag
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
}
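
To make the roles of createCombiner, mergeValue and mergeCombiners concrete, here is a minimal local model of the two phases. It is a sketch under stated assumptions, not Spark's Aggregator: CombineByKeySketch and its methods are hypothetical names, and each inner Seq plays the role of one partition.

```scala
object CombineByKeySketch {
  // Map side: fold one partition's records into a single combiner per key.
  def combineValuesByKey[K, V, C](
      partition: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    partition.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => mergeValue(c, v)  // key already seen in this partition
        case None    => createCombiner(v) // first value for this key in this partition
      }
      acc.updated(k, combined)
    }

  // Reduce side: merge the per-partition combiners that share a key.
  def mergeAcrossPartitions[K, C](
      perPartition: Seq[Map[K, C]],
      mergeCombiners: (C, C) => C): Map[K, C] =
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }

  // Example: per-key (sum, count), from which an average could be derived.
  def sumAndCount(partitions: Seq[Seq[(String, Int)]]): Map[String, (Int, Int)] = {
    val perPartition = partitions.map { p =>
      combineValuesByKey[String, Int, (Int, Int)](
        p,
        v => (v, 1),
        (c, v) => (c._1 + v, c._2 + 1))
    }
    mergeAcrossPartitions[String, (Int, Int)](
      perPartition,
      (a, b) => (a._1 + b._1, a._2 + b._2))
  }
}
```

For the two partitions Seq(("a", 1), ("b", 2), ("a", 3)) and Seq(("a", 5), ("b", 4)), sumAndCount returns (9, 3) for "a" and (6, 2) for "b". Only one combiner per key leaves each partition, which is the point of map-side combine.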

3.2.reduceByKey

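Partitioner selection:

Works the same way as in combineByKeyWithClassTag: an explicit partitioner is used as given, a partition count yields a HashPartitioner, and otherwise the default partitioner applies.

Underlying implementation:

Every reduceByKey overload ends up in reduceByKey(partitioner: Partitioner, func: (V, V) => V), which calls combineByKeyWithClassTag.
That call passes createCombiner (the identity function), mergeValue and mergeCombiners (both func), and the partitioner.

Notes:

reduceByKey combines on the map side, because it leaves mapSideCombine at its default of true (see the combineByKeyWithClassTag analysis).
The same function merges values within a partition and across partitions, and the caller must supply it.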

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
    
  //Uses the given partitioner; implemented through combineByKeyWithClassTag
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  //Uses a HashPartitioner
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  //Uses the default partitioner
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
}

3.3.groupByKey

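Partitioner selection:

Works the same way as in combineByKeyWithClassTag.

Underlying implementation:

Every groupByKey overload ends up in groupByKey(partitioner: Partitioner), which calls combineByKeyWithClassTag with mapSideCombine = false.

Notes:

groupByKey and reduceByKey differ in map-side combine: groupByKey disables it, reduceByKey uses it.
reduceByKey requires the caller to supply the merge function; groupByKey needs none, because it simply appends every value of a key to a CompactBuffer.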

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
    
  //Implemented through combineByKeyWithClassTag with map-side combine disabled
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
    
  //Uses a HashPartitioner
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }
    
  //Uses the default partitioner
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }
}
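
The practical effect of mapSideCombine can be approximated by counting the records that would cross the shuffle. The following is a rough model, not a measurement: MapSideCombineSketch and its methods are hypothetical names, and each inner Seq stands for one map-side partition.

```scala
object MapSideCombineSketch {
  // Without map-side combine (groupByKey): every input record is shuffled.
  def recordsShuffledWithoutCombine[K, V](partitions: Seq[Seq[(K, V)]]): Int =
    partitions.map(_.size).sum

  // With map-side combine (reduceByKey): one combined record per distinct key per partition.
  def recordsShuffledWithCombine[K, V](partitions: Seq[Seq[(K, V)]]): Int =
    partitions.map(_.map(_._1).distinct.size).sum
}
```

For the partitions Seq(("a", 1), ("a", 2), ("b", 3)) and Seq(("a", 4), ("a", 5)), five records are shuffled without map-side combine and three with it. The gap grows with the number of repeated keys per partition.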

3.4.groupBy

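Partitioner selection works the same way as in combineByKeyWithClassTag.
The grouping itself is done by groupByKey.
The grouping key is computed by the function f passed to groupBy.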

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  //Uses the default partitioner
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

  //Uses a HashPartitioner with the given partition count
  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

  //The function f computes the grouping key; groupByKey then groups with the given partitioner
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
}

3.5.aggregateByKey

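Summary:

Partitioner selection works the same way as in combineByKeyWithClassTag.
Aggregation is done by combineByKeyWithClassTag.
The caller must supply zeroValue, the initial value for each key; it is applied when the first value for a key is seen in a partition.
Map-side combine is enabled by default.

Notes:

aggregateByKey compared with reduceByKey:
- aggregateByKey requires an initial value for each key; reduceByKey does not;
- aggregateByKey takes two functions, seqOp for merging within a partition and combOp for merging across partitions; reduceByKey takes one function that serves both purposes;
aggregateByKey compared with groupByKey:
- aggregateByKey requires an initial value for each key; groupByKey does not;
- aggregateByKey combines on the map side; groupByKey does not;
- aggregateByKey requires seqOp and combOp; groupByKey takes no merge function;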

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
  
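  //zeroValue: initial value for each key
  //seqOp: merges a value into the accumulator within a partition
  //combOp: merges two accumulators across partitions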
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    //Delegate to combineByKeyWithClassTag
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }
  
  //Uses a HashPartitioner with the given partition count
  def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
  }
  
  //Uses the default partitioner
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
}
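
The code above serializes zeroValue once and deserializes a fresh copy whenever a key needs its initial accumulator. The sketch below illustrates why a fresh copy matters when the zero is mutable. It is a local model only: ZeroValueSketch and aggregateLocal are hypothetical names, and a plain factory function stands in for the serializer round trip.

```scala
import scala.collection.mutable.ArrayBuffer

object ZeroValueSketch {
  // Local stand-in for the map-side part of aggregateByKey on one partition.
  // createZero plays the role of the deserializing closure in the Spark code.
  def aggregateLocal[K, V, U](partition: Seq[(K, V)], createZero: () => U)(
      seqOp: (U, V) => U): Map[K, U] =
    partition.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      // A zero is requested only when the key has no accumulator yet.
      val current = acc.getOrElse(k, createZero())
      acc.updated(k, seqOp(current, v))
    }

  val data: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

  // Fresh zero per key: each key gets its own buffer.
  def withFreshZero: Map[String, ArrayBuffer[Int]] =
    aggregateLocal(data, () => ArrayBuffer.empty[Int])((buf, v) => buf += v)

  // One shared mutable zero: both keys end up pointing at the same buffer.
  def withSharedZero: Map[String, ArrayBuffer[Int]] = {
    val shared = ArrayBuffer.empty[Int]
    aggregateLocal(data, () => shared)((buf, v) => buf += v)
  }
}
```

With a fresh zero, "a" collects 1 and 3 and "b" collects 2. With the shared buffer, both keys see 1, 2 and 3, which is the kind of cross-key contamination that cloning the zero value avoids.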

3.6.combineByKey

Map-side combine is enabled by default.
The caller must supply createCombiner, mergeValue and mergeCombiners.
Implemented by calling combineByKeyWithClassTag.

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
  
  //Uses the default partitioner; map-side combine enabled by default
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }
  
  //Uses a HashPartitioner; map-side combine enabled by default
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
  } 
  
  //Uses the given partitioner; map-side combine enabled by default
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }
}

4.Sorting operators

4.1.sortByKey

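Sort order and partition count are configurable; the defaults are ascending order and the calling RDD's partition count.
The partitioner is a RangePartitioner.
A ShuffledRDD is built with that partitioner and the key ordering.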

class OrderedRDDFunctions[K : Ordering : ClassTag,
                          V: ClassTag,
                          P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable {
    
  //ascending: sort in ascending order; defaults to true
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    //Partitioner: RangePartitioner
    val part = new RangePartitioner(numPartitions, self, ascending)
    //Build a ShuffledRDD and set the key ordering
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }    
}
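
Range partitioning followed by a sort inside each partition produces a total order once the partitions are read in sequence. The sketch below models that idea on local collections. It is an assumption-laden illustration: RangeSortSketch, partitionFor and sortByKeyLocal are hypothetical names, and the range bounds are fixed by hand, whereas RangePartitioner derives them by sampling the data.

```scala
object RangeSortSketch {
  // Target partition: index of the first bound that is >= key, or bounds.length if none.
  def partitionFor(key: Int, bounds: Seq[Int]): Int = {
    val i = bounds.indexWhere(bound => key <= bound)
    if (i < 0) bounds.length else i
  }

  // Route every key to its range, then sort each range on its own.
  def sortByKeyLocal(keys: Seq[Int], bounds: Seq[Int]): Seq[Seq[Int]] = {
    val grouped = keys.groupBy(key => partitionFor(key, bounds))
    (0 to bounds.length).map(p => grouped.getOrElse(p, Seq.empty[Int]).sorted)
  }
}
```

With keys Seq(9, 1, 5, 3, 7, 2) and bounds Seq(3, 6), the three partitions are (1, 2, 3), (5) and (7, 9). Concatenating them gives the fully sorted sequence without any partition having to see all the data.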

4.2.sortBy

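The caller must supply the function that extracts the sort key.
Implemented by calling sortByKey.
Sort order and partition count are configurable; the defaults are ascending order and the calling RDD's partition count.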

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  def sortBy[K](
      f: (T) => K,//the function that extracts the sort key
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
        
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)//implemented through sortByKey
        .values
  }
}

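5.Repartitioning operators

repartition is implemented by calling coalesce.
repartition always shuffles; coalesce does not shuffle by default but can be told to.
With shuffle = true, coalesce adds a shuffle stage so that a drastic reduction in partitions does not squeeze the upstream computation onto a few nodes, which could otherwise cause OOM.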
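repartition is the operator to use for increasing the partition count, since coalesce without a shuffle cannot do that; it also suits drastic reductions where upstream parallelism must be preserved. For a moderate reduction, coalesce without a shuffle is cheaper.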

5.1.coalesce

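The caller chooses whether to shuffle; the default is no shuffle.
A CoalescedRDD is built in both cases.
With shuffle = true, a ShuffledRDD is inserted first and the CoalescedRDD is built on top of it.
The extra shuffle keeps the upstream stage distributed and parallel, so a reduced partition count does not lead to extreme cases:
- For example, coalescing to a single partition without a shuffle would run the whole upstream computation in one task on one node, which can easily cause OOM;
- With the shuffle, the reduction happens in the shuffle and the upstream stage still runs in parallel;
When shuffling, each upstream partition first assigns consecutive integer keys to its elements, starting from a random position, so that the elements spread evenly over the output partitions; a HashPartitioner then places them by that key.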

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  def coalesce(numPartitions: Int, shuffle: Boolean = false,//no shuffle by default
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
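      //Defines how elements are distributed over the new partitions
      //Starting from a random partition, spread elements evenly across the output partitions
      //index: partition index in the upstream RDD; items: iterator over that partition's elements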
      val distributePartition = (index: Int, items: Iterator[T]) => {
        //Starting position: a random number in [0, numPartitions), seeded by the partition index
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          //Assign consecutive keys from position + 1 so that elements spread evenly
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // The added shuffle: partition by the generated key with a HashPartitioner
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
}
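
The key-assignment trick in distributePartition can be reproduced locally to see why the output is balanced. The following is a sketch under stated assumptions: DistributeSketch and its methods are hypothetical names, a strict Seq replaces the iterator, and targetPartition is a simplified non-negative modulo standing in for hash partitioning of Int keys.

```scala
import scala.util.Random
import scala.util.hashing

object DistributeSketch {
  // Same idea as in coalesce: start at a seeded random position, then hand out consecutive keys.
  // `items` must be a strict collection, because the counter is updated as a side effect.
  def distributePartition[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // Simplified hash partitioning for Int keys: non-negative modulo of the hash code.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Number of elements from one upstream partition that land in each output partition.
  def countsPerTarget[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    distributePartition(index, items, numPartitions)
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (target, elems) => (target, elems.size) }
}
```

Because the keys are consecutive integers, ten elements sent to three output partitions always split 4, 3 and 3; the random start only decides which partition receives the extra element. Seeding with the partition index keeps the assignment reproducible when a task is retried.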

5.2.repartition

Implemented by calling coalesce.
The call always passes shuffle = true.
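Because it always shuffles, it can increase the partition count and also handles drastic reductions safely; for a moderate reduction, coalesce without a shuffle avoids the shuffle cost.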

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    //Always shuffle
    coalesce(numPartitions, shuffle = true)
  }
}

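6.Set and table-style operators

6.1.intersection-RDD intersection

How it works:

Both RDDs are mapped so that each element v becomes the pair (v, null).
cogroup groups the two mapped RDDs by key; the result is a single RDD with one tuple per key, holding the key and two value lists, one from each input RDD.
A key is kept only if both of its value lists are non-empty, which means the element exists in both RDDs.

Partitioner:

A partitioner may be passed explicitly; if none is passed, cogroup falls back to the default partitioner.
If a partition count is passed, a HashPartitioner with that many partitions is used.

Notes:

Both RDDs must have the same element type.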

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  def intersection(other: RDD[T]): RDD[T] = withScope {
    //cogroup: for each key k in the calling RDD or other, the result holds one tuple with k and the value lists from both RDDs
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        //Keep a key only if both value lists are non-empty, i.e. the element exists in both RDDs
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }
  
  //Uses the given partitioner
  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }
  
  //Uses a HashPartitioner
  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
}
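
The cogroup-then-filter logic of intersection can be followed on local collections. This is a minimal sketch, not Spark code: IntersectionSketch, cogroupLocal and intersectionLocal are hypothetical names, and cogroupLocal only mimics the shape of cogroup's output.

```scala
object IntersectionSketch {
  // Local stand-in for cogroup: one entry per key, with the value lists from both sides.
  def cogroupLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues = left.collect { case (key, v) if key == k => v }
      val rightValues = right.collect { case (key, w) if key == k => w }
      (k, (leftValues, rightValues))
    }.toMap
  }

  // Same steps as RDD.intersection: pair with null, cogroup, keep keys present on both sides.
  def intersectionLocal[T](a: Seq[T], b: Seq[T]): Set[T] =
    cogroupLocal(a.map(v => (v, null)), b.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keySet
}
```

IntersectionSketch.intersectionLocal(Seq(1, 2, 2, 3), Seq(2, 3, 4)) yields Set(2, 3). Duplicates disappear as a side effect, because each element becomes a single key in the cogrouped result.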

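6.2.subtractByKey-RDD difference by key

Function:

Returns an RDD containing the pairs from this whose keys are not in other.

Partitioner:

If a partitioner is passed, that partitioner is used.
If a partition count is passed, a HashPartitioner with that many partitions is built and used.
If neither is passed, the calling RDD's partitioner is used, or a HashPartitioner built from its partition count if it has none.

How it works:

Builds a SubtractedRDD.

Notes:

Applies to pair RDDs, i.e. RDDs of key-value pairs.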

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
    
  //Use the calling RDD's partitioner, or a HashPartitioner built from its partition count
  def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)] = self.withScope {
    subtractByKey(other, self.partitioner.getOrElse(new HashPartitioner(self.partitions.length)))
  }    
   
  //Uses a HashPartitioner
  def subtractByKey[W: ClassTag](
      other: RDD[(K, W)],
      numPartitions: Int): RDD[(K, V)] = self.withScope {
    subtractByKey(other, new HashPartitioner(numPartitions))
  }
    
  //Uses the given partitioner
  def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)] = self.withScope {
    new SubtractedRDD[K, V, W](self, other, p)
  }
}
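
The result of subtractByKey can be stated in a few lines on local collections. The sketch below models the semantics only, not how SubtractedRDD computes it: SubtractByKeySketch and subtractByKeyLocal are hypothetical names.

```scala
object SubtractByKeySketch {
  // Keep the pairs of `left` whose key never occurs in `right`; values in `right` are ignored.
  def subtractByKeyLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
    val rightKeys = right.map(_._1).toSet
    left.filterNot { case (k, _) => rightKeys.contains(k) }
  }
}
```

Subtracting Seq(("a", "x")) from Seq(("a", 1), ("b", 2), ("a", 3)) leaves Seq(("b", 2)). Note that the value types of the two sides need not match, which mirrors the separate type parameter W in the Spark signature.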

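6.3.subtract-RDD difference

Function:

Returns an RDD containing the elements of this RDD that are not in other.

Partitioner:

If a partitioner is passed:
- When the calling RDD is already partitioned by that partitioner p, its elements must themselves be (K, V) pairs that p partitions by K. Since subtract wraps each element x as (x, null), a wrapper partitioner p2 is built that keeps p's partition count and unwraps the original key before delegating to p;
- Otherwise the passed partitioner is used as is;
If a partition count is passed, a HashPartitioner with that many partitions is built and used.
If neither is passed, the calling RDD's partitioner is used, or a HashPartitioner built from its partition count if it has none.

How it works:

Implemented by calling subtractByKey.

Notes:

subtract and subtractByKey differ in what they apply to: subtractByKey works on pair RDDs and compares keys only, while subtract works on RDDs of any element type and compares whole elements.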

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  
  //Use the calling RDD's partitioner, or a HashPartitioner built from its partition count
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }
  
  //Uses a HashPartitioner built from the given partition count
  def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    subtract(other, new HashPartitioner(numPartitions))
  }
  
  def subtract(
      other: RDD[T],
      p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
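    if (partitioner == Some(p)) {//the calling RDD is already partitioned by p
      //p handles elements of type T (really (K, V) here); wrap it so that it unwraps the key from the (x, null) pairs built below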
      val p2 = new Partitioner() {
        override def numPartitions: Int = p.numPartitions
        override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
      }
      
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
    } else {//otherwise use the passed partitioner as is
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
    }
  }
}
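
The purpose of the wrapper partitioner p2 is easier to see in isolation. The sketch below is a local illustration: SimplePartitioner, SimpleHashPartitioner and detupling are hypothetical names that mirror only the two methods overridden in the Spark code, and the cast uses (Any, Any) for simplicity.

```scala
object DetupleSketch {
  // Hypothetical minimal partitioner interface with the two methods overridden above.
  trait SimplePartitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Partitions by the hash code of the key (keys are assumed non-null here).
  final class SimpleHashPartitioner(val numPartitions: Int) extends SimplePartitioner {
    def getPartition(key: Any): Int = {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }

  // The wrapper: the incoming key is a whole (K, V) element, so take its first
  // component and let the original partitioner decide.
  def detupling(p: SimplePartitioner): SimplePartitioner = new SimplePartitioner {
    def numPartitions: Int = p.numPartitions
    def getPartition(key: Any): Int = p.getPartition(key.asInstanceOf[(Any, Any)]._1)
  }
}
```

With the wrapper, an element ("a", 1) used as a key is sent to the same partition that the original partitioner chose for key "a". The wrapped pairs therefore stay aligned with the calling RDD's existing partitions instead of being placed by the hash of the whole tuple.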

6.4.join

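Function:

Returns an RDD containing all pairs of elements with matching keys in this and other. Each pair is returned as a (k, (v1, v2)) tuple.

Partitioner:

If a partitioner is passed, that partitioner is used.
If a partition count is passed, a HashPartitioner with that many partitions is built and used.
If neither is passed, the default partitioner is used.

How it works:

cogroup:
- Groups the two RDDs by key; the result is a single RDD with one tuple (k, (vs1, vs2)) per key, where vs1 and vs2 are the value lists from this and other.
flatMapValues:
- Takes the (vs1, vs2) value of each cogrouped tuple;
- Iterates over vs1 and vs2 and yields every (v1, v2) combination;
- The result is a new RDD of (k, (v1, v2)) tuples;

Notes:

Applies to pair RDDs (RDDs of key-value pairs).
cogroup itself keeps every key from either RDD; a key present in only one RDD drops out in flatMapValues, because the nested iteration yields nothing when either value list is empty.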

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
    
  //Uses the given partitioner
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>//pair: (values for the key in this, values for the key in other)
      //iterating pair._2 as well means nothing is emitted unless the right RDD also has values for the key
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
  
  //Uses the default partitioner
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
    
  //Uses a HashPartitioner built from the given partition count
  def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }
}
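
The for-comprehension in join can be run on a hand-written cogroup result to see which keys survive. This is a minimal local sketch: JoinSketch and innerJoinLocal are hypothetical names, and the input is assumed to already have the (k, (vs1, vs2)) shape that cogroup produces.

```scala
object JoinSketch {
  // Same nested iteration as in join: emit every (v, w) combination for a key.
  // If either list is empty, the comprehension yields nothing for that key.
  def innerJoinLocal[K, V, W](cogrouped: Seq[(K, (Seq[V], Seq[W]))]): Seq[(K, (V, W))] =
    cogrouped.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

Given the entries ("a", (Seq(1, 2), Seq("x"))), ("b", (Seq(3), Seq())) and ("c", (Seq(), Seq("y"))), only key "a" appears in the output, as ("a", (1, "x")) and ("a", (2, "x")). Keys "b" and "c" are present after cogroup but produce no pairs.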

6.5.leftOuterJoin

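Function:

Performs a left outer join of this and other.

Partitioner:

Partitioner selection is the same as for join.

How it works:

cogroup:
- Groups the two RDDs by key; the result is a single RDD with one tuple (k, (vs1, vs2)) per key, where vs1 and vs2 are the value lists from this and other.
flatMapValues:
- Takes the (vs1, vs2) value of each cogrouped tuple;
- Checks whether the right RDD has any value for the key; if not, the right side is represented as None;
- Otherwise iterates over vs1 and vs2 and yields every (v1, Some(v2)) combination;
- The result is a new RDD of (k, (v1, Option[v2])) tuples;

Notes:

Applies to pair RDDs (RDDs of key-value pairs).
The flatMapValues step treats the left RDD as the primary side:
- A key in the left RDD with no match on the right is emitted with None;
- A key in the right RDD with no match on the left is not emitted;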

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {

  //Uses the given partitioner
  def leftOuterJoin[W](
      other: RDD[(K, W)],
      partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {//no value on the right for this key: emit None
        pair._1.iterator.map(v => (v, None))
      } else {//both sides have values for this key
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
      }
    }
  }  
   
  //Uses the default partitioner
  def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = self.withScope {
    leftOuterJoin(other, defaultPartitioner(self, other))
  }
    
  //Uses a HashPartitioner built from the given partition count
  def leftOuterJoin[W](
      other: RDD[(K, W)],
      numPartitions: Int): RDD[(K, (V, Option[W]))] = self.withScope {
    leftOuterJoin(other, new HashPartitioner(numPartitions))
  }
}
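
The branch on an empty right-hand list is what separates leftOuterJoin from join. The sketch below replays that branch on a hand-written cogroup result. It is a local illustration only: LeftOuterJoinSketch and leftOuterJoinLocal are hypothetical names, and the input is assumed to already have the (k, (vs1, vs2)) shape that cogroup produces.

```scala
object LeftOuterJoinSketch {
  def leftOuterJoinLocal[K, V, W](
      cogrouped: Seq[(K, (Seq[V], Seq[W]))]): Seq[(K, (V, Option[W]))] =
    cogrouped.flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) {
        // No match on the right: keep every left value and mark the right side as None.
        vs.map(v => (k, (v, Option.empty[W])))
      } else {
        // Matches on both sides: emit every combination, wrapping the right value in Some.
        for (v <- vs; w <- ws) yield (k, (v, Some(w): Option[W]))
      }
    }
}
```

Given the entries ("a", (Seq(1, 2), Seq("x"))), ("b", (Seq(3), Seq())) and ("c", (Seq(), Seq("y"))), the output contains ("a", (1, Some("x"))), ("a", (2, Some("x"))) and ("b", (3, None)). Key "c" is dropped, because its left list is empty and mapping over it produces nothing.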