1. Intersection (intersection)
1.1 Source code
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did. // the intersection result is deduplicated
*
* @note This method performs a shuffle internally. // i.e. it is a shuffle operator
*/
// Both RDDs must have the same element type T, which is also the element type of the returned RDD
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
    .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
    .keys
}
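Source analysis:
- thisRDD.intersection(otherRDD): computes the intersection of thisRDD and otherRDD. The result contains no duplicate elements, even if an element appears multiple times in both RDDs;
- intersection is a shuffle operator: it introduces a ShuffleDependency;
- internally it calls the cogroup operator: each element v is mapped to the pair (v, null), the two pair RDDs are cogrouped by key, and only the keys whose groups are non-empty on both sides are kept;
- Note: operations that combine two RDDs by working on records grouped under the same key are typically built on cogroup(otherDataset, [numTasks]). intersection (shown here) and join are both implemented this way. Treat this as a common pattern rather than a strict rule.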
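The cogroup-based logic described above can be reproduced on ordinary Scala collections to see why the result is deduplicated. This is only a minimal local sketch, not Spark code: `cogroupLocal` and `intersectLocal` are hypothetical helpers written for illustration, and they mimic only the grouping semantics of cogroup, not its partitioning or shuffle.

```scala
// Local simulation of the cogroup-based intersection using plain Scala collections.
// cogroupLocal and intersectLocal are hypothetical helpers, not part of the Spark API.
object IntersectionSketch {

  // For every key that appears on either side, collect the values from both sides.
  // This imitates the result shape of RDD.cogroup: key -> (left values, right values).
  def cogroupLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftVals = left.filter(_._1 == k).map(_._2)
      val rightVals = right.filter(_._1 == k).map(_._2)
      (k, (leftVals, rightVals))
    }.toMap
  }

  // Same three steps as the Spark source: map to (v, null), cogroup, keep keys present on both sides.
  def intersectLocal[T](a: Seq[T], b: Seq[T]): Set[T] =
    cogroupLocal(a.map(v => (v, null)), b.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
      .toSet

  def main(args: Array[String]): Unit = {
    val list1 = List(1, 2, 3, 4, 5, 6, 7, 7, 20)
    val list2 = List(4, 5, 6, 7, 8, 9, 10)
    // 7 occurs twice in list1 but is a single key after grouping, so it appears once.
    println(intersectLocal(list1, list2).toList.sorted) // prints List(4, 5, 6, 7)
  }
}
```

Because each distinct element becomes exactly one key after grouping, duplicates disappear as a side effect of taking the keys; no separate distinct step is needed.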
1.2 Code example:
val list1 = List(1,2,3,4,5,6,7,7,20)
val list2 = List(4,5,6,7,8,9,10)
val rdd1: RDD[Int] = sc.parallelize(list1, 3) // 3 is the number of partitions; if omitted, the default comes from sc.defaultParallelism (2 here)
val rdd2: RDD[Int] = sc.parallelize(list2)
// Intersection: rdd1 ∩ rdd2
rdd1.intersection(rdd2).foreach(prin