note 1
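These are notes from my early days of learning Spark, shared as-is without much rigorous derivation or verification. If you spot a mistake, please point it out.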
note 2
All posts in this operator series use the same test data; the complete data set is in the first post of the series, Spark-Operator-Map.
note 3
Because of my work environment, I develop in a mix of Java and Scala; please keep that in mind.
note 4
Code versions
- Spark 2.2
- Scala 2.11
- Source code
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   *
   * @param partitioner Partitioner to use for the resulting RDD
   */
  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did. Performs a hash partition across the cluster
   *
   * @note This method performs a shuffle internally.
   *
   * @param numPartitions How many partitions to use in the resulting RDD
   */
  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
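To make the map -> cogroup -> filter -> keys chain in the source above easier to follow, here is a minimal plain-Java imitation of the same idea. It uses no Spark at all; the class name `IntersectionSketch` and the helper `Groups` are made up for illustration, and a `LinkedHashMap` stands in for the shuffle-based grouping that Spark actually performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of map -> cogroup -> filter -> keys; no Spark involved.
class IntersectionSketch {

    // Holds the values grouped under one key from the left and right inputs.
    static final class Groups {
        final List<Object> left = new ArrayList<>();
        final List<Object> right = new ArrayList<>();
    }

    static <T> List<T> intersection(List<T> left, List<T> right) {
        // "map" + "cogroup": one entry per distinct key, collecting the
        // placeholder (null) values contributed by each side.
        Map<T, Groups> cogrouped = new LinkedHashMap<>();
        for (T v : left) {
            cogrouped.computeIfAbsent(v, k -> new Groups()).left.add(null);
        }
        for (T v : right) {
            cogrouped.computeIfAbsent(v, k -> new Groups()).right.add(null);
        }
        // "filter" + "keys": keep only keys whose groups are non-empty on both sides.
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Groups> e : cogrouped.entrySet()) {
            if (!e.getValue().left.isEmpty() && !e.getValue().right.isEmpty()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("3,h,r,x", "3,h,r,x", "1,a,c,b");
        List<String> b = Arrays.asList("3,h,r,x", "7,j,s,b");
        System.out.println(intersection(a, b)); // prints [3,h,r,x]
    }
}
```

Because each distinct element becomes exactly one key, the duplicate "3,h,r,x" on the left side collapses into a single result, which matches the "will not contain any duplicate elements" remark in the Scaladoc.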
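Parameters
- An RDD
- An optional Partitioner [if one of the RDDs is already partitioned, the existing partitioning is preferred (unverified)]
- Spark provides several Partitioners
  - Partitioner (parent class of the two below)
  - HashPartitioner (default)
  - RangePartitioner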
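As a rough illustration of what hash partitioning means for the Partitioner parameter, the sketch below assigns a key to a partition by taking a non-negative modulo of its hashCode. This is my own plain-Java approximation of the idea, not Spark's class; the name `HashPartitionSketch` is hypothetical, and sending null keys to partition 0 is an assumption of this sketch.

```java
// Plain-Java sketch of hash partitioning; it mirrors the idea, not Spark's own class.
class HashPartitionSketch {

    // Maps a key to a partition index in the range [0, numPartitions).
    static int getPartition(Object key, int numPartitions) {
        if (key == null) {
            return 0; // assumption of this sketch: null keys go to partition 0
        }
        int raw = key.hashCode() % numPartitions;
        // Java's % can return a negative value for negative hash codes,
        // so shift the result back into the valid range.
        return raw < 0 ? raw + numPartitions : raw;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (String line : new String[] {"3,h,r,x", "4,6,s,b", "5,h,d,o"}) {
            System.out.println(line + " -> partition " + getPartition(line, numPartitions));
        }
    }
}
```

The point of the scheme is that equal keys from both RDDs always map to the same partition index, so after the shuffle each partition can decide locally which keys appear on both sides.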
The two raw data files
--------sparkbasic.txt
1,a,c,b
2,w,gd,h
3,h,r,x
4,6,s,b
5,h,d,o
6,q,w,e
--------sparkbasic3.txt
3,h,r,x
4,6,s,b
5,h,d,o
6,q,w,e
7,j,s,b
8,h,m,o
9,q,w,c
Test program
import org.apache.spark.sql.SparkSession

object TestIntersection {
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")
  // Each RDD element is one line of the corresponding text file.
  val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
  val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")

  def main(args: Array[String]): Unit = {
    // Keep only the lines that appear in both files.
    val rdd3 = rdd.intersection(rdd2)
    rdd3.foreach(println)
  }
}
Output
6,q,w,e
4,6,s,b
5,h,d,o
3,h,r,x
- Testing with objects
// User is the same class used in the distinct & union tests
import org.apache.spark.sql.SparkSession

object TestIntersection {
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")
  val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
  val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")

  def main(args: Array[String]): Unit = {
    // Wrap every line in a User object before intersecting.
    val rdd3 = rdd.map(r => new User(r))
    val rdd4 = rdd2.map(r => new User(r))
    val rdd5 = rdd3.intersection(rdd4)
    rdd5.map(_.getName()).foreach(println)
  }
}
Printed output
here equals
here equals
here equals
here equals
6,q,w,e
4,6,s,b
5,h,d,o
3,h,r,x
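- Conclusion: intersection also relies on the equals method to decide whether two objects are the same element. The four "here equals" lines line up with the four records shared by both files. Since the grouping is hash-based, the class also needs a hashCode that is consistent with equals.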
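The role of equals can be reproduced without Spark. The sketch below is only an approximation under my own assumptions: `UserSketch` is a hypothetical stand-in for the series' User class (whose real code is not shown in this post), and a `HashSet` stands in for the hash-based grouping done during the shuffle. It counts how often equals is invoked while looking for shared records.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the series' User class; the real one is not shown here.
class UserSketch {
    static int equalsCalls = 0; // counts equals invocations for the demo

    private final String name;

    UserSketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public boolean equals(Object o) {
        equalsCalls++;
        return o instanceof UserSketch && ((UserSketch) o).name.equals(name);
    }

    // Must agree with equals, otherwise equal objects land in different hash buckets.
    @Override
    public int hashCode() {
        return name.hashCode();
    }

    // Counts how many elements of right have an equal element in left.
    static int countShared(String[] left, String[] right) {
        Set<UserSketch> seen = new HashSet<>();
        for (String s : left) {
            seen.add(new UserSketch(s));
        }
        int shared = 0;
        for (String s : right) {
            // A lookup with a distinct but equal object goes through hashCode, then equals.
            if (seen.contains(new UserSketch(s))) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        String[] left = {"1,a,c,b", "3,h,r,x", "4,6,s,b"};
        String[] right = {"3,h,r,x", "4,6,s,b", "7,j,s,b"};
        System.out.println("shared = " + countShared(left, right));
        System.out.println("equals was called: " + (equalsCalls > 0));
    }
}
```

If hashCode were left at the default identity-based implementation, two UserSketch objects built from the same line would almost never meet in the same bucket, and equals would not get a chance to match them.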