Spark-Operator-intersection

MissionLee

Category: Spark. Tags: scala, spark

Article link: https://blog.csdn.net/qq_26246063/article/details/79819042


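note 1
These are notes from my early days of learning Spark, shared as-is. I have not verified the reasoning rigorously, so corrections are welcome.
note 2
The whole operator series uses the same test data; the complete data set is in the first post of the series, Spark-Operator-Map.
note 3
Because of my work environment, I develop in a mix of Java and Scala. Please keep that in mind.
note 4
Code versions
    - Spark 2.2
    - Scala 2.11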

Source code

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   *
   * @param partitioner Partitioner to use for the resulting RDD
   */
  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.  Performs a hash partition across the cluster
   *
   * @note This method performs a shuffle internally.
   *
   * @param numPartitions How many partitions to use in the resulting RDD
   */
  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
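
To make the steps in the source above concrete, here is a minimal sketch that models them with plain Scala collections instead of RDDs. The names IntersectionSketch and intersectLocal are made up for illustration. This is not how Spark executes the job (there is no shuffle and there are no partitions); it only mirrors the same map, cogroup, filter, keys logic and shows why duplicates disappear.

```scala
object IntersectionSketch {
  // Local model of RDD.intersection on plain collections (illustrative, not Spark code).
  def intersectLocal[T](left: Seq[T], right: Seq[T]): Set[T] = {
    // Step 1: group each side by element, mirroring map(v => (v, null)) followed by cogroup.
    val leftGroups: Map[T, Seq[T]] = left.groupBy(identity)
    val rightGroups: Map[T, Seq[T]] = right.groupBy(identity)

    // Step 2: cogroup yields every key seen on either side, paired with both groups.
    val cogrouped: Map[T, (Seq[T], Seq[T])] =
      (leftGroups.keySet ++ rightGroups.keySet).map { k =>
        (k, (leftGroups.getOrElse(k, Seq.empty[T]), rightGroups.getOrElse(k, Seq.empty[T])))
      }.toMap

    // Step 3: keep keys whose groups are non-empty on both sides, then take the keys.
    // Each key appears once, so duplicates in the inputs are dropped.
    cogrouped.filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }.keySet
  }

  def main(args: Array[String]): Unit = {
    val result = intersectLocal(Seq("a", "b", "b", "c"), Seq("b", "b", "c", "d"))
    println(result.toSeq.sorted.mkString(","))  // prints b,c
  }
}
```

Because the result is built from the keys of the cogrouped map, each common element shows up exactly once, which matches the "will not contain any duplicate elements" remark in the scaladoc.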

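Parameters

One RDD (other).
An optional Partitioner, or a numPartitions count that is wrapped in a HashPartitioner. When neither is given, cogroup picks a default partitioner. Note that intersection first maps every element to (v, null), and map does not preserve an existing partitioner, so a partitioner already set on an input RDD is not carried over.
- Spark provides several Partitioners:
- Partitioner (the abstract parent class of the two below)
- HashPartitioner (the default)
- RangePartitioner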
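
As a rough illustration of the two overloads that control partitioning, the sketch below calls intersection once with a partition count and once with an explicit HashPartitioner, then reports how many partitions each result has. The object name IntersectionPartitionSketch, the helper partitionCounts, and the sample numbers are assumptions made for this example; only intersection, HashPartitioner, parallelize and getNumPartitions come from Spark itself.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.sql.SparkSession

object IntersectionPartitionSketch {
  // Returns the partition counts of the two intersection results (hypothetical helper).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val left = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)
    val right = sc.parallelize(Seq(4, 5, 6, 7), 2)

    // Overload taking numPartitions: wraps the count in a HashPartitioner.
    val byCount = left.intersection(right, 4)
    // Overload taking an explicit Partitioner.
    val byPartitioner = left.intersection(right, new HashPartitioner(2))

    (byCount.getNumPartitions, byPartitioner.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("intersection-partitions").getOrCreate()
    val (a, b) = partitionCounts(spark.sparkContext)
    println(s"numPartitions overload: $a, partitioner overload: $b")  // prints 4 and 2
    spark.stop()
  }
}
```

The partition count of the result follows the partitioner handed to cogroup; the final .keys step is a plain map, so it keeps the number of partitions even though it does not keep the partitioner itself.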

The two raw data files

--------sparkbasic.txt
1,a,c,b
2,w,gd,h
3,h,r,x
4,6,s,b
5,h,d,o
6,q,w,e
--------sparkbasic3.txt
3,h,r,x
4,6,s,b
5,h,d,o
6,q,w,e
7,j,s,b
8,h,m,o
9,q,w,c

Test program

import org.apache.spark.sql.SparkSession

object TestIntersection {
  // Local session; only errors are logged.
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")
  // Each line of a text file becomes one String element of the RDD.
  val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
  val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")

  def main(args: Array[String]): Unit = {
    // Keep only the lines that appear in both files.
    val rdd3 = rdd.intersection(rdd2)
    rdd3.foreach(println)
  }

}

Output

6,q,w,e
4,6,s,b
5,h,d,o
3,h,r,x

Testing with objects

// User is the class used in the distinct & union tests (its import is omitted here)
import org.apache.spark.sql.SparkSession

object TestIntersection {
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")
  val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
  val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")

  def main(args: Array[String]): Unit = {
    // Wrap every line in a User object before intersecting.
    val rdd3 = rdd.map(r => new User(r))
    val rdd4 = rdd2.map(r => new User(r))
    val rdd5 = rdd3.intersection(rdd4)
    rdd5.map(_.getName()).foreach(println)
  }

}

Printed output

 here equals 
 here equals 
 here equals 
 here equals 
6,q,w,e
4,6,s,b
5,h,d,o
3,h,r,x

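Conclusion: intersection also calls the equals method to decide whether two elements are the same. Because the elements are used as keys in a hash-partitioned cogroup, hashCode has to be consistent with equals as well.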
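
The User class itself is not shown in this post, so the following sketch uses two hypothetical stand-ins, PlainUser and KeyedUser, to illustrate the point about equality. PlainUser inherits reference equality from AnyRef, while KeyedUser overrides equals and hashCode based on its name. The class names, the helper commonCounts and the sample strings are assumptions for this example only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical class without equals/hashCode: instances compare by reference.
class PlainUser(val name: String) extends Serializable

// Hypothetical class with value-based equals and a matching hashCode.
class KeyedUser(val name: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: KeyedUser => this.name == that.name
    case _ => false
  }
  override def hashCode(): Int = name.hashCode
}

object IntersectionEqualitySketch {
  // Returns how many common elements intersection finds for each class (hypothetical helper).
  def commonCounts(sc: SparkContext): (Long, Long) = {
    val left = sc.parallelize(Seq("3,h,r,x", "4,6,s,b"))
    val right = sc.parallelize(Seq("4,6,s,b", "7,j,s,b"))

    // Distinct PlainUser instances are never equal, so nothing is shared.
    val plain = left.map(new PlainUser(_)).intersection(right.map(new PlainUser(_)))
    // KeyedUser instances with the same name are equal, so "4,6,s,b" is shared.
    val keyed = left.map(new KeyedUser(_)).intersection(right.map(new KeyedUser(_)))

    (plain.count(), keyed.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("intersection-equality").getOrCreate()
    val (plainCount, keyedCount) = commonCounts(spark.sparkContext)
    println(s"PlainUser: $plainCount, KeyedUser: $keyedCount")  // prints PlainUser: 0, KeyedUser: 1
    spark.stop()
  }
}
```

A Scala case class would give the same value-based behaviour as KeyedUser without writing equals and hashCode by hand.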
