RDD Operations on Two Value-Type RDDs

我阳某人的博客

Category: demo  Tags: spark

Article link: https://blog.csdn.net/weixin_43278942/article/details/88264442

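Spark: operations on two value-type RDDs
 union(otherDataset) example
1. Purpose: returns a new RDD containing the union of the source RDD and the argument RDD. Duplicates are kept.
2. Task: create two RDDs and compute their union.

union() source: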
 /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = withScope {
    sc.union(this, other)
  }
  
  demo:
  val rdd1 = sc.parallelize(1 to 5)
  val rdd2 = sc.parallelize(4 to 8)
  val rdd3 = rdd1.union(rdd2)
  rdd3.collect()

  result:
  Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
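
The source comment above notes that union keeps identical elements and suggests `.distinct()` to remove them. A minimal sketch of that behaviour follows, assuming a local-mode SparkContext; the object name `UnionSketch`, the app name, and the explicit partition counts are illustrative choices and do not come from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object UnionSketch {
  // Returns (union with duplicates, de-duplicated union, partition count of the union).
  def run(sc: SparkContext): (Seq[Int], Seq[Int], Int) = {
    val rdd1 = sc.parallelize(1 to 5, 2) // 2 partitions (illustrative)
    val rdd2 = sc.parallelize(4 to 8, 3) // 3 partitions (illustrative)
    val unioned = rdd1.union(rdd2)
    // distinct() shuffles, so sort on the driver to get a stable order
    val deduped = unioned.distinct().collect().sorted.toSeq
    (unioned.collect().toSeq, deduped, unioned.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (all, deduped, parts) = run(sc)
      println(all.mkString(", "))     // 4 and 5 appear twice
      println(deduped.mkString(", ")) // each value once
      println(parts)                  // partitions of both inputs added together
    } finally sc.stop()
  }
}
```

Neither input RDD has a partitioner here, so union concatenates their partitions without a shuffle; the result therefore has 2 + 3 = 5 partitions.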

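 subtract(otherDataset) example
1. Purpose: computes the difference. It returns the elements of the source RDD that do not appear in the argument RDD.
2. Task: create two RDDs and compute the first RDD minus the second.

Source: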
/**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }
  
  demo:
  val rdd1 = sc.parallelize(1 to 5)
  val rdd2 = sc.parallelize(4 to 6)
  val rdd3 = rdd1.subtract(rdd2)
  
  result:
   Array[Int] = Array(1, 2, 3)
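
Two properties of subtract are easy to miss: it is not symmetric, and duplicates in the source RDD survive if their value is absent from the other RDD. The sketch below illustrates both, assuming a local-mode SparkContext; `SubtractSketch` and the sample data with duplicates are illustrative and not from the original post. Results are sorted on the driver because subtract shuffles and does not guarantee order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object SubtractSketch {
  // Returns (rdd1 - rdd2, rdd2 - rdd1, difference when the source holds duplicates).
  def run(sc: SparkContext): (Seq[Int], Seq[Int], Seq[Int]) = {
    val rdd1 = sc.parallelize(1 to 5)
    val rdd2 = sc.parallelize(4 to 6)
    val leftMinusRight = rdd1.subtract(rdd2).collect().sorted.toSeq
    val rightMinusLeft = rdd2.subtract(rdd1).collect().sorted.toSeq
    // Illustrative data: the value 1 occurs twice in the source RDD
    val dupSource = sc.parallelize(Seq(1, 1, 2, 3))
    val dupResult = dupSource.subtract(sc.parallelize(Seq(3))).collect().sorted.toSeq
    (leftMinusRight, rightMinusLeft, dupResult)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("subtract-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (lr, rl, dup) = run(sc)
      println(lr.mkString(", "))  // elements only in rdd1
      println(rl.mkString(", "))  // elements only in rdd2
      println(dup.mkString(", ")) // duplicates of 1 are kept
    } finally sc.stop()
  }
}
```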
   
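 intersection(otherDataset) example
1. Purpose: returns a new RDD containing the intersection of the source RDD and the argument RDD. The result contains no duplicates.
2. Task: create two RDDs and compute their intersection.
 Source: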
   /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  demo:
  val rdd1 = sc.parallelize(1 to 6)
  val rdd2 = sc.parallelize(4 to 7)
  val rdd3 = rdd1.intersection(rdd2)
  
  result:
   Array[Int] = Array(4, 5, 6)
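
The source comment states that the output of intersection contains no duplicates even if the inputs did. A small sketch of that behaviour, assuming a local-mode SparkContext (`IntersectionSketch` and the sample data are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object IntersectionSketch {
  // Returns (a intersect b, b intersect a), each sorted on the driver.
  def run(sc: SparkContext): (Seq[Int], Seq[Int]) = {
    val a = sc.parallelize(Seq(4, 4, 5, 6)) // 4 appears twice
    val b = sc.parallelize(Seq(4, 5, 5, 7)) // 5 appears twice
    // intersection shuffles internally, so the collected order is not guaranteed
    val ab = a.intersection(b).collect().sorted.toSeq
    val ba = b.intersection(a).collect().sorted.toSeq
    (ab, ba)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("intersection-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (ab, ba) = run(sc)
      println(ab.mkString(", ")) // common values, each once
      println(ba.mkString(", ")) // same set with the operands swapped
    } finally sc.stop()
  }
}
```

Unlike union, intersection always pays for a shuffle (the cogroup in the source above), which is worth keeping in mind for large inputs.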

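  cartesian(otherDataset) example
1. Purpose: computes the Cartesian product. Avoid it where possible, because the result holds one pair for every combination of elements from the two RDDs.
2. Task: create two RDDs and compute their Cartesian product.
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(6 to 10)
val rdd3 = rdd1.cartesian(rdd2)
result: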
Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
 (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))


Source:
/**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
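
To make the cost of cartesian concrete, the sketch below counts the pairs and partitions of the product. It assumes a local-mode SparkContext; `CartesianSketch` and the explicit partition counts are illustrative and not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object CartesianSketch {
  // Returns (number of pairs, number of partitions) of the Cartesian product.
  def run(sc: SparkContext): (Long, Int) = {
    val left = sc.parallelize(1 to 5, 2)   // 5 elements, 2 partitions (illustrative)
    val right = sc.parallelize(6 to 10, 3) // 5 elements, 3 partitions (illustrative)
    val product = left.cartesian(right)
    (product.count(), product.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cartesian-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (pairs, parts) = run(sc)
      println(pairs) // 5 * 5 pairs
      println(parts) // 2 * 3 partitions
    } finally sc.stop()
  }
}
```

Both the element count and the partition count multiply, which is why the operation gets expensive quickly as the inputs grow.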
  
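 zip(otherDataset) example
1. Purpose: combines two RDDs into one RDD of key/value pairs. It assumes both RDDs have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
2. Task: create two RDDs and combine them into a single (k, v) RDD.
Source: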
/**
   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
   * partitions* and the *same number of elements in each partition* (e.g. one was made through
   * a map on the other).
   */
  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }
  
 demo:
 val rdd1 = sc.parallelize(Array(1, 2, "big"))
 val rdd2 = sc.parallelize(Array("one", "two", "cat"))
 val rdd3 = rdd1.zip(rdd2)
  
  result: 
  Array[(Any, String)] = Array((1,one), (2,two), (big,cat))
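
The source comment suggests the safest way to satisfy zip's requirements: derive one RDD from the other with map. The sketch below shows that case and a failing one where the element counts differ. It assumes a local-mode SparkContext; `ZipSketch`, the sample data, and the use of `scala.util.Try` to observe the failure are illustrative choices, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Hypothetical helper object, for illustration only.
object ZipSketch {
  // Returns (pairs from a valid zip, whether the mismatched zip failed).
  def run(sc: SparkContext): (Seq[(Int, Int)], Boolean) = {
    val base = sc.parallelize(1 to 6, 2)
    // map keeps the partitioning and per-partition element counts of base
    val derived = base.map(_ * 10)
    val zipped = base.zip(derived).collect().toSeq

    // Same partition count but a different total size, so at least one
    // partition pair has unequal lengths
    val shorter = sc.parallelize(1 to 5, 2)
    // zip is lazy: the error only surfaces when an action such as collect() runs
    val mismatchFailed = Try(base.zip(shorter).collect()).isFailure
    (zipped, mismatchFailed)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (zipped, mismatchFailed) = run(sc)
      println(zipped.mkString(", "))
      println(mismatchFailed)
    } finally sc.stop()
  }
}
```

Because zip pairs elements by position within each partition, it needs no shuffle; the price is the strict requirement on partition layout.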
