RDD Double-Value (Two-RDD) Operations

Spark double-value (two-RDD) operations
 union(otherDataset) example
1. Purpose: returns a new RDD containing the union of the source RDD and the argument RDD
2. Task: create two RDDs and compute their union

union() source:
 /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = withScope {
    sc.union(this, other)
  }
  
  demo:
  val rdd1 = sc.parallelize(1 to 5)
  val rdd2 = sc.parallelize(4 to 8)
  val rdd3 = rdd1.union(rdd2)
result:
rdd3.collect
Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
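As the doc comment above notes, `union` keeps duplicate elements unless `.distinct()` is applied. A minimal plain-Scala sketch of the same semantics, using local collections so it runs without a SparkContext:

```scala
// Local-collection sketch of union semantics (no Spark needed):
// like RDD.union, the ++ operator keeps duplicates; .distinct removes them.
val a = (1 to 5).toArray
val b = (4 to 8).toArray

val combined = a ++ b            // 4 and 5 each appear twice
val deduped  = combined.distinct // duplicates eliminated

println(combined.mkString(", ")) // 1, 2, 3, 4, 5, 4, 5, 6, 7, 8
println(deduped.mkString(", "))  // 1, 2, 3, 4, 5, 6, 7, 8
```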

 subtract(otherDataset) example
1. Purpose: computes a set difference; elements that also appear in the other RDD are removed, and only the remaining elements of the source RDD are kept
2. Task: create two RDDs and compute the difference of the first RDD minus the second

Source:
/**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }
  
  demo:
  val rdd1 = sc.parallelize(1 to 5)
  val rdd2 = sc.parallelize(4 to 6)
  val rdd3 = rdd1.subtract(rdd2)
  
  result:
   Array[Int] = Array(1, 2, 3)
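Note that `subtract` is not symmetric: it keeps only what is unique to the RDD it is called on. A plain-Scala sketch of the same set-difference semantics on local collections (no SparkContext needed):

```scala
// Local-collection sketch of subtract semantics: keep the elements of
// the first collection that do not occur in the second.
val a = (1 to 5).toSet
val b = (4 to 6).toSet

val aMinusB = a -- b   // elements unique to a: 1, 2, 3
val bMinusA = b -- a   // elements unique to b: 6
```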
   
 intersection(otherDataset) example
1. Purpose: returns a new RDD containing the intersection of the source RDD and the argument RDD
2. Task: create two RDDs and compute their intersection
 Source:
   /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  demo:
  val rdd = sc.parallelize(1 to 6)
  val rdd2 = sc.parallelize(4 to 7)
  val rdd3 = rdd.intersection(rdd2)
  
  result:
   Array[Int] = Array(4, 5, 6)
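The source above implements intersection by mapping each value to a (v, null) pair, cogrouping the two sides, and keeping keys whose left and right groups are both non-empty. A plain-Scala sketch of that same cogroup trick on local collections (the 'L'/'R' tags are just a local stand-in for the two cogroup sides):

```scala
// Local-collection sketch of the cogroup-based intersection: group
// tagged values from both sides by key, then keep only keys that
// occur on both the left and the right side.
val left  = (1 to 6).toSeq
val right = (4 to 7).toSeq

val cogrouped = (left.map(v => (v, 'L')) ++ right.map(v => (v, 'R')))
  .groupBy(_._1)

val intersection = cogrouped.collect {
  case (k, tagged) if tagged.exists(_._2 == 'L') && tagged.exists(_._2 == 'R') => k
}.toSeq.sorted   // 4, 5, 6
```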

  cartesian(otherDataset) example
1. Purpose: Cartesian product (avoid when possible)
2. Task: create two RDDs and compute their Cartesian product
demo:
val rdd = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(6 to 10)
val rdd3 = rdd.cartesian(rdd2)
result:
Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
 (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))


Source:
/**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
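A plain-Scala sketch of the pairing that cartesian performs on local collections; the result has n * m elements, which is why the Cartesian product should be avoided on large RDDs:

```scala
// Local-collection sketch of cartesian: every pair (x, y) with x from
// the first collection and y from the second, so the size is n * m.
val a = 1 to 5
val b = 6 to 10

val product = for (x <- a; y <- b) yield (x, y)
// product.size == 25, i.e. a.size * b.size
```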
  
 zip(otherDataset) example
1. Purpose: combines two RDDs into a key/value RDD. The two RDDs are assumed to have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
2. Task: create two RDDs and zip them into a single (k, v) RDD
Source:
/**
   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
   * partitions* and the *same number of elements in each partition* (e.g. one was made through
   * a map on the other).
   */
  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }
  
 demo:
 val rdd1 = sc.parallelize(Array(1, 2, "big"))
 val rdd2 = sc.parallelize(Array("one", "two", "cat"))
 val rdd3 = rdd1.zip(rdd2)
  
  result: 
  Array[(Any, String)] = Array((1,one), (2,two), (big,cat))
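One point worth noting: the standard-library `zip` on local collections silently truncates to the shorter side, whereas `RDD.zip` (per the source above) throws when the element counts differ. A small plain-Scala sketch that mimics Spark's strict behaviour; `strictZip` is a hypothetical helper for illustration, not part of any API:

```scala
// strictZip mimics RDD.zip's strictness on local sequences: it fails
// when the lengths differ instead of silently truncating.
def strictZip[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] = {
  require(xs.length == ys.length,
    "Can only zip sequences with the same number of elements")
  xs.zip(ys)
}

val pairs = strictZip(Seq(1, 2, "big"), Seq("one", "two", "cat"))
// pairs: List((1,one), (2,two), (big,cat))
```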

 
