03 Spark: RDD Transformation Operators, Double-Value Type

RDD transformation operators of the double-Value type, i.e. operators that take a second RDD as an argument.

1. union(otherDataSet)

  1. Purpose: computes the union of the source RDD and the argument RDD and returns a new RDD. Note that duplicates are kept.

  2. Example:

    scala> val rdd1 = sc.makeRDD(1 to 6)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[95] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(4 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[96] at makeRDD at <console>:24
    
    scala> val rdd = rdd1.union(rdd2)
    rdd: org.apache.spark.rdd.RDD[Int] = UnionRDD[97] at union at <console>:28
    
    scala> rdd.collect
    res53: Array[Int] = Array(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10)
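Unlike SQL's `UNION`, `union` on RDDs does not deduplicate: 4, 5 and 6 appear twice in the output above; follow it with `distinct` if set semantics are needed. A plain-Python sketch of the same behavior (list concatenation, not Spark itself):

```python
rdd1 = list(range(1, 7))    # 1 to 6
rdd2 = list(range(4, 11))   # 4 to 10

# Analogue of rdd1.union(rdd2): simple concatenation, duplicates kept.
union = rdd1 + rdd2
print(union)     # [1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10]

# For set semantics, deduplicate afterwards (like .distinct in Spark).
distinct = sorted(set(union))
print(distinct)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```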
    

2. subtract(otherDataSet)

  1. Purpose: computes the set difference: removes from the source RDD the elements that also appear in otherDataSet, keeping only the elements unique to the source RDD.

  2. Example:

    scala> val rdd1 = sc.makeRDD(1 to 6)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[98] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(4 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[99] at makeRDD at <console>:24
    
    scala> val rdd = rdd1.subtract(rdd2)
    rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[103] at subtract at <console>:28
    
    scala> rdd.collect
    res54: Array[Int] = Array(1, 2, 3)
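`subtract` keeps only the elements of the source RDD that do not occur in the other RDD; elements that exist only in the other RDD (7 to 10 here) do not appear in the result. A plain-Python analogue (element order in a real RDD depends on partitioning):

```python
rdd1 = list(range(1, 7))   # 1 to 6
rdd2 = list(range(4, 11))  # 4 to 10

# Analogue of rdd1.subtract(rdd2): keep x from rdd1 that is not in rdd2.
remove = set(rdd2)
difference = [x for x in rdd1 if x not in remove]
print(difference)  # [1, 2, 3]
```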
    

3. intersection(otherDataSet)

  1. Purpose: computes the intersection of the source RDD and the argument RDD and returns a new RDD.

  2. Example:

    scala> val rdd1 = sc.makeRDD(1 to 6)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(4 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at makeRDD at <console>:24
    
    scala> val rdd = rdd1.intersection(rdd2)
    rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[111] at intersection at <console>:28
    
    scala> rdd.collect
    res55: Array[Int] = Array(4, 5, 6)
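`intersection` returns the distinct elements present in both RDDs; it involves a shuffle internally (note the RDD id jumping from 105 to 111 in the transcript above, reflecting intermediate RDDs). A plain-Python analogue of the semantics:

```python
rdd1 = list(range(1, 7))   # 1 to 6
rdd2 = list(range(4, 11))  # 4 to 10

# Analogue of rdd1.intersection(rdd2): distinct elements common to both.
common = sorted(set(rdd1) & set(rdd2))
print(common)  # [4, 5, 6]
```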
    

4. cartesian(otherDataSet)

  1. Purpose: computes the Cartesian product of the two RDDs. Avoid it whenever possible, since the result contains |rdd1| × |rdd2| elements.

  2. Example:

    scala> val rdd1 = sc.makeRDD(1 to 6)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[112] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(4 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[113] at makeRDD at <console>:24
    
    scala> val rdd = rdd1.cartesian(rdd2)
    rdd: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[114] at cartesian at <console>:28
    
    scala> rdd.collect
    res56: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (1,7), (1,8), (1,9), (1,10), (2,4), (3,4), (2,5), (2,6), (3,5), (3,6), (2,7), (2,8), (3,7), (3,8), (2,9), (2,10), (3,9), (3,10), (4,4), (4,5), (4,6), (4,7), (4,8), (4,9), (4,10), (5,4), (6,4), (5,5), (5,6), (6,5), (6,6), (5,7), (5,8), (6,7), (6,8), (5,9), (5,10), (6,9), (6,10))
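The result size is the product of the two input sizes (6 × 7 = 42 pairs above), which is why `cartesian` is dangerous on large datasets. A plain-Python sketch of the pairing:

```python
from itertools import product

rdd1 = list(range(1, 7))   # 6 elements
rdd2 = list(range(4, 11))  # 7 elements

# Analogue of rdd1.cartesian(rdd2): every (a, b) pair.
pairs = list(product(rdd1, rdd2))
print(len(pairs))  # 42 == 6 * 7
```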
    

5. zip(otherDataSet)

  1. Purpose: zip operation, pairing the two RDDs element-wise into an RDD of tuples. In Spark, the two RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.

  2. Example:

    scala> val rdd1 = sc.makeRDD(1 to 5)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[118] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(6 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[119] at makeRDD at <console>:24
    
    scala> val rdd = rdd1.zip(rdd2)
    rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[120] at zip at <console>:28
    
    scala> rdd.collect
    res58: Array[(Int, Int)] = Array((1,6), (2,7), (3,8), (4,9), (5,10))
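A plain-Python analogue of the pairing. Python's built-in `zip` silently truncates to the shorter input instead of throwing, so the length check that Spark performs is made explicit here:

```python
rdd1 = list(range(1, 6))   # 1 to 5
rdd2 = list(range(6, 11))  # 6 to 10

# Analogue of rdd1.zip(rdd2). Spark throws on a length mismatch,
# while Python's zip would silently truncate, so check explicitly.
if len(rdd1) != len(rdd2):
    raise ValueError("Can only zip RDDs with the same number of elements")
pairs = list(zip(rdd1, rdd2))
print(pairs)  # [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
```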
    