RDD Transformation Operators: Double-Value Types (Two-RDD Operators)
1. union(otherDataSet)
- Purpose: computes the union of the source RDD and the argument RDD, returning a new RDD.
- Example:

```scala
scala> val rdd1 = sc.makeRDD(1 to 6)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[95] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[96] at makeRDD at <console>:24

scala> val rdd = rdd1.union(rdd2)
rdd: org.apache.spark.rdd.RDD[Int] = UnionRDD[97] at union at <console>:28

scala> rdd.collect
res53: Array[Int] = Array(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10)
```
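Note from the output that `union` keeps duplicates (4, 5, and 6 each appear twice) rather than applying set semantics; chain `.distinct` afterwards if deduplication is wanted. A minimal plain-Scala sketch of the same behavior, using `Seq` concatenation in place of RDDs:

```scala
// Plain-collection analogue of rdd1.union(rdd2): concatenation, duplicates kept.
val a = (1 to 6).toSeq
val b = (4 to 10).toSeq

val unioned = a ++ b
println(unioned.mkString(", "))   // 1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10

// Analogue of rdd1.union(rdd2).distinct: set semantics, each value once.
val deduped = unioned.distinct
println(deduped.mkString(", "))   // 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```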
2. subtract(otherDataSet)
- Purpose: computes the difference: removes from the source RDD the elements it has in common with otherDataSet, returning a new RDD.
- Example:

```scala
scala> val rdd1 = sc.makeRDD(1 to 6)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[98] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[99] at makeRDD at <console>:24

scala> val rdd = rdd1.subtract(rdd2)
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[103] at subtract at <console>:28

scala> rdd.collect
res54: Array[Int] = Array(1, 2, 3)
```
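The same semantics can be sketched with plain Scala collections, as a rough analogue rather than the actual Spark implementation: keep only the elements of the first collection that do not occur in the second.

```scala
val a = (1 to 6).toSeq
val b = (4 to 10).toSet

// Analogue of rdd1.subtract(rdd2): drop every element of a that appears in b.
val diff = a.filterNot(b.contains)
println(diff.mkString(", "))   // 1, 2, 3
```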
3. intersection(otherDataSet)
- Purpose: computes the intersection of the source RDD and the argument RDD, returning a new RDD.
- Example:

```scala
scala> val rdd1 = sc.makeRDD(1 to 6)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at makeRDD at <console>:24

scala> val rdd = rdd1.intersection(rdd2)
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[111] at intersection at <console>:28

scala> rdd.collect
res55: Array[Int] = Array(4, 5, 6)
```
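Unlike `union`, `intersection` removes duplicates from its result (and, because matching elements must be brought together, it involves a shuffle). A plain-Scala sketch of the semantics using sets — an analogue, not the Spark implementation:

```scala
// Input with a duplicate, to show the deduplicating behavior.
val a = Seq(1, 2, 4, 4, 5, 6)
val b = (4 to 10).toSeq

// Analogue of rdd1.intersection(rdd2): values present in both, each once.
val inter = a.toSet.intersect(b.toSet)
println(inter.toSeq.sorted.mkString(", "))   // 4, 5, 6
```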
4. cartesian(otherDataSet)
- Purpose: computes the Cartesian product of two RDDs. Avoid it whenever possible.
- Example:

```scala
scala> val rdd1 = sc.makeRDD(1 to 6)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[112] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[113] at makeRDD at <console>:24

scala> val rdd = rdd1.cartesian(rdd2)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[114] at cartesian at <console>:28

scala> rdd.collect
res56: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (1,7), (1,8), (1,9), (1,10), (2,4), (3,4), (2,5), (2,6), (3,5), (3,6), (2,7), (2,8), (3,7), (3,8), (2,9), (2,10), (3,9), (3,10), (4,4), (4,5), (4,6), (4,7), (4,8), (4,9), (4,10), (5,4), (6,4), (5,5), (5,6), (6,5), (6,6), (5,7), (5,8), (6,7), (6,8), (5,9), (5,10), (6,9), (6,10))
```
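The reason to avoid `cartesian` is its output size: n elements crossed with m elements produce n × m pairs (6 × 7 = 42 above), which explodes quickly on real datasets. A plain-Scala sketch of the size behavior:

```scala
val a = (1 to 6).toSeq   // n = 6
val b = (4 to 10).toSeq  // m = 7

// Analogue of rdd1.cartesian(rdd2): every pairing of an element of a
// with an element of b.
val cart = for (x <- a; y <- b) yield (x, y)

println(cart.length)   // 42, matching the 42 pairs collected above
assert(cart.length == a.length * b.length)
```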
5. zip(otherDataSet)
- Purpose: zips two RDDs into an RDD of pairs. The two RDDs must contain the same number of elements, otherwise an exception is thrown. (Spark in fact requires the two RDDs to have the same number of partitions, with the same number of elements in each partition.)
- Example:

```scala
scala> val rdd1 = sc.makeRDD(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[118] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(6 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[119] at makeRDD at <console>:24

scala> val rdd = rdd1.zip(rdd2)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[120] at zip at <console>:28

scala> rdd.collect
res58: Array[(Int, Int)] = Array((1,6), (2,7), (3,8), (4,9), (5,10))
```
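A plain-Scala sketch of the pairing semantics. One caveat when reasoning by analogy: `Seq.zip` silently truncates to the shorter length, whereas `RDD.zip` throws an exception when the element counts differ.

```scala
val a = (1 to 5).toSeq
val b = (6 to 10).toSeq

// Analogue of rdd1.zip(rdd2): pair elements positionally.
val zipped = a.zip(b)
println(zipped.mkString(", "))   // (1,6), (2,7), (3,8), (4,9), (5,10)

// Seq.zip truncates to the shorter collection (length 3 here);
// RDD.zip would instead fail at runtime on mismatched element counts.
val truncated = (1 to 5).zip(6 to 8)
println(truncated.length)   // 3
```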