Spark Study Notes: Transformations (Part 4)

  • Basic transformations

  • Key-value transformations

 

Key-value transformations

  • cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

cogroup is similar to a full outer join in SQL: it returns the records from both RDDs grouped by key, and a key with no match on one side gets an empty collection for that side.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> var rdd3 = sc.makeRDD(Array(("A", "A"), ("E", "E")), 2)
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24

scala> rdd1.cogroup(rdd2).collect
res26: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d))), (A,(CompactBuffer(1),CompactBuffer(a))), (C,(CompactBuffer(3),CompactBuffer(c))))

scala> rdd1.cogroup(rdd2, rdd3).collect
res27: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())), (A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))), (C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())), (E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
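The grouping behavior above can be mimicked on plain Scala collections. Below is a minimal sketch that runs without a SparkContext; `cogroupLocal` is a hypothetical helper for illustration only, not Spark's implementation:

```scala
// Minimal sketch of cogroup semantics on local collections.
// cogroupLocal is a hypothetical helper, not Spark's API.
def cogroupLocal[K, V, W](left: Seq[(K, V)],
                          right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
  // Group each side's values by key.
  val l = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  val r = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
  // Emit every key from either side; a missing side becomes an empty buffer.
  (l.keySet ++ r.keySet).map { k =>
    (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
  }.toMap
}

val left  = Seq(("A", "1"), ("B", "2"), ("C", "3"))
val right = Seq(("A", "a"), ("C", "c"), ("D", "d"))
val grouped = cogroupLocal(left, right)
// "B" exists only on the left, "D" only on the right.
println(grouped("B")) // (List(2),List())
println(grouped("D")) // (List(),List(d))
```

As in the REPL output above, no key is dropped: unmatched keys simply carry an empty buffer on the side that lacks them.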
  • join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

  • join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

  • join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

  • fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]

  • fullOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], Option[W]))]

  • fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

  • rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

  • rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]

  • rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]

The join, fullOuterJoin, leftOuterJoin, and rightOuterJoin operations connect records of two RDD[(K, V)]s whose keys are equal, corresponding to SQL inner, full outer, left outer, and right outer joins respectively. Internally they are all implemented on top of cogroup.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> rdd1.join(rdd2).collect
res28: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

scala> rdd1.leftOuterJoin(rdd2).collect
res29: Array[(String, (String, Option[String]))] = Array((B,(2,None)), (A,(1,Some(a))), (C,(3,Some(c))))

scala> rdd1.rightOuterJoin(rdd2).collect
res30: Array[(String, (Option[String], String))] = Array((D,(None,d)), (A,(Some(1),a)), (C,(Some(3),c)))

scala> rdd1.fullOuterJoin(rdd2)
res31: org.apache.spark.rdd.RDD[(String, (Option[String], Option[String]))] = MapPartitionsRDD[46] at fullOuterJoin at <console>:28

scala> rdd1.fullOuterJoin(rdd2).collect
res32: Array[(String, (Option[String], Option[String]))] = Array((B,(Some(2),None)), (D,(None,Some(d))), (A,(Some(1),Some(a))), (C,(Some(3),Some(c))))
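Since the joins are built on cogroup, each join variant is just a different way of flattening the per-key pair of buffers that cogroup produces. The sketch below illustrates the idea on plain Scala collections; it is not Spark's actual implementation (which lives in PairRDDFunctions), just the per-key logic:

```scala
// Per-key logic of the joins, given cogroup's two grouped buffers.
// Inner join: cross product of the buffers, empty if either side is empty.
def inner[V, W](vs: Seq[V], ws: Seq[W]): Seq[(V, W)] =
  for (v <- vs; w <- ws) yield (v, w)

// Full outer join: instead of dropping an unmatched key,
// pad the missing side with None.
def fullOuter[V, W](vs: Seq[V], ws: Seq[W]): Seq[(Option[V], Option[W])] =
  if (ws.isEmpty) vs.map(v => (Some(v), None))
  else if (vs.isEmpty) ws.map(w => (None, Some(w)))
  else for (v <- vs; w <- ws) yield (Some(v), Some(w))

println(inner(Seq("1"), Seq("a")))              // List((1,a))
println(inner(Seq("2"), Seq.empty[String]))     // List()  -- key "B" vanishes
println(fullOuter(Seq.empty[String], Seq("d"))) // List((None,Some(d)))
```

leftOuterJoin and rightOuterJoin follow the same pattern, padding only the right or only the left side with None, which matches the `Option` appearing on exactly one side of their result types.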
  • subtractByKey[W](other: RDD[(K, W)]): RDD[(K, V)]

  • subtractByKey[W](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

  • subtractByKey[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

subtractByKey is similar to subtract, except that it operates on key-value RDD[(K, V)]s: it removes every pair whose key also appears in the other RDD, comparing keys only.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> rdd1.subtractByKey(rdd2).collect
res33: Array[(String, String)] = Array((B,2))

scala> rdd2.subtractByKey(rdd1).collect
res34: Array[(String, String)] = Array((D,d))
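The same semantics can be sketched on plain Scala collections (illustration only, not Spark's implementation; `subtractByKeyLocal` is a hypothetical helper):

```scala
// Sketch of subtractByKey semantics: keep only the pairs whose key
// does NOT occur in the other dataset; values are never compared.
def subtractByKeyLocal[K, V, W](left: Seq[(K, V)],
                                right: Seq[(K, W)]): Seq[(K, V)] = {
  val rightKeys = right.map(_._1).toSet
  left.filterNot { case (k, _) => rightKeys.contains(k) }
}

val left  = Seq(("A", "1"), ("B", "2"), ("C", "3"))
val right = Seq(("A", "a"), ("C", "c"), ("D", "d"))
println(subtractByKeyLocal(left, right)) // List((B,2))
println(subtractByKeyLocal(right, left)) // List((D,d))
```

Note the asymmetry, visible in res33 and res34 above: the result depends on which RDD the method is called on.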

