Spark Operators: RDD Key-Value Transformations (4) – cogroup/join

铭霏

cogroup

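Function prototypes: cogroup can combine up to four RDDs (the calling RDD plus up to three others). The partitioning of the result can be controlled by passing either a partitioner or numPartitions.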

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)],
      partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)],
      numPartitions: Int)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int)
      : RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)])
      : RDD[(K, (Iterable[V], Iterable[W]))]

Input:

    val data1 = sc.parallelize(List((1, "1.101"), (2, "1.201"),(1, "1.102"), (2, "1.202"),(1, "1.103"), (2, "1.203")))
    val data2 = sc.parallelize(List((1, "2.101"), (2, "2.201"), (3, "2.301"),(1, "2.102"), (2, "2.202"), (3, "2.302")))
    val data3 = sc.parallelize(List((1, "3.101"), (2, "3.201"), (3, "3.303"),(1, "3.102"), (2, "3.202"), (3, "3.303")))
    val result = data1.cogroup(data2, data3)
    result.collect.foreach(println)

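Output (every key that appears in any of the three RDDs is present, and an RDD with no values for a key contributes an empty CompactBuffer; the order of keys and of values within each buffer is not guaranteed):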

scala> result.collect.foreach(println)
(1,(CompactBuffer(1.101, 1.102, 1.103),CompactBuffer(2.102, 2.101),CompactBuffer(3.101, 3.102)))
(2,(CompactBuffer(1.201, 1.202, 1.203),CompactBuffer(2.202, 2.201),CompactBuffer(3.202, 3.201)))
(3,(CompactBuffer(),CompactBuffer(2.301, 2.302),CompactBuffer(3.303, 3.303)))
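The shape of this result can be reproduced with plain Scala collections. The following is only a rough local sketch of the semantics, not Spark's implementation: `CogroupSketch` and `cogroupLocal` are hypothetical names introduced here for illustration, and the input data is a shortened version of the example above.

```scala
object CogroupSketch {
  // Hypothetical helper: an in-memory model of a three-way cogroup.
  // It only mirrors the result shape; it is not how Spark executes cogroup.
  def cogroupLocal[K, V, W1, W2](
      a: Seq[(K, V)],
      b: Seq[(K, W1)],
      c: Seq[(K, W2)]): Map[K, (Seq[V], Seq[W1], Seq[W2])] = {
    // Every key that appears in at least one input shows up in the result.
    val keys = (a.map(_._1) ++ b.map(_._1) ++ c.map(_._1)).distinct
    keys.map { k =>
      // An input without values for k contributes an empty Seq.
      (k, (
        a.filter(_._1 == k).map(_._2),
        b.filter(_._1 == k).map(_._2),
        c.filter(_._1 == k).map(_._2)
      ))
    }.toMap
  }

  // Shortened sample data (assumption: same style as the article's input).
  val data1 = Seq((1, "1.101"), (2, "1.201"), (1, "1.102"))
  val data2 = Seq((1, "2.101"), (3, "2.301"))
  val data3 = Seq((2, "3.201"), (3, "3.301"))

  val grouped = cogroupLocal(data1, data2, data3)
}
```

Here `CogroupSketch.grouped(3)` has an empty first component because key 3 never occurs in `data1`, which corresponds to the empty `CompactBuffer()` in the Spark output above.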

join

Function prototypes

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

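join is equivalent to an inner join in SQL: it returns only the records whose key K is present in both RDDs. A single join call works on exactly two RDDs; to join more RDDs, chain several join calls.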

val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
 
scala> rdd1.join(rdd2).collect
res10: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
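To illustrate the inner-join behaviour and the chaining of several joins, here is a minimal sketch using plain Scala collections. It is a local model under stated assumptions, not Spark code: `JoinSketch` and `joinLocal` are hypothetical names, and `rdd3` is made-up sample data that does not appear in the original example.

```scala
object JoinSketch {
  // Hypothetical helper: an in-memory model of an inner join on the key.
  // Only keys present on both sides produce output.
  def joinLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  // Same data as the article's rdd1 and rdd2, as local sequences.
  val rdd1 = Seq(("A", "1"), ("B", "2"), ("C", "3"))
  val rdd2 = Seq(("A", "a"), ("C", "c"), ("D", "d"))
  // Assumption: a third data set invented for this sketch.
  val rdd3 = Seq(("A", "x"), ("D", "z"))

  // B and D are dropped because each occurs on one side only.
  val twoWay = joinLocal(rdd1, rdd2)

  // Chaining a second join nests the value tuples: (K, ((V, W), X)).
  val threeWay = joinLocal(twoWay, rdd3)
}
```

With real RDDs the analogous chained call would be `rdd1.join(rdd2).join(rdd3)`, and the values nest in the same way, so a final `map` is often used to flatten them.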