Spark Programming: Basic RDD Operators zip, zipPartitions, zipWithIndex, zipWithUniqueId
- 1) The zip operator
First, the basic API:
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
The source RDD has elements of type T and the other RDD has elements of type U; zip pairs them together into tuples, so the resulting RDD has elements of type (T, U).
The i-th element of one RDD is always paired with the i-th element of the other.
zip combines two RDDs into a key/value-style RDD. It requires that both RDDs have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
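A minimal sketch of the two failure modes (assuming a live SparkContext named sc; the quoted exception messages come from Spark's source and may differ slightly across versions):

```scala
// Assumes a running SparkContext `sc`
val x = sc.parallelize(1 to 10, 3)  // 3 partitions
val y = sc.parallelize(1 to 10, 4)  // 4 partitions
// x.zip(y)
//   => IllegalArgumentException:
//      "Can't zip RDDs with unequal numbers of partitions"

val z = sc.parallelize(1 to 9, 3)   // same partition count, unequal element counts
// x.zip(z).collect()
//   => SparkException:
//      "Can only zip RDDs with same number of elements in each partition"
```

Note that the partition-count check fails eagerly when the zipped RDD is built, while the per-partition element-count check only fails when an action such as collect actually runs the job.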
val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
a.zip(b).collect
//each element is paired with its counterpart
res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104),
(5,105), (6,106), (7,107), (8,108), (9,109), (10,110), (11,111), (12,112),
(13,113), (14,114), (15,115), (16,116), (17,117), (18,118), (19,119),
(20,120), (21,121), (22,122), (23,123), (24,124), (25,125), (26,126),
(27,127), (28,128), (29,129), (30,130), (31,131), (32,132), (33,133)...
val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
val c = sc.parallelize(201 to 300, 3)
//zip can also be applied repeatedly; the resulting tuples then hold multiple values
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect
res12: Array[(Int, Int, Int)] = Array((1,101,201), ...
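The nested-tuple flattening above can be written more readably with pattern matching. A minimal sketch on plain Scala Lists (RDD zip mirrors collection zip element-for-element, and the same case pattern works inside the RDD map above):

```scala
object ZipFlatten {
  def main(args: Array[String]): Unit = {
    val a = List(1, 2, 3)
    val b = List(101, 102, 103)
    val c = List(201, 202, 203)
    // a.zip(b).zip(c) yields ((a_i, b_i), c_i);
    // the case pattern destructures the nested pair into a flat triple
    val flattened = a.zip(b).zip(c).map { case ((x, y), z) => (x, y, z) }
    println(flattened) // List((1,101,201), (2,102,202), (3,103,203))
  }
}
```

The pattern-match form avoids the hard-to-read chained accessors x._1._1, x._1._2, x._2.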