Spark Operators: Basic RDD Transformations (6) – zip, zipPartitions

zip

def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]

The zip function combines two RDDs into an RDD of key/value pairs. It requires that the two RDDs have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.

 
 
scala> var rdd1 = sc.makeRDD(1 to 10,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:21

// rdd1 is redefined with 5 elements so that it matches rdd2 below
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:21

scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:21

scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))

scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))

scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"),3)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at <console>:21

scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
// If the two RDDs have different numbers of partitions, an exception is thrown
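zip also requires that the two RDDs hold the same number of elements in each partition; unlike the partition-count check above, this is only enforced when the job actually runs. A minimal sketch, not from the original session (rddA is an illustrative name, and the exact error message may vary across Spark versions):

scala> var rddA = sc.makeRDD(1 to 6,2)   // two partitions like rdd2, but 6 elements instead of 5

scala> rddA.zip(rdd2).collect
org.apache.spark.SparkException: ... Can only zip RDDs with same number of elements in each partition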

zipPartitions

The zipPartitions function combines multiple RDDs partition by partition into a new RDD. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements inside each partition.

There are several overloads of this function, which fall into three groups:

  • One RDD as parameter

def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

The only difference between these two is the preservesPartitioning parameter, which controls whether the partitioner of the parent RDD is kept on the result.

The mapping function f takes the iterators of the two RDDs' corresponding partitions as its arguments.
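The example session below leaves preservesPartitioning at its default (false). Here is a minimal sketch of the second overload; the RDD names and the HashPartitioner setup are illustrative, not from the original post:

import org.apache.spark.HashPartitioner

// Two RDDs hash-partitioned the same way, so matching keys share a partition index.
val a = sc.makeRDD(Seq((1, "A"), (2, "B"), (3, "C"), (4, "D"))).partitionBy(new HashPartitioner(2))
val b = sc.makeRDD(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"))).partitionBy(new HashPartitioner(2))

// f leaves the keys untouched (it simply concatenates the two partitions),
// so it is safe to tell Spark that the parent's partitioning still holds.
val merged = a.zipPartitions(b, preservesPartitioning = true) {
  (aIter, bIter) => aIter ++ bIter
}

merged.partitioner  // Some(HashPartitioner): kept because preservesPartitioning = true;
                    // with the default (false) it would be None, and a later
                    // reduceByKey/join on the same keys would have to shuffle again.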

 
 
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at makeRDD at <console>:21

scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at makeRDD at <console>:21

// Element distribution across rdd1's two partitions:
scala> rdd1.mapPartitionsWithIndex{
     |   (x,iter) => {
     |     var result = List[String]()
     |     while(iter.hasNext){
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res17: Array[String] = Array(part_0|2, part_0|1, part_1|5, part_1|4, part_1|3)

// Element distribution across rdd2's two partitions:
scala> rdd2.mapPartitionsWithIndex{
     |   (x,iter) => {
     |     var result = List[String]()
     |     while(iter.hasNext){
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res18: Array[String] = Array(part_0|B, part_0|A, part_1|E, part_1|D, part_1|C)

// zipPartitions over rdd1 and rdd2:
scala> rdd1.zipPartitions(rdd2){
     |   (rdd1Iter,rdd2Iter) => {
     |     var result = List[String]()
     |     while(rdd1Iter.hasNext && rdd2Iter.hasNext) {
     |       result ::= (rdd1Iter.next() + "_" + rdd2Iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res19: Array[String] = Array(2_B, 1_A, 5_E, 4_D, 3_C)

Note that the results appear reversed within each partition because ::= prepends each element to the list.
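Because f receives the raw partition iterators, the two sides may contribute different numbers of elements per partition: f alone decides what happens to any leftovers (the while loop above simply drops them once either iterator is exhausted). A minimal sketch, not from the original post, that pads the shorter side instead:

val rddB = sc.makeRDD(1 to 6, 2)                    // partitions: (1,2,3), (4,5,6)
val rddC = sc.makeRDD(Seq("A","B","C","D","E"), 2)  // partitions: (A,B), (C,D,E)

// zipAll pads the shorter iterator with "-" instead of dropping leftovers.
rddB.zipPartitions(rddC) { (i1, i2) =>
  i1.map(_.toString).zipAll(i2, "-", "-").map { case (a, b) => a + "_" + b }
}.collect
// expected: Array(1_A, 2_B, 3_-, 4_C, 5_D, 6_E)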
  • Two RDDs as parameters

def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]

def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]

Usage is the same as above, except that this overload takes two RDDs as arguments, and the mapping function f receives three iterators (one for this RDD and one for each argument RDD).

 
 
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:21

scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at makeRDD at <console>:21

scala> var rdd3 = sc.makeRDD(Seq("a","b","c","d","e"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[29] at makeRDD at <console>:21

// Element distribution across rdd3's two partitions:
scala> rdd3.mapPartitionsWithIndex{
     |   (x,iter) => {
     |     var result = List[String]()
     |     while(iter.hasNext){
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res21: Array[String] = Array(part_0|b, part_0|a, part_1|e, part_1|d, part_1|c)

// zipPartitions over the three RDDs:
scala> var rdd4 = rdd1.zipPartitions(rdd2,rdd3){
     |   (rdd1Iter,rdd2Iter,rdd3Iter) => {
     |     var result = List[String]()
     |     while(rdd1Iter.hasNext && rdd2Iter.hasNext && rdd3Iter.hasNext) {
     |       result ::= (rdd1Iter.next() + "_" + rdd2Iter.next() + "_" + rdd3Iter.next())
     |     }
     |     result.iterator
     |   }
     | }
rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[33] at zipPartitions at <console>:27

scala> rdd4.collect
res23: Array[String] = Array(2_B_b, 1_A_a, 5_E_e, 4_D_d, 3_C_c)
  • Three RDDs as parameters

def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]

def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]

Usage is the same as above; there is simply one more RDD involved, so the mapping function f receives four iterators. A minimal sketch follows below.
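The original post gives no example for this variant; here is a minimal sketch (the RDD names and data are illustrative):

val rdd1 = sc.makeRDD(1 to 4, 2)
val rdd2 = sc.makeRDD(Seq("A","B","C","D"), 2)
val rdd3 = sc.makeRDD(Seq("a","b","c","d"), 2)
val rdd4 = sc.makeRDD(Seq("w","x","y","z"), 2)

// f receives four iterators, one per RDD, for each group of corresponding partitions.
rdd1.zipPartitions(rdd2, rdd3, rdd4) { (i1, i2, i3, i4) =>
  i1.zip(i2).zip(i3).zip(i4).map {
    case (((a, b), c), d) => a + "_" + b + "_" + c + "_" + d
  }
}.collect
// expected: Array(1_A_a_w, 2_B_b_x, 3_C_c_y, 4_D_d_z)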

Please credit the source when reposting: lxw的大数据田地 » Spark Operators: Basic RDD Transformations (6) – zip, zipPartitions
