Pair RDD

1. Creating pair RDDs

When you need to turn an ordinary RDD into a pair RDD, you can use the map() function. For example:

scala> val lines = sc.makeRDD(List(("No pains"),("No gains")))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> val pairs = lines.map(line => (line.split(" ")(0),line))
pairs: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[1] at map at <console>:26

scala> pairs.collect
res0: Array[(String, String)] = Array((No,No pains), (No,No gains))   

2. Common pair RDD transformations

  • reduceByKey(func): merges the values that share the same key
scala> val r1 = pairs.reduceByKey((x,y) => (x.concat(" " + y)))
r1: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[3] at reduceByKey at <console>:28

scala> r1.collect
res2: Array[(String, String)] = Array((No,No pains No gains))
  • groupByKey(): groups the values by key
scala> val r2 = pairs.groupByKey()
r2: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[4] at groupByKey at <console>:28

scala> r2.collect
res3: Array[(String, Iterable[String])] = Array((No,CompactBuffer(No pains, No gains)))

  • mapValues(func): applies func to each value in the pair RDD without changing the keys

scala> val r3 = pairs.mapValues(x => x.concat(" Y "))
r3: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[5] at mapValues at <console>:28

scala> r3.collect
res4: Array[(String, String)] = Array((No,"No pains Y "), (No,"No gains Y "))
  • flatMapValues(func): applies a function that returns an iterator to each value, emitting one pair per returned element with the original key (see the sketch just below).
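A minimal illustration (not part of the original session) using the pairs RDD from above, splitting each value into words:

// Hypothetical example: the key is repeated for every element the function returns.
pairs.flatMapValues(line => line.split(" ")).collect
// => Array((No,No), (No,pains), (No,No), (No,gains))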

  • keys: returns an RDD containing only the keys

scala> val r4 = pairs.keys
r4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at keys at <console>:28

scala> r4.collect
res5: Array[String] = Array(No, No)
  • values: returns an RDD containing only the values
scala> val r5 = pairs.values
r5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at values at <console>:28

scala> r5.collect
res6: Array[String] = Array(No pains, No gains)
  • sortByKey(): returns an RDD sorted by key
scala> val r6 = pairs.sortByKey()
r6: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[10] at sortByKey at <console>:28

scala> r6.collect
res7: Array[(String, String)] = Array((No,No pains), (No,No gains))
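
Because pairs has only one distinct key, the sort is not visible above. A small hypothetical example with several keys makes the ordering clear:

// Hypothetical example: with multiple distinct keys, sortByKey orders the pairs by key.
val fruit = sc.makeRDD(List(("banana", 2), ("apple", 5), ("cherry", 1)))
fruit.sortByKey().collect
// => Array((apple,5), (banana,2), (cherry,1))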

3. Transformations on two pair RDDs:

  • subtractByKey: removes elements from the RDD whose key matches a key in the other RDD
scala> val rdd = sc.makeRDD(List((1,2),(3,4),(3,6)))
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:24

scala> val other = sc.makeRDD(List((3,9)))
other: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at makeRDD at <console>:24

scala> val r4 = rdd.subtractByKey(other)

scala> r4.collect
res10: Array[(Int, Int)] = Array((1,2))
  • join: performs an inner join between the two RDDs
scala> val r5 = rdd.join(other)
r5: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[22] at join at <console>:28

scala> r5.collect
res12: Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
  • rightOuterJoin: right outer join
scala> rdd.rightOuterJoin(other).collect
res13: Array[(Int, (Option[Int], Int))] = Array((3,(Some(4),9)), (3,(Some(6),9)))

  • leftOuterJoin: left outer join
scala> rdd.leftOuterJoin(other).collect
res14: Array[(Int, (Int, Option[Int]))] = Array((1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9))))
  • cogroup: groups data that shares the same key from both RDDs
scala> rdd.cogroup(other).collect
res16: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2),CompactBuffer())), (3,(CompactBuffer(4, 6),CompactBuffer(9))))

4. Aggregations: computing statistics over the elements that share a key is a very common operation.

  • reduceByKey(func): runs parallel reduce operations over the values for each key in the dataset.
  • mapValues(func): applies a function to each value
scala> val rdd1 = sc.makeRDD(List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))

scala> val r = rdd1.mapValues(x => (x,1))
r: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[35] at mapValues at <console>:26

scala> r.collect
res17: Array[(String, (Int, Int))] = Array((panda,(0,1)), (pink,(3,1)), (pirate,(3,1)), (panda,(1,1)), (pink,(4,1)))

scala> r.reduceByKey((x,y) => (x._1 + y._1,x._2 + y._2)).collect
res20: Array[(String, (Int, Int))] = Array((pirate,(3,1)), (panda,(1,2)), (pink,(7,2)))

This is also how word count works: first turn each word into a key-value pair with an initial count of 1, then use reduceByKey to sum the counts.
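
A minimal word-count sketch along these lines (illustrative, not from the original session):

// Illustrative word count: map each word to (word, 1), then sum the counts per key.
val text = sc.makeRDD(List("No pains", "No gains"))
val counts = text.flatMap(_.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.collect
// => Array((pains,1), (gains,1), (No,2))   (order may vary)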

5. combineByKey(): the most general per-key aggregation function; most of the other per-key aggregation functions are implemented on top of it.

How it works:

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, 
    mergeCombiners: (C, C) => C) : RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, 
    mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, 
    mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine:
    Boolean = true, serializer: Serializer = null): RDD[(K, C)]

As combineByKey() traverses the elements in a partition, each element's key either has not been seen before or matches the key of a previously seen element. If it is a new key, combineByKey() uses a function called createCombiner() to create the initial value of the accumulator for that key.
If the key has already appeared earlier in the current partition, it instead uses mergeValue() to merge the accumulator's current value for that key with the new value. Finally, because each partition is processed independently, mergeCombiners() merges the accumulators produced for the same key by different partitions.

scala> var rdd1 = sc.makeRDD(Array(("A",1),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[64] at makeRDD at <console>:21

scala> rdd1.combineByKey(
     |       (v : Int) => v + "_",   
     |       (c : String, v : Int) => c + "@" + v,  
     |       (c1 : String, c2 : String) => c1 + "$" + c2
     |     ).collect
res60: Array[(String, String)] = Array((A,2_$1_), (B,1_$2_), (C,1_))
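
A common use of combineByKey() is computing the per-key average, as with the counts-and-sums idea from section 4. A sketch, reusing that section's sample data (illustrative, not from the original session):

// Illustrative per-key average with combineByKey:
//   createCombiner: start a (sum, count) accumulator from the first value for a key
//   mergeValue:     fold another value from the same partition into the accumulator
//   mergeCombiners: combine accumulators for the same key across partitions
val scores = sc.makeRDD(List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
val avg = scores.combineByKey(
    (v: Int) => (v, 1),
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
  ).mapValues { case (sum, count) => sum.toDouble / count }
avg.collect
// => Array((pirate,3.0), (panda,0.5), (pink,3.5))   (order may vary)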