Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally

关键字:Spark算子、Spark RDD键值转换、groupByKey、reduceByKey、reduceByKeyLocally

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

该函数用于将RDD[K,V]中每个K对应的V值,合并到一个集合Iterable[V]中,

参数numPartitions用于指定分区数;

参数partitioner用于指定分区函数;

 

 
 
  1. scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
  2. rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at :21
  3.  
  4. scala> rdd1.groupByKey().collect
  5. res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
  6.  

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

该函数用于将RDD[K,V]中每个K对应的V值根据映射函数来运算。

参数numPartitions用于指定分区数;

参数partitioner用于指定分区函数;

 
 
  1. scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
  2. rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21
  3.  
  4. scala> rdd1.partitions.size
  5. res82: Int = 15
  6.  
  7. scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y)
  8. rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at :23
  9.  
  10. scala> rdd2.collect
  11. res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
  12.  
  13. scala> rdd2.partitions.size
  14. res86: Int = 15
  15.  
  16. scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y)
  17. rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at :23
  18.  
  19. scala> rdd2.collect
  20. res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
  21.  
  22. scala> rdd2.partitions.size
  23. res88: Int = 2
  24.  

reduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K, V]

该函数将RDD[K,V]中每个K对应的V值根据映射函数来运算,运算结果映射到一个Map[K,V]中,而不是RDD[K,V]。

 
 
  1. scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
  2. rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21
  3.  
  4. scala> rdd1.reduceByKeyLocally((x,y) => x + y)
  5. res90: scala.collection.Map[String,Int] = Map(B -> 3, A -> 2, C -> 1)
  6.  
  7.  

更多关于Spark算子的介绍,可参考spark算子系列文章:

http://blog.csdn.net/ljp812184246/article/details/53895299

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值