1. From the shuffle perspective
Both reduceByKey and groupByKey involve a shuffle, but reduceByKey can pre-aggregate (combine) the values of identical keys within each partition before the shuffle, which reduces the amount of data spilled to disk and transferred over the network. groupByKey only groups the records and does not shrink the data volume at all, so reduceByKey performs better, as the sketch below illustrates.
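A minimal sketch of the two shuffle paths, assuming the spark-shell's built-in SparkContext sc; the names pairs, viaReduce, and viaGroup are illustrative only:

// With reduceByKey, each partition first collapses its own duplicate keys,
// so at most one (key, partialSum) record per key per partition crosses the shuffle.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)), 2)
val viaReduce = pairs.reduceByKey(_ + _)
// With groupByKey, every raw (key, 1) record is shuffled; the values are only
// collected into an Iterable on the reduce side, never combined map-side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)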
2. From the functional perspective
reduceByKey provides both grouping and aggregation, whereas groupByKey can only group and cannot aggregate. For group-then-aggregate workloads, reduceByKey is therefore the recommended choice; if you only need to group without aggregating, groupByKey is the operator to use, as in the sketch below.
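For example, when every value of a key must be kept rather than collapsed, only groupByKey fits; a minimal sketch, with the data and the names eventsByUser and sessions invented for illustration:

// Collect all events per user; summing or reducing here would lose
// information, so grouping without aggregation is exactly what we want.
val eventsByUser = sc.parallelize(Seq(("u1", "login"), ("u1", "click"), ("u2", "login")))
val sessions = eventsByUser.groupByKey() // RDD[(String, Iterable[String])]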
scala> val words=Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRdd=sc.parallelize(words).map(word=>(word,1))
wordPairsRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:26 // note the element type: (String, Int)
// Apply the reduceByKey method
scala> val wordCountsWithReduce=wordPairsRdd.reduceByKey(_+_) // reduceByKey(_+_) here is equivalent to reduceByKey((a,b)=>a+b): it sums up the list of values in each (key, value-list)
wordCountsWithReduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:25
// Final output
scala> wordCountsWithReduce.foreach(println)
(one,1)
(two,2)
(three,3)
// Apply the groupByKey method
scala> val wordCountsWithGroup=wordPairsRdd.groupByKey()
wordCountsWithGroup: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[3] at groupByKey at <console>:25 // note the element type: (String, Iterable[Int])
scala> wordCountsWithGroup.foreach(println)
(two,CompactBuffer(1, 1))
(three,CompactBuffer(1, 1, 1))
(one,CompactBuffer(1))
scala> wordCountsWithGroup.map(t=>(t._1,t._2.sum)) // walk through each element of the RDD: the first element (two,CompactBuffer(1, 1)) is bound to t, so t._1 is "two", t._2 is CompactBuffer(1, 1), and t._2.sum = 1+1 = 2
res7: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
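As a side note, this map over pairs can also be written with mapValues, which leaves the keys untouched and, unlike map, preserves the parent RDD's partitioner; wordCountsViaMapValues is an illustrative name:

// Same per-key sum via mapValues; since Spark knows the keys are unchanged,
// the result keeps wordCountsWithGroup's partitioner.
val wordCountsViaMapValues = wordCountsWithGroup.mapValues(_.sum)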
// Final output; the print order differs from the reduceByKey run because foreach prints from the executors, so ordering across partitions is not deterministic
scala> wordCountsWithGroup.map(t=>(t._1,t._2.sum)).foreach(println)
(three,3)
(two,2)
(one,1)