1. From the shuffle perspective
Both reduceByKey and groupByKey involve a shuffle, but reduceByKey can pre-aggregate (combine) the values of identical keys within each partition before the shuffle, which reduces the amount of data spilled to disk and transferred over the network. groupByKey only groups the records and does not shrink the data volume at all, so reduceByKey performs better, as the sketch below illustrates.
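A minimal sketch of the two shuffle paths, assuming the spark-shell's built-in SparkContext sc; the names pairs, viaReduce, and viaGroup are illustrative only:

// With reduceByKey, each partition first collapses its own duplicate keys,
// so at most one (key, partialSum) record per key per partition crosses the shuffle.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)), 2)
val viaReduce = pairs.reduceByKey(_ + _)
// With groupByKey, every raw (key, 1) record is shuffled; the values are only
// collected into an Iterable on the reduce side, never combined map-side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)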
2. From the functional perspective
reduceByKey provides both grouping and aggregation, whereas groupByKey can only group and cannot aggregate. For group-then-aggregate workloads, reduceByKey is therefore the recommended choice; if you only need to group without aggregating, groupByKey is the operator to use, as in the sketch below.
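For example, when every value of a key must be kept rather than collapsed, only groupByKey fits; a minimal sketch, with the data and the names eventsByUser and sessions invented for illustration:

// Collect all events per user; summing or reducing here would lose
// information, so grouping without aggregation is exactly what we want.
val eventsByUser = sc.parallelize(Seq(("u1", "login"), ("u1", "click"), ("u2", "login")))
val sessions = eventsByUser.groupByKey() // RDD[(String, Iterable[String])]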
scala> val words=Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRdd=sc.parallelize(words).map(word=>(word,1))
wordPairsRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:26 // note the element type: (String, Int)
// Apply the reduceByKey method
scala> val wordCountsWithReduce=wordPairsRdd.reduceByKey(_+_) // reduceByKey(_+_) here is equivalent to reduceByKey((a,b)=>a+b): it sums up the list of values in each (key, value-list)
wordCountsWithReduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:25
// Final output
scala> wordCountsWithReduce.foreach(println)
(one,1)
(two,2)
(three,3)
// Apply the groupByKey method
scala> val wordCountsWithGroup=wordPairsRdd.groupByKey()
wordCountsWithGroup: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[3] at groupByKey at <console>:25 // note the element type: (String, Iterable[Int])
scala> wordCountsWithGroup.foreach(println)
(two,CompactBuffer(1, 1))
(three,CompactBuffer(1, 1, 1))
(one,CompactBuffer(1))
scala> wordCountsWithGroup.map(t=>(t._1,t._2.sum)) // walk through each element of the RDD: the first element (two,CompactBuffer(1, 1)) is bound to t, so t._1 is "two", t._2 is CompactBuffer(1, 1), and t._2.sum = 1+1 = 2
res7: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
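As a side note, this map over pairs can also be written with mapValues, which leaves the keys untouched and, unlike map, preserves the parent RDD's partitioner; wordCountsViaMapValues is an illustrative name:

// Same per-key sum via mapValues; since Spark knows the keys are unchanged,
// the result keeps wordCountsWithGroup's partitioner.
val wordCountsViaMapValues = wordCountsWithGroup.mapValues(_.sum)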
// Final output; the print order differs from the reduceByKey run because foreach prints from the executors, so ordering across partitions is not deterministic
scala> wordCountsWithGroup.map(t=>(t._1,t._2.sum)).foreach(println)
(three,3)
(two,2)
(one,1)