在Spark中专门针对RDD[(K,V)]类型数据集提供了xxxByKey算子实现对RDD[(K,V)]类型针对性实现计算。
groupByKey([ numPartitions ])
类似于MapReduce计算模型。将RDD[(K, V)] 转换为RDD[ (K, Iterable)]
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupByKey.collect
res3: Array[(String, Iterable[Int])] = Array((this,CompactBuffer(1)),
(is,CompactBuff)), (good,CompactBuffer(1, 1)))
groupBy(f:(k,v)=> T)
根据指定的位置进行分组
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1)
res5: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[18] at
groupBy at <console>:26
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1).map(t=>
(t._1,t._2.size)).collect
res6: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
reduceByKey
reduceByKey( func , [ numPartitions ])
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
aggregateByKey
aggregateByKey( zeroValue )( seqOp , combOp , [ numPartitions ])
在(K,V)对的数据集上调用时,返回(K,U)对的数据集,其中每个键的值使用给定的组合函数和中性“零”值进行聚合。允许聚合值类型,即di!与输入值类型不同,同时避免不必要的分配。就像在groupByKey,reduce任务的数量可以通过可选的第二个参数配置。
他比reduceByKey要灵活一些,因为他输入(K,V),可以返回(K,U),V到U数据类型可以变化。
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
sortByKey
sortByKey([ ascending ], [ numPartitions ])
在K实现有序的(K,V)对数据集上调用时,返回(K,V)对数据集按键升序或降序排序,如布尔升序参数中指定的 sortByKey(true/false)默认按照第一个字段 排序 true升序
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortByKey(true).collect
res13: Array[(String, Int)] = Array((good,2), (is,1), (this,1))
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortByKey(false).collect
res14: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
sortBy
sortBy(字段,true/false) 第一个参数代表用第几个字段排序 第二个参数代表升序/降序
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortBy(_._2,false).collect
res18: Array[(String, Int)] = Array((good,2), (this,1), (is,1))
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortBy(t=>t,false).collect
res19: Array[(String, Int)] = Array((this,1), (is,1), (good,2))