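Spark provides the repartitionAndSortWithinPartitions operator. Let's start with what it is for:
It repartitions an RDD with a partitioner you specify and sorts the records inside each resulting partition.
That makes it a good fit for requirements such as the following:
Example 1. Put the students of the same class in an RDD into the same partition, sorted by score in descending order.
Example 2. Send records that share the same composite key to the same partition; within the partition, sort by the key first and, when keys are equal, by the other fields.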
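First, look at how the official documentation describes the function:
URL: http://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs
As the documentation says, repartitionAndSortWithinPartitions uses the given partitioner to decide which partition each element goes to (elements with the same key always land in the same partition) and sorts the records in each partition by key. Tip: by supplying a custom ordering for the key, we can also do a secondary sort.
In addition, repartitionAndSortWithinPartitions is an efficient operator: it is faster than calling repartition and then sorting inside each partition, because the sort is pushed down into the shuffle, so the data is sorted while it is being shuffled. For details, see the read side of the Spark shuffle.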
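To make the comparison concrete, here is a minimal sketch that builds the same per-partition layout both ways. The object name FusedSortSketch, the sample pairs, and the choice of HashPartitioner with two partitions are illustrative assumptions, not part of the original example; it also assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object, not from the original article.
object FusedSortSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fused-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(
        Seq(3 -> "c", 1 -> "a", 4 -> "d", 2 -> "b", 6 -> "f", 5 -> "e"))
      val partitioner = new HashPartitioner(2)

      // Two steps: shuffle first, then buffer and sort every partition afterwards.
      val twoStep = pairs
        .partitionBy(partitioner)
        .mapPartitions(
          iter => iter.toArray.sortBy(_._1).iterator,
          preservesPartitioning = true)

      // One step: the key ordering is handed to the shuffle itself.
      val fused = pairs.repartitionAndSortWithinPartitions(partitioner)

      // glom() exposes each partition as an array so the layouts can be compared.
      val twoStepLayout = twoStep.glom().collect().map(_.toList).toList
      val fusedLayout = fused.glom().collect().map(_.toList).toList

      println(twoStepLayout)
      println(fusedLayout)
      assert(twoStepLayout == fusedLayout)
    } finally {
      sc.stop()
    }
  }
}
```

Both variants end up with the same records in the same order. The difference is where the work happens: the two-step version buffers each partition in memory and sorts it after the shuffle has finished, while the one-step version leaves the sorting to the shuffle.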
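A quick look at the source (sortByKey is shown alongside for comparison; both build a ShuffledRDD and set a key ordering on it):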
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

/**
 * Repartition the RDD according to the given partitioner and, within each resulting partition,
 * sort records by their keys.
 *
 * This is more efficient than calling `repartition` and then sorting within each partition
 * because it can push the sorting down into the shuffle machinery.
 */
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}
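The following concrete requirement shows how the operator is used:
Example 1. Put the students of the same class in an RDD into the same partition, sorted by score in descending order.
Implementation: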
package com.gaosi.spark.demo

/**
 * Created by szh on 2019/9/19.
 */
import org.apache.spark.{SparkConf, SparkContext}

class Student {
}

// Key class: the composite key is (grade, score)
case class StudentKey(grade: String, score: Int)
// Alternative left commented out: extend Ordered[StudentKey] instead of supplying an implicit Ordering
// extends Ordered[StudentKey] {
//   def compare(that: StudentKey): Int = {
//     var result: Int = this.grade.compareTo(that.grade)
//     if (result == 0) {
//       result = that.score.compareTo(this.score)
//     }
//     result
//   }
//}

object StudentKey {
  // Implicit ordering for the key: grade ascending, then score descending
  implicit def orderingByGradeStudentScore[A <: StudentKey]: Ordering[A] = {
    Ordering.by(fk => (fk.grade, fk.score * -1))
  }
}
// Partitioner class: partitions by grade only, so all scores of one class land in the same partition
import org.apache.spark.Partitioner

class StudentPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[StudentKey]
    Math.abs(k.grade.hashCode()) % numPartitions
  }
}
object Student {
  def main(args: Array[String]): Unit = {
    // Field indexes within each line of the (HDFS) input file
    val grade_idx: Int = 0
    val student_idx: Int = 1
    val course_idx: Int = 2
    val score_idx: Int = 3

    // Conversion helper: values that cannot be parsed as Int default to 0
    def safeInt(s: String): Int = try {
      s.toInt
    } catch {
      case _: NumberFormatException => 0
    }

    // Builds the key from a split line
    def createKey(data: Array[String]): StudentKey = {
      StudentKey(data(grade_idx), safeInt(data(score_idx)))
    }

    // Builds the value from a split line
    def listData(data: Array[String]): List[String] = {
      List(data(grade_idx), data(student_idx), data(course_idx), data(score_idx))
    }

    def createKeyValueTuple(data: Array[String]): (StudentKey, List[String]) = {
      (createKey(data), listData(data))
    }

    // Set master to local for local debugging
    val conf = new SparkConf().setAppName("Student_partition_sort").setMaster("local")
    val sc = new SparkContext(conf)

    // The student records are deliberately in shuffled order
    val student_array = Array(
      "c001,n003,chinese,59",
      "c002,n004,english,79",
      "c002,n004,chinese,13",
      "c001,n001,english,88",
      "c001,n002,chinese,10",
      "c002,n006,chinese,29",
      "c001,n001,chinese,54",
      "c001,n002,english,32",
      "c001,n003,english,43",
      "c002,n005,english,80",
      "c002,n005,chinese,48",
      "c002,n006,english,69"
    )

    // Parallelize the student records into an RDD
    val student_rdd = sc.parallelize(student_array)
    // Build a key-value RDD
    val student_rdd2 = student_rdd.map(line => line.split(",")).map(createKeyValueTuple)
    // Partition by the grade in StudentKey and sort by score in descending order
    val student_rdd3 = student_rdd2.repartitionAndSortWithinPartitions(new StudentPartitioner(10))
    // Print the data
    student_rdd3.collect.foreach(println)

    // Release the local Spark resources
    sc.stop()
  }
}
Output:
(StudentKey(c001,88),List(c001, n001, english, 88))
(StudentKey(c001,59),List(c001, n003, chinese, 59))
(StudentKey(c001,54),List(c001, n001, chinese, 54))
(StudentKey(c001,43),List(c001, n003, english, 43))
(StudentKey(c001,32),List(c001, n002, english, 32))
(StudentKey(c001,10),List(c001, n002, chinese, 10))
(StudentKey(c002,80),List(c002, n005, english, 80))
(StudentKey(c002,79),List(c002, n004, english, 79))
(StudentKey(c002,69),List(c002, n006, english, 69))
(StudentKey(c002,48),List(c002, n005, chinese, 48))
(StudentKey(c002,29),List(c002, n006, chinese, 29))
(StudentKey(c002,13),List(c002, n004, chinese, 13))
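Two points deserve attention here.
The first is the partitioner. It looks only at grade, not at the whole key, so all records of one class end up in the same partition regardless of score: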
import org.apache.spark.Partitioner

class StudentPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[StudentKey]
    Math.abs(k.grade.hashCode()) % numPartitions
  }
}
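Note that hashCode can be negative, which is why Math.abs is applied before taking the modulus. One caveat: Math.abs(Int.MinValue) is still negative, so for full robustness take the modulus first and add numPartitions when the result is negative.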
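The partition index computation can be tried out in plain Scala without Spark. The sketch below is only an illustration: the object NonNegativeModSketch and the helper nonNegativeMod are hypothetical names introduced here, not part of the original code. It shows the corner case where Math.abs does not help, and which partition each class in the example lands in when there are 10 partitions.

```scala
// Hypothetical demo object, not from the original article.
object NonNegativeModSketch {
  // Maps any Int hash to an index in [0, mod), assuming mod > 0.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 10

    // Math.abs(Int.MinValue) overflows and stays negative, so the index is negative too.
    println(Math.abs(Int.MinValue) % numPartitions)          // prints -8

    // Taking the modulus first and then shifting keeps the index in range.
    println(nonNegativeMod(Int.MinValue, numPartitions))     // prints 2

    // Partition indexes for the two classes used in the example.
    println(nonNegativeMod("c001".hashCode, numPartitions))  // prints 4
    println(nonNegativeMod("c002".hashCode, numPartitions))  // prints 5
  }
}
```

Since c001 maps to partition 4 and c002 to partition 5, and collect returns partitions in index order, the c001 records appear before the c002 records in the output above.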
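The second point is how the ordering is implemented. repartitionAndSortWithinPartitions needs an implicit Ordering for the key type, and the one defined in the StudentKey companion object is found automatically. It compares grade in ascending order first, then the negated score, which yields descending scores: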
object StudentKey {
  implicit def orderingByGradeStudentScore[A <: StudentKey]: Ordering[A] = {
    Ordering.by(fk => (fk.grade, fk.score * -1))
  }
}
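The effect of such an ordering can be checked in plain Scala without Spark. The following is a minimal sketch under stated assumptions: OrderingSketch and gradeThenScoreDesc are hypothetical names, StudentKey is redeclared so the snippet stands alone, and the score is compared with a reversed Int ordering rather than multiplied by -1. That is a swapped-in variant, not the article's implementation; it gives the same result for ordinary scores and avoids the overflow that negating Int.MinValue would cause.

```scala
// Redeclared here so the sketch is self-contained.
final case class StudentKey(grade: String, score: Int)

// Hypothetical demo object, not from the original article.
object OrderingSketch {
  // grade ascending, then score descending, without negating the score
  val gradeThenScoreDesc: Ordering[StudentKey] =
    Ordering.by[StudentKey, (String, Int)](k => (k.grade, k.score))(
      Ordering.Tuple2(Ordering.String, Ordering.Int.reverse))

  val keys: List[StudentKey] = List(
    StudentKey("c002", 79),
    StudentKey("c001", 59),
    StudentKey("c001", 88),
    StudentKey("c002", 80))

  val sortedKeys: List[StudentKey] = keys.sorted(gradeThenScoreDesc)

  def main(args: Array[String]): Unit = {
    sortedKeys.foreach(println)
    // StudentKey(c001,88)
    // StudentKey(c001,59)
    // StudentKey(c002,80)
    // StudentKey(c002,79)
  }
}
```

To use an ordering like this with repartitionAndSortWithinPartitions, it would have to be visible as an implicit Ordering for the key type, for example by declaring it implicit in the key's companion object as the article does.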