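By default, sortBy sorts in ascending order: its second parameter, ascending, defaults to true.
The first parameter is a function that takes an element of the RDD's generic type T and returns the sort key of type K. The RDD returned by sortBy has the same element type T as the original RDD.
The third parameter is numPartitions, which sets the number of partitions of the sorted RDD. By default the sorted RDD has the same number of partitions as before the sort, i.e. this.partitions.size.
sortBy also takes an implicit Ordering[K] parameter. You can import a custom implicit Ordering, or simply return a tuple as the key and rely on the built-in tuple ordering: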
val value1: RDD[(String, Int, Double)] = value.sortBy(x => (-x._3, x._2))
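The key here is a Tuple2. Tuples are compared element by element, in order: the first elements are compared first, and the second elements are compared only if the first are equal. So the two-element key above sorts by -x._3 first.
Negating the value gives descending order on x._3, with x._2 ascending as the tie-breaker.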
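Tuple ordering source code:
def compare(x: (T1, T2), y: (T1, T2)): Int = {
  val compare1 = ord1.compare(x._1, y._1)
  if (compare1 != 0) return compare1
  val compare2 = ord2.compare(x._2, y._2)
  if (compare2 != 0) return compare2
  0
}
The first elements are compared first. A positive result (for example 1) means x._1 > y._1, and a negative result (for example -1) means the opposite. The second elements are compared only when the first comparison returns 0.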
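The tuple rule can be tried without a cluster. The following is a minimal sketch on a local List rather than an RDD, using the same sample records that appear later in this post; the object name TupleSortDemo is only illustrative. List.sortBy resolves the same implicit tuple Ordering that RDD.sortBy would use for this key.

```scala
object TupleSortDemo {
  // Sample records in the form (name, age, fv)
  val people: List[(String, Int, Double)] = List(
    ("laoduan", 30, 99.99),
    ("nianhang", 28, 99.99),
    ("laozhao", 18, 9999.99)
  )

  // Key (-fv, age): fv descending first, then age ascending when fv is equal
  val sorted: List[(String, Int, Double)] =
    people.sortBy(x => (-x._3, x._2))

  def main(args: Array[String]): Unit =
    sorted.foreach(println)
}
```

laozhao comes first because it has the largest fv. The other two share the same fv, so the second key element decides and the younger record (nianhang, 28) precedes laoduan (30).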
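sortBy is a transformation, yet it also triggers a job (as an Action would) and produces a shuffle. Question: why does sortBy trigger an Action?
Under the hood, sortBy calls keyBy, then sortByKey, then values.
sortByKey takes two parameters: ascending (default true) and numPartitions, the number of partitions of the sorted RDD.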
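The key-then-sort-then-drop-key shape can be illustrated on a local collection. This is only a sketch of the idea, not Spark's implementation: sortByViaKeys is a hypothetical helper, and it sorts a List in memory instead of shuffling an RDD.

```scala
object SortByDecomposition {
  // Hypothetical helper: pair each element with its key, sort by the key,
  // then drop the key again (the keyBy -> sortByKey -> values shape)
  def sortByViaKeys[T, K](data: List[T])(f: T => K)(implicit ord: Ordering[K]): List[T] =
    data
      .map(x => (f(x), x)) // like keyBy: (key, element)
      .sortBy(_._1)        // like sortByKey: only the key is compared
      .map(_._2)           // like values: keep the original element

  val words: List[String] = List("pear", "fig", "banana")

  // Sort the words by their length
  val byLength: List[String] = sortByViaKeys(words)(_.length)

  def main(args: Array[String]): Unit =
    println(byLength)
}
```

Only an Ordering for the key type K is required, which is why the implicit parameter of sortBy is an Ordering[K] rather than an ordering on the elements themselves.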
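sortByKey creates a RangePartitioner.
To compute its rangeBounds, the RangePartitioner samples the data.
The bounds determine which partition each element of the RDD is placed in after the shuffle.
The number of samples to draw from each partition is determined jointly by the number of partitions after the shuffle and the number of partitions of the current RDD.
Sampling can be done with or without replacement (true or false):
with replacement, the same record can be drawn more than once; with false, a record that has been drawn is not put back.
All samples are pulled to the driver, where the bounds are computed.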
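Because the range bounds put the partitions themselves in order, once the data inside each partition has been sorted locally, the data is sorted globally as well.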
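The idea of "range bounds plus a local sort per partition gives a global sort" can be simulated in plain Scala. This is a simplified sketch under loud assumptions: bounds and partitionOf are hypothetical helpers, the whole dataset is used as the sample, and the bounds are picked at evenly spaced positions. Spark's RangePartitioner samples and weights its candidates differently.

```scala
object RangeSortSketch {
  // Pick (numPartitions - 1) upper bounds at evenly spaced positions
  // of the sorted sample
  def bounds(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val s = sample.sorted
    (1 until numPartitions).map(i => s(i * s.length / numPartitions))
  }

  // Index of the first bound that is >= x; values above every bound
  // go to the last partition
  def partitionOf(x: Int, bs: Seq[Int]): Int = {
    val i = bs.indexWhere(x <= _)
    if (i == -1) bs.length else i
  }

  val data: Seq[Int] = Seq(42, 7, 19, 88, 3, 56, 71, 25, 64, 10)

  // Assumption: the full dataset serves as the sample
  val bs: Seq[Int] = bounds(data, 3)

  // "Shuffle" each element to its range, then sort each partition locally
  val partitions: Seq[Seq[Int]] =
    (0 to bs.length).map(p => data.filter(x => partitionOf(x, bs) == p).sorted)

  // Reading the partitions in order yields a globally sorted sequence
  val result: Seq[Int] = partitions.flatten

  def main(args: Array[String]): Unit = {
    println(bs)
    partitions.foreach(println)
  }
}
```

Every element of partition 0 is no larger than any element of partition 1, and so on, so no merge step across partitions is needed after the local sorts.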
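The Action is triggered inside sketch, which calls the collect action operator and thereby launches a job.
Why does a count after sortBy involve a shuffle? Because sortBy is a wide-dependency operator: a shuffle takes place, and the upstream stage writes the RDD data to temporary files,
which the downstream stage then reads. As long as the SparkContext is not stopped, these temporary files remain available.
So when the next job is triggered, the RDD follows its dependencies, finds these temporary files, and reuses them, which has the effect of a cache.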
-----------------------------------------------------------------------------------------------------------------------------------------------------
Create the implicit Ordering and the class:
object OrderingContext {
  // Implicit Ordering supplied as an implicit object
  implicit object OrderingPerson extends Ordering[Person] {
    override def compare(x: Person, y: Person): Int = {
      if (x.fv == y.fv) {
        // Same fv: age ascending
        x.age - y.age
      } else {
        // Different fv: fv descending
        java.lang.Double.compare(y.fv, x.fv)
      }
    }
  }

  // The same rule supplied as an implicit val
  implicit val orderPerson: Ordering[Person] = new Ordering[Person] {
    override def compare(x: Person, y: Person): Int = {
      if (x.fv == y.fv) {
        x.age - y.age
      } else {
        java.lang.Double.compare(y.fv, x.fv)
      }
    }
  }
}
-----------------------------------------------------------------------------------------------------------------------------------------------------
Create the Person class:
case class Person(name: String, age: Int, fv: Double)
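The same rule (fv descending, then age ascending) can also be written more compactly with Ordering.by and a tuple key. The sketch below is an alternative formulation, not the code used in this post: the names PersonOrderingDemo and byFvDescThenAge are illustrative, the case class is repeated so the block is self-contained, and a local List stands in for the RDD.

```scala
case class Person(name: String, age: Int, fv: Double)

object PersonOrderingDemo {
  // fv descending, then age ascending, delegated to the tuple ordering
  implicit val byFvDescThenAge: Ordering[Person] =
    Ordering.by((p: Person) => (-p.fv, p.age))

  val people: List[Person] = List(
    Person("laoduan", 30, 99.99),
    Person("nianhang", 28, 99.99),
    Person("laozhao", 18, 9999.99)
  )

  // sorted picks up the implicit Ordering[Person] defined above
  val sorted: List[Person] = people.sorted

  def main(args: Array[String]): Unit =
    sorted.foreach(println)
}
```

Delegating to the tuple ordering avoids writing the if/else comparison by hand and also sidesteps the integer overflow that a subtraction such as x.age - y.age could hit with extreme values.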
-----------------------------------------------------------------------------------------------------------------------------------------------------
Write the sortBy sorting code:
package cn._51doit.spark.day08

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CustomSort3 {
  def main(args: Array[String]): Unit = {
    // The first argument selects local mode (true) or cluster mode (false)
    val isLocal = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.parallelize(List("laoduan,30,99.99", "nianhang,28,99.99", "laozhao,18,9999.99"))

    // Import the implicit Ordering[Person]; sortBy needs it in scope
    import OrderingContext.orderPerson

    val tfboy: RDD[Person] = lines.map(line => {
      val fields = line.split(",")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toDouble
      Person(name, age, fv)
    })

    // The element itself is the key, so the implicit Ordering[Person] decides the order
    val sorted: RDD[Person] = tfboy.sortBy(x => x)
    println(sorted.collect().toBuffer)
    sc.stop()
  }
}
-----------------------------------------------------------------------------------------------------------------------------------------------------
Alternatively, use the built-in tuple ordering. Tuples are compared on the first element, then the second, and so on, each time with compare. If a comparison returns a non-zero value, that value is the result; if it returns 0 (the elements are equal), the next element is compared.
val compare1 = ord1.compare(x._1, y._1)
if (compare1 != 0) return compare1
val compare2 = ord2.compare(x._2, y._2)
if (compare2 != 0) return compare2
val compare3 = ord3.compare(x._3, y._3)
if (compare3 != 0) return compare3
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CustomSort4 {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.parallelize(List("laoduan,30,99.99", "nianhang,28,99.99", "laozhao,18,9999.99"))

    val tfboy: RDD[(String, Int, Double)] = lines.map(line => {
      val fields = line.split(",")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toDouble
      (name, age, fv)
    })

    val sorted: RDD[(String, Int, Double)] = tfboy.sortBy(t => (-t._3, t._2))
    println(sorted.collect().toBuffer)
    sc.stop()
  }
}
How sortBy sorts and how it triggers an Action