How sortBy Sorts and How It Triggers an Action

茂密头发的源猴

Tags: spark

Article link: https://blog.csdn.net/weixin_48109576/article/details/108059334


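By default, sortBy sorts in ascending order: its second parameter, ascending, defaults to true.
The first parameter is a function that takes an element of type T (the RDD's element type) and returns the key to sort by. The RDD returned by sortBy has the same element type as the original RDD.
The third parameter, numPartitions, sets the number of partitions of the sorted RDD. By default it equals the number of partitions before sorting, i.e. this.partitions.size.
sortBy also takes an implicit Ordering for the key type, so you can either import a custom implicit Ordering or use a tuple as the key and rely on the built-in tuple ordering:
    val value1: RDD[(String, Int, Double)] = value.sortBy(x => (-x._3, x._2))
Here the key is a Tuple2. Tuples are ordered component by component: the first components are compared first, and the second components are compared only if the first are equal. With this key, records are compared by -x._3 first.
Negating the value gives descending order on x._3; ties are broken by x._2 in ascending order.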
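Source of the tuple ordering:
    def compare(x: (T1, T2), y: (T1, T2)): Int = {
      val compare1 = ord1.compare(x._1, y._1)
      if (compare1 != 0) return compare1
      val compare2 = ord2.compare(x._2, y._2)
      if (compare2 != 0) return compare2
      0
    }
The first components are compared first. A positive result means x._1 > y._1 and a negative result means x._1 < y._1; only when the result is 0 does the comparison move on to the second components.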
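The tuple rule above can be checked without a cluster, because Scala collections resolve the same implicit tuple Ordering that RDD.sortBy uses. The following is a minimal sketch under that assumption; the object name TupleOrderingDemo and the helper sortRows are made up for illustration, and the rows mirror the sample data used later in this article.

```scala
object TupleOrderingDemo {
  // Sample rows in the form (name, age, fv).
  val rows: List[(String, Int, Double)] = List(
    ("laoduan", 30, 99.99),
    ("nianhang", 28, 99.99),
    ("laozhao", 18, 9999.99)
  )

  // The key is a Tuple2: negated fv first (descending fv), then age (ascending).
  def sortRows(in: List[(String, Int, Double)]): List[(String, Int, Double)] =
    in.sortBy(x => (-x._3, x._2))

  def main(args: Array[String]): Unit = {
    // laozhao has the largest fv, so it comes first; the two rows with
    // fv = 99.99 tie on the first component and are ordered by age.
    sortRows(rows).foreach(println)

    // The implicit Ordering[(Double, Int)] falls through to the second
    // component only when the first components are equal.
    val ord = implicitly[Ordering[(Double, Int)]]
    println(ord.compare((-99.99, 30), (-99.99, 28)) > 0)
  }
}
```

On an RDD the call is the same expression, value.sortBy(x => (-x._3, x._2)); only the container differs.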


sortBy is a transformation, yet calling it also triggers a job the way an action does, and it produces a shuffle.

Q: Why does sortBy trigger an action?

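Internally, sortBy calls keyBy, then sortByKey, then values.
sortByKey takes two parameters: ascending (true by default) and the number of partitions of the sorted RDD.
sortByKey creates a RangePartitioner.
The rangeBounds of the RangePartitioner are computed by sampling the RDD.
These bounds determine which partition each element of the RDD is placed in after the shuffle.
The number of samples to draw from each partition is determined jointly by the number of partitions after the shuffle and the number of partitions of the current RDD.
Sampling can be done with replacement (true) or without replacement (false).
With replacement, the same element can be drawn more than once; without replacement, an element that has been drawn is not put back.
All samples are collected to the driver, where the range bounds are computed.
Because the bounds give each partition a contiguous range of keys, once the data inside every partition has been sorted locally, the data is sorted globally as well.
The job itself is triggered inside sketch, which calls the collect action on the sampled data.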
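The idea that range bounds plus per-partition sorting yield a global sort can be illustrated in plain Scala. This is a simplified sketch, not Spark's RangePartitioner: the names RangeSortSketch, rangeBounds, partitionOf, and rangeSort are hypothetical, the sample is a hand-picked list standing in for what would be collected to the driver, and the real implementation samples and searches the bounds differently.

```scala
object RangeSortSketch {
  // Pick up to (numPartitions - 1) upper bounds from the sorted sample.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions)
      .map(i => sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions)))
      .distinct
      .toVector
  }

  // A key goes to the first partition whose upper bound is >= the key;
  // keys above every bound go to the last partition.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(b => key <= b)
    if (idx == -1) bounds.length else idx
  }

  // Simulated shuffle: route keys by range, sort each partition locally,
  // then read the partitions in order.
  def rangeSort(data: Seq[Int], sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val bounds = rangeBounds(sample, numPartitions)
    val parts = data.groupBy(k => partitionOf(k, bounds))
    (0 to bounds.length).flatMap(p => parts.getOrElse(p, Seq.empty[Int]).sorted)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(42, 7, 19, 88, 3, 56, 71, 23, 64, 11)
    val sample = Seq(7, 56, 23, 88) // stands in for the samples sent to the driver
    println(rangeBounds(sample, 3))
    println(rangeSort(data, sample, 3))
  }
}
```

The bounds are only as good as the sample: a poor sample still gives a correct global order, but the partitions end up unevenly sized.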

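Why calling count after sortBy triggers a shuffle, and why later jobs can reuse it:

sortBy is a wide-dependency operator, so a shuffle takes place: the upstream stage writes the RDD's data to temporary shuffle files,
and the downstream stage then reads them. As long as the SparkContext is not stopped, these temporary files stay available,
so when the next job is triggered, the RDD follows its dependencies back to these files and reads them instead of recomputing the upstream stage. In effect they act as a cache.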
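A small experiment makes this behavior visible. The sketch below assumes Spark is on the classpath and runs in local mode; the object name ShuffleReuseSketch and the helper countTwice are made up for illustration. While it runs, the Spark UI should list one job for the sampling done by sortBy, and for the second count the upstream stage is normally reported as skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleReuseSketch {
  // Runs count twice on the same sorted RDD and returns both results.
  def countTwice(sc: SparkContext): (Long, Long) = {
    val nums = sc.parallelize(Seq(5, 3, 9, 1, 7), numSlices = 2)
    // Building the RangePartitioner samples the data, which runs a job here.
    val sorted = nums.sortBy(x => x)
    // First count: the upstream stage writes shuffle files, the downstream stage reads them.
    val first = sorted.count()
    // Second count: the existing shuffle files are found through the RDD's dependencies.
    val second = sorted.count()
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ShuffleReuseSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countTwice(sc))
    finally sc.stop()
  }
}
```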

-----------------------------------------------------------------------------------------------------------------------------------------------------

Create the implicit Ordering and the Person class

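object OrderingContext {

  // Option 1: an implicit object. Orders by fv descending, then by age ascending.
  implicit object OrderingPerson extends Ordering[Person] {

    override def compare(x: Person, y: Person): Int = {
      if (x.fv == y.fv) {
        // Integer.compare avoids the overflow that x.age - y.age can hit
        java.lang.Integer.compare(x.age, y.age)
      } else {
        // Arguments are swapped to get descending order on fv
        java.lang.Double.compare(y.fv, x.fv)
      }
    }
  }

  // Option 2: an implicit val with the same rule; import whichever one you use.
  implicit val orderPerson: Ordering[Person] = new Ordering[Person] {

    override def compare(x: Person, y: Person): Int = {
      if (x.fv == y.fv) {
        java.lang.Integer.compare(x.age, y.age)
      } else {
        java.lang.Double.compare(y.fv, x.fv)
      }
    }
  }
}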

-----------------------------------------------------------------------------------------------------------------------------------------------------

Create the Person class:

case class Person(name: String, age: Int, fv: Double)

-----------------------------------------------------------------------------------------------------------------------------------------------------

Sorting code using sortBy:

package cn._51doit.spark.day08

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CustomSort3 {

  def main(args: Array[String]): Unit = {

    // args(0) decides whether to run in local mode
    val isLocal = args(0).toBoolean

    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName)

    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)

    val lines: RDD[String] = sc.parallelize(List("laoduan,30,99.99", "nianhang,28,99.99", "laozhao,18,9999.99"))

    // Import the implicit Ordering[Person] so that sortBy can find it
    import OrderingContext.orderPerson

    // Parse each line into a Person
    val tfboy: RDD[Person] = lines.map(line => {
      val fields = line.split(",")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toDouble
      Person(name, age, fv)
    })

    // The element itself is the key; the imported Ordering decides the order
    val sorted: RDD[Person] = tfboy.sortBy(x => x)

    println(sorted.collect().toBuffer)

    sc.stop()

  }

}
-----------------------------------------------------------------------------------------------------------------------------------------------------

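Alternatively, use the tuple ordering. Tuples are compared on the first component first and then on the second, using compare at each step: if the result is not 0 it is returned immediately, and if it is 0 (the two components are equal) the next component is compared.
    val compare1 = ord1.compare(x._1, y._1)
    if (compare1 != 0) return compare1
    val compare2 = ord2.compare(x._2, y._2)
    if (compare2 != 0) return compare2
    val compare3 = ord3.compare(x._3, y._3)
    if (compare3 != 0) return compare3
The full program using the tuple ordering: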
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CustomSort4 {

  def main(args: Array[String]): Unit = {

    val isLocal = args(0).toBoolean

    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName)

    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)

    val lines: RDD[String] = sc.parallelize(List("laoduan,30,99.99", "nianhang,28,99.99", "laozhao,18,9999.99"))

    val tfboy: RDD[(String, Int, Double)] = lines.map(line => {
      val fields = line.split(",")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toDouble
      (name, age, fv)
    })

    // fv descending (negated), then age ascending
    val sorted: RDD[(String, Int, Double)] = tfboy.sortBy(t => (-t._3, t._2))

    println(sorted.collect().toBuffer)

    sc.stop()

  }

}
