Custom Sorting
Use the RDD's sortBy or sortByKey method to sort, implementing whatever ordering rule the requirements call for. The given data has three fields: name, age, and face value (fv). Sort by fv in descending order; when fv is equal, sort by age in ascending order. If you specify no ordering rule and sort the Bean directly with sortBy(x => x), the code will not compile (there is no implicit Ordering for the Bean).
Data
name,age,fv
taoge,30,99.99
xingge,18,9999.99
hangge,29,99.99
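The intro above mentions sortByKey alongside sortBy. As a minimal sketch of the sortByKey route ("data/boys.txt" is a hypothetical path, and sc is assumed to be an existing SparkContext): sortByKey only works on a pair RDD, so the sort key has to be extracted first.

// minimal sortByKey sketch; hypothetical path, assumes an existing SparkContext `sc`
val keyed = sc.textFile("data/boys.txt").map(l => {
  val sp = l.split(",")
  // key = (-fv, age): negating fv gives fv descending, then age ascending
  ((-sp(2).toDouble, sp(1).toInt), l)
})
val byKey = keyed.sortByKey() // ascending by the (Double, Int) key
println(byKey.values.collect().toBuffer)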
Incorrect example
class BoyBean(val name: String, var age: Int, val fv: Double)
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SortDemo")
    val setMaster = args(0).toBoolean
    if (setMaster) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      val age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    // Does not compile: no implicit Ordering[BoyBean] is in scope,
    // so sortBy has no way to compare the Beans
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}
Sorting approach 1:
Have the Bean implement the Comparable or Ordered interface (Ordered itself extends Comparable under the hood) and override the comparison method. Note that both interfaces take a generic type parameter. Also, because the Bean wraps data that will later travel over the network and be written to disk, it must implement the Serializable marker interface (this defaults to Java serialization, which is general-purpose but produces larger output and is slower than alternatives such as Kryo).
package com.doit.spark.day08

class BoyBean(val name: String, var age: Int, val fv: Double) extends Comparable[BoyBean] with Serializable {
  override def compareTo(o: BoyBean): Int = {
    if (this.fv == o.fv) {
      this.age - o.age                          // equal fv: ascending by age
    } else {
      java.lang.Double.compare(o.fv, this.fv)   // descending by fv
    }
  }

  override def toString: String = s"Boy($name, $age, $fv)"
}
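The text above also mentions Ordered; here is a minimal sketch of the same Bean written against Scala's Ordered trait (a hypothetical alternative class, not part of the original code):

package com.doit.spark.day08

// hypothetical alternative: scala.math.Ordered instead of java.lang.Comparable
class OrderedBoy(val name: String, var age: Int, val fv: Double)
  extends Ordered[OrderedBoy] with Serializable {

  override def compare(that: OrderedBoy): Int = {
    if (this.fv == that.fv) this.age - that.age     // equal fv: ascending by age
    else java.lang.Double.compare(that.fv, this.fv) // descending by fv
  }

  override def toString: String = s"Boy($name, $age, $fv)"
}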
We can also switch to a different serializer and register the Bean with it to shrink its serialized size: Spark ships with Kryo, which is more compact and typically also faster than Java's default serialization (the trade-off is that custom classes should be registered with it up front).
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SortDemo")
    // use Kryo serialization: more compact and typically faster than Java serialization
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register BoyBean with Kryo
    conf.registerKryoClasses(Array(classOf[BoyBean]))
    val setMaster = args(0).toBoolean
    if (setMaster) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      val age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    // BoyBean implements Comparable, so the identity key is enough
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}
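With the sample data above, the run should print the Beans sorted by fv descending and, for the two equal fv values, by age ascending; roughly:

ArrayBuffer(Boy(xingge, 18, 9999.99), Boy(hangge, 29, 99.99), Boy(taoge, 30, 99.99))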
If you don't want to implement Serializable and toString by hand, define the Bean as a case class: case classes are Serializable by default and come with a generated toString, plus accessors for every field (and setters for var fields).
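As a quick, hypothetical illustration of what the compiler generates for a case class:

// hypothetical snippet: the compiler derives Serializable, toString and field accessors
case class Demo(name: String, age: Int, fv: Double)
val d = Demo("taoge", 30, 99.99)
println(d)    // Demo(taoge,30,99.99) -- generated toString
println(d.fv) // 99.99               -- generated accessor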
Sorting approach 2:
Define the ordering rule as an implicit Ordering, and import it before calling the RDD's sortBy method.
package com.doit.spark.day08

case class BoyBean(name: String, var age: Int, fv: Double) {
  override def toString: String = s"Boy($name, $age, $fv)"
}

object BoySortBy {
  // a custom implicit Ordering object, BoyOrdering
  implicit object BoyOrdering extends Ordering[BoyBean] {
    override def compare(x: BoyBean, y: BoyBean): Int = {
      if (x.fv == y.fv) {
        x.age - y.age                          // equal fv: ascending by age
      } else {
        java.lang.Double.compare(y.fv, x.fv)   // descending by fv
      }
    }
  }
}
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SortDemo")
    // use Kryo serialization: more compact and typically faster than Java serialization
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register BoyBean with Kryo
    conf.registerKryoClasses(Array(classOf[BoyBean]))
    val setMaster = args(0).toBoolean
    if (setMaster) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      val age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    // import the implicit Ordering so sortBy can compare BoyBeans
    import BoySortBy.BoyOrdering
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}
Sorting approach 3:
When the data lives in a tuple or a Bean, you can specify the sort key directly in the sortBy operator:
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SortDemo")
    // use Kryo serialization: more compact and typically faster than Java serialization
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register BoyBean with Kryo
    conf.registerKryoClasses(Array(classOf[BoyBean]))
    val setMaster = args(0).toBoolean
    if (setMaster) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      val age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    // sort key specified inline: fv descending (negated), then age ascending;
    // tuples work the same way (see the sketch below)
    val sorted: RDD[BoyBean] = maped.sortBy(x => (-x.fv, x.age))
    println(sorted.collect().toBuffer)
  }
}
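As the comment notes, tuples work the same way; here is a minimal sketch with the Bean removed entirely (assuming an existing SparkContext sc):

// hypothetical tuple variant: (name, age, fv) triples, no Bean needed
val tuples = sc.parallelize(Seq(("taoge", 30, 99.99), ("xingge", 18, 9999.99), ("hangge", 29, 99.99)))
// same composite key: fv descending (negated), then age ascending
val sortedTuples = tuples.sortBy(t => (-t._3, t._2))
println(sortedTuples.collect().toBuffer)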
Sorting approach 4:
Define the custom-sort implicit value right above the sortBy call:
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SortDemo")
    // use Kryo serialization: more compact and typically faster than Java serialization
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register BoyBean with Kryo
    conf.registerKryoClasses(Array(classOf[BoyBean]))
    val setMaster = args(0).toBoolean
    if (setMaster) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      val age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    // implicit Ordering built on the fly: compare Beans via the (Double, Int) key (-fv, age)
    implicit val ord: Ordering[BoyBean] = Ordering[(Double, Int)].on[BoyBean](x => (-x.fv, x.age))
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}
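One caveat on the negation trick: negating a Double key works fine for values like fv, but for an Int key it overflows at Int.MinValue. Reversing the Ordering itself avoids sign flips altogether; a hypothetical sketch (assuming Scala 2.12-style Ordering objects):

// hypothetical alternative: reverse the fv Ordering rather than negating the value
implicit val ord: Ordering[BoyBean] =
  Ordering.Tuple2(Ordering.Double.reverse, Ordering.Int).on[BoyBean](x => (x.fv, x.age))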